linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mauricio Faria de Oliveira <mfo@canonical.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, "Theodore Y. Ts'o" <tytso@mit.edu>,
	dann frazier <dann.frazier@canonical.com>,
	Andreas Dilger <adilger@dilger.ca>, Jan Kara <jack@suse.com>
Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache
Date: Wed, 10 Jun 2020 12:15:50 -0300	[thread overview]
Message-ID: <CAO9xwp15iXOZZcYdw2vooAijKMps6JDMZncnf+GXj7B29VFqeA@mail.gmail.com> (raw)
In-Reply-To: <20200610132139.GG12551@quack2.suse.cz>

Hi Jan,

On Wed, Jun 10, 2020 at 10:21 AM Jan Kara <jack@suse.cz> wrote:
>
> Hello!
>
> Firstly, thanks for the patches and I'm sorry that it took me so long to
> get to this.
>

No worries at all. Thank you very much for reviewing, and explaining your
understanding of the problem and why this patchset doesn't fully address it.

I believe I understood the rationale and the pieces involved in the solution
you suggested (thanks for the level of detail), and should be able to work
on it and send something within a few weeks.

cheers,
Mauricio

> On Thu 23-04-20 20:36:54, Mauricio Faria de Oliveira wrote:
> > Ted Ts'o explains, in the linux-ext4 thread [1], that currently
> > data=journal + journal_checksum + mmap is not handled correctly,
> > and outlines the changes to fix it (+ items/breaks for clarity):
> >
> >     """
> >     The fix is going to have to involve
> >     - fixing __ext4_journalled_writepage() to call set_page_writeback()
> >       before it unlocks the page,
> >     - adding a list of pages under data=journalled writeback
> >       which is attached to the transaction handle,
> >     - have the jbd2 commit hook call end_page_writeback()
> >       on all of these pages,
> >     - and then in the places where ext4 calls wait_for_stable_page()
> >       or grab_cache_page_write_begin(), we need to add:
> >
> >       if (ext4_should_journal_data(inode))
> >               wait_on_page_writeback(page);
> >
> >     It's all relatively straightforward except for the part
> >     where we have to attach a list of pages to the currently
> >     running transaction.  That will require adding some
> >     plumbing into the jbd2 layer.
> >     """
>
> So I was pondering about this general design for some time. First let me
> state here how I see the problem you are hitting in data=journal mode:
>
> The journalling machinery expects the buffer contents cannot change from
> the moment transaction commit starts (more exactly from the moment
> transaction exists LOCKED state) until the moment transaction commit
> finishes. Places like jbd2_journal_get_write_access() are there exactly to
> ascertain this - they copy the "to be committed" contents into a bounce
> buffer (jh->b_frozen_data) or wait for commit to finish (if it's too late
> for copying and the data is already in flight). data=journal mode breaks
> this assumption because although ext4_page_mkwrite() calls
> do_journal_get_write_access() for each page buffer (and thus makes sure
> there's no commit with these buffers running at that moment), the commit
> can start the moment we call ext4_journal_stop() in ext4_page_mkwrite() and
> then this commit runs while buffers in the committing transaction are
> writeably mapped to userspace.
>
> However I don't think Ted's solution actually fully solves the above
> problem. Think for example about the following scenario:
>
> fd = open('file');
> addr = mmap(fd);
> addr[0] = 'a';
>         -> caused page fault, ext4_page_mkwrite() is called, page is
>            dirtied, all buffers are added to running transaction
> pwrite(fd, buf, 1, 1);
>         -> page dirtied again, all buffer are dirtied in the running
>            transaction
>
>                                         jbd2 thread commits transaction now
> -> the problem with committing buffers that are writeably mapped is still
> there and your patches wouldn't solve it because
> ext4_journalled_writepage() didn't get called at all.
>
> Also, as you found out, leaving pages under writeback while we didn't even
> start transaction commit that will end writeback on them is going to cause
> odd stalls in various unexpected places.
>
> So I'd suggest a somewhat different solution. Instead of trying to mark,
> when page is under writeback, do it the other way around and make jbd2 kick
> ext4 on transaction commit to writeprotect journalled pages. Then, because
> of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
> cannot be writeably mapped into page tables until transaction commit
> finishes (or we'll copy data to commit to b_frozen_data).
>
> Now let me explain a bit more the "make jbd2 kick ext4 on transaction
> commit to writeprotect journalled pages" part. I think we could mostly
> reuse the data=ordered machinery for that. data=ordered machinery tracks
> with each running transaction also a list of inodes for which we need to
> flush data pages before doing commit of metadata into the journal. Now we
> could use the same mechanism for data=journal mode - we'd track in the
> inode list inodes with writeably mapped pages and on transaction commit we
> need to writeprotect these pages using clear_page_dirty_for_io(). Probably
> the best place to add inode to this list is ext4_journalled_writepage().
> To do the commit handling we probably need to introduce callbacks that jbd2
> will call instead of journal_submit_inode_data_buffers() in
> journal_submit_data_buffers() and instead of
> filemap_fdatawait_range_keep_errors() in
> journal_finish_inode_data_buffers(). In data=ordered mode ext4 will just do
> what jbd2 does in its callback, when inode's data should be journalled, the
> callback will use write_cache_pages() to iterate and writeprotect all dirty
> pages. The writepage callback just returns AOP_WRITEPAGE_ACTIVATE, some
> care needs to be taken to redirty a page in the writepage callback if not
> all its underlying buffers are part of the committing transaction or if
> some buffer already has b_next_transaction set (since in that case the page
> was already dirtied also against the following transaction). But it should
> all be reasonably doable.
>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



-- 
Mauricio Faria de Oliveira

      reply	other threads:[~2020-06-10 15:16 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-23 23:36 [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 01/11] ext4: data=journal: introduce struct/kmem_cache ext4_journalled_wb_page/_cachep Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 02/11] ext4: data=journal: handle page writeback in __ext4_journalled_writepage() Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 03/11] ext4: data=journal: call ext4_force_commit() in ext4_writepages() for msync() Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 04/11] ext4: data=journal: introduce helpers for journalled writeback deadlock Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 05/11] ext4: data=journal: prevent journalled writeback deadlock in __ext4_journalled_writepage() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 06/11] ext4: data=journal: prevent journalled writeback deadlock in ext4_write_begin() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 07/11] ext4: grab page before starting transaction handle in ext4_convert_inline_data_to_extent() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 08/11] ext4: data=journal: prevent journalled writeback deadlock " Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 09/11] ext4: grab page before starting transaction handle in ext4_try_to_write_inline_data() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 10/11] ext4: deduplicate code with error legs " Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 11/11] ext4: data=journal: prevent journalled writeback deadlock " Mauricio Faria de Oliveira
2020-05-15 18:39 ` [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache Mauricio Faria de Oliveira
2020-05-17  7:40   ` Andreas Dilger
2020-06-10 13:21 ` Jan Kara
2020-06-10 15:15   ` Mauricio Faria de Oliveira [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAO9xwp15iXOZZcYdw2vooAijKMps6JDMZncnf+GXj7B29VFqeA@mail.gmail.com \
    --to=mfo@canonical.com \
    --cc=adilger@dilger.ca \
    --cc=dann.frazier@canonical.com \
    --cc=jack@suse.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).