From: Mauricio Faria de Oliveira <mfo@canonical.com>
To: Jan Kara <jack@suse.cz>
Cc: Ted Tso <tytso@mit.edu>, linux-ext4@vger.kernel.org
Subject: Re: [PATCH] ext4: Fix mmap write protection for data=journal mode
Date: Fri, 20 Nov 2020 12:27:56 -0300	[thread overview]
Message-ID: <CAO9xwp3sSjzy9W8pMjV6vYitfZ9BmZE-9bLwcLg1uz3CFBHUcQ@mail.gmail.com> (raw)
In-Reply-To: <CAO9xwp0AtCLG77g6fWgu9un9XPD3d5U6ZtjWc3FRJrB8NK44SQ@mail.gmail.com>

Hey Jan,

On Tue, Oct 27, 2020 at 1:10 PM Mauricio Faria de Oliveira
<mfo@canonical.com> wrote:
> On Tue, Oct 27, 2020 at 10:27 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Commit afb585a97f81 "ext4: data=journal: write-protect pages on
> > j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write()
> > to track inode ranges whose mappings need to get write-protected during
> > transaction commits. However the added calls use wrong start of a range
> > (0 instead of page offset) and so write protection is not necessarily
> > effective. Use correct range start to fix the problem.
> >
> > Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/inode.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
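
Just to double-check my reading of the change: IIUC the
ext4_jbd2_inode_add_write() calls added by afb585a97f81 now pass the page's
offset in the file as the range start instead of 0, i.e. roughly (a sketch,
not the exact hunks; 'len' stands for whatever length each call site already
uses):

	/* before: range start is always 0, so the tracked range covers the
	 * start of the file rather than the page that was actually dirtied,
	 * and the commit-time write-protect can miss that page */
	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);

	/* after: range start is the dirtied page's offset in the file */
	err = ext4_jbd2_inode_add_write(handle, inode, page_offset(page), len);
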
> >
> > Mauricio, I think this could be the reason for occasional test failures you
> > were still seeing. Can you try whether this patch fixes those for you? Thanks!
> >
>
> Thanks! Nice catch. Sure, I'll give it a try and follow up.

TL;DR:

1) Thanks! The patch fixed almost 100% of the checksum failures.
2) I can send a debug patch to verify buffer checksums before write-out.
3) The remaining, rare checksum failures seem to be due to a race
   between the commit's write-protect and page writeback, related to
   PageChecked(): writeback can clear the pagecache dirty tag that the
   write-protect relies on.
4) Test statistics confirm that checksum failures are now really rare.

...

Sorry for the delay in following up on this.

Thanks for the patch! The results with v5.10-rc4 are almost 100%:

Journal checksum failures are now detected only _very rarely_, with
_zero_ recovery failures across 1187 stress-ng/crash/reboot/mount loops
overnight.
(Previously I'd get 3-5 out of 10.)

I plan to send the debug patch used to verify the buffer checksum against the
tag before write-out (it catches, ahead of time, the checksum failures that
would make recovery fail), if you think it might be useful. I thought of
putting it under CONFIG_JBD2_DEBUG.
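
The core of it is roughly the following (a simplified sketch from memory, not
the exact patch; it just redoes the same calculation as
jbd2_block_tag_csum_set() right before the buffer is written out, assumes
csum v3 tags, and the helper name is made up):

	#include <linux/jbd2.h>
	#include <linux/buffer_head.h>

	/*
	 * Recompute the block tag checksum just before the journal buffer is
	 * submitted for write-out and compare it with the checksum stored in
	 * the descriptor tag when the block was checksummed.  A mismatch
	 * means the page changed after the commit checksummed it, i.e. the
	 * write-protection didn't hold.  Meant to be called from the commit
	 * code, next to where the buffers are submitted.
	 */
	static void jbd2_dbg_recheck_tag_csum(journal_t *j,
					      journal_block_tag3_t *tag3,
					      struct buffer_head *bh,
					      __u32 sequence)
	{
		__be32 seq = cpu_to_be32(sequence);
		__u32 csum32;

		if (!jbd2_journal_has_csum_v2or3(j))
			return;

		csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
		csum32 = jbd2_chksum(j, csum32, bh->b_data, bh->b_size);

		if (csum32 != be32_to_cpu(tag3->t_checksum))
			pr_warn("TESTCASE: checksum changed (tag3: 0x%x, calc: 0x%x)\n",
				be32_to_cpu(tag3->t_checksum), csum32);
	}

That's what produces the "TESTCASE: checksum changed" lines further down.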

...

The remaining checksum changes due to write-protect failures _seem_ to be
caused by a race between our write-protect via write_cache_pages() and
writeback/sync. But I'm not entirely sure, as it's been hard to pin down the
timing/steps of the two threads.
The idea is:

Our write_cache_pages() call during commit, in
ext4_journalled_submit_inode_data_buffers(), depends on PAGECACHE_TAG_DIRTY
being set so that pagevec_lookup_range_tag() puts those pages in the array of
pages to be write-protected with clear_page_dirty_for_io().
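
In other words, something along these lines (a much-simplified illustration
of the dependency, not the actual ext4/jbd2 code; the helper name is made up,
and the real code goes through write_cache_pages() with a callback):

	#include <linux/pagevec.h>
	#include <linux/pagemap.h>
	#include <linux/mm.h>

	/*
	 * Illustration: the commit-time write-protect can only reach pages
	 * that still carry the pagecache dirty tag; clear_page_dirty_for_io()
	 * is what actually write-protects the mmap'ed PTEs (via
	 * page_mkclean()).
	 */
	static void writeprotect_dirty_range(struct address_space *mapping,
					     pgoff_t start, pgoff_t end)
	{
		struct pagevec pvec;
		pgoff_t index = start;
		unsigned int i, nr;

		pagevec_init(&pvec);
		while (index <= end &&
		       (nr = pagevec_lookup_range_tag(&pvec, mapping, &index,
						      end, PAGECACHE_TAG_DIRTY))) {
			for (i = 0; i < nr; i++) {
				struct page *page = pvec.pages[i];

				lock_page(page);
				/*
				 * If the dirty tag was cleared before we got
				 * here (e.g. because writeback started on the
				 * page), the lookup above never returns the
				 * page and it is not write-protected at all.
				 */
				clear_page_dirty_for_io(page);
				unlock_page(page);
			}
			pagevec_release(&pvec);
		}
	}
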

With a debug patch that dump_stack()s callers of __test_set_page_writeback()
(which can xas_clear_mark() that tag) _while_ the page is still being handled
by our call to write_cache_pages(), we see this path:
wb_workfn() -> ext4_writepage() -> ext4_bio_write_page(),

i.e., _not_ going through ext4_journalled_writepage(), which knows about and
waits on the journal. The other leg of ext4_writepage() _doesn't_, and thus
can clear the tag and prevent the write-protect while the journal commit /
our write_cache_pages() is running.

Since the switch between the two legs is PageChecked(), I guess this might be
due to clearing it right away in ext4_journalled_writepage(), before the
write-protection is done.

If that is the issue, perhaps the clearing should instead be done in our
writepage callback or finish callback, but I haven't thought much about it
yet, pending confirmation that this is actually the root cause.
(Though it seems there would still be a problem if that bit is set again
while we process and clear it.)
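
For reference, the branch I mean is essentially this (paraphrased from
ext4_writepage() and heavily simplified into a bool just to show the
decision; the function name is made up, and it assumes the fs/ext4 private
headers for ext4_should_journal_data()):

	/* Which leg of ext4_writepage() a data=journal page takes: */
	static bool page_takes_journalled_leg(struct page *page,
					      struct inode *inode)
	{
		/*
		 * The journalled leg is only taken while PageChecked() is
		 * set.  If that flag gets cleared right away in the
		 * journalled writepage path, before the commit's
		 * write_cache_pages() has write-protected the page, a
		 * concurrent writeback can route the page through the
		 * ext4_bio_write_page() leg, which doesn't coordinate with
		 * the running commit.
		 */
		return PageChecked(page) && ext4_should_journal_data(inode);
	}
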

...

And now just some details on the testing methodology, for documentation purposes.

The reproducer for the checksum errors is 'stress-ng --mmap $((4 * $(nproc)))
--mmap-file' on data=journal.
There's a test/wait/crash/reboot script that runs it and then triggers a crash
after a random delay ($RANDOM, capped at 120 seconds).

This used to _often_ hit a journal recovery failure during mount on boot,
which prevents mounting/booting.
Now that the checksum errors occur much less often, there have been _zero_
recovery failures.

With the debug patch to check for and report checksum changes, the kernel logs
now show such occurrences along with the (re)boot markers, e.g.:

[    0.000000] Linux version 5.10.0-rc4.rechecksum ...
[   80.534910] TESTCASE: checksum changed (tag3: 0x2d57bbdf, calc: 0xe0b4ddcb)
[    0.000000] Linux version 5.10.0-rc4.rechecksum ...

So it's relatively easy to convert those into numbers/statistics with
shell glue.

There were a total of 1187 reboots (i.e., stress-ng runs); 1131 had no
checksum errors, and only 56 did.

The average number of reboots/runs between detected checksum errors is 20
(min 1, max 140, mean 16).
The number of checksum errors per reboot/run _that has errors_ is usually 1
(95% of them), with a max of 2 (5%).

So this confirms it's really rare now, to the point that it's even less likely
to happen right at the moment the crash is triggered, and thus jbd2 recovery
can finish correctly:

[    1.492521] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
[    1.493906] EXT4-fs (vda1): write access will be enabled during recovery
[    1.826665] EXT4-fs (vda1): orphan cleanup on readonly fs
[    1.832006] EXT4-fs (vda1): 16 orphan inodes deleted
[    1.833050] EXT4-fs (vda1): recovery complete
[    1.837041] EXT4-fs (vda1): mounted filesystem with journalled data mode. Opts: (null)

cheers,




--
Mauricio Faria de Oliveira
