From: Boris Burkov <boris@bur.io>
To: Chris Murphy <chris@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
Date: Wed, 15 Feb 2023 17:46:48 -0800
Message-ID: <Y+2LCFrD4Qxff89Y@zen>
In-Reply-To: <Y+16BVPQiwf8e1J3@zen>

On Wed, Feb 15, 2023 at 04:34:13PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > >
> > >
> > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > >
> > > > I speculated that maybe it's similar to the issue we have with VM's
> > > > when O_DIRECT is used, but it seems that's not the case here.
> > >
> > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > >
> > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > >
> >
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> >
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> >
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> >
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
> >
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
>
> I believe the issue is indeed caused by faults reading the mapped region
> during direct io. Roughly what is happening is:
>
> - we start the dio write (offset 8192 len 1826816)
> - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
> creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
> bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
> resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
> the full ordered extent, but the write that finished is partial, so
> the check for entry->bytes_left fails, and we don't call
> finish_ordered_fn and thus don't create a file extent item for this
> endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
> (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
> creates a new ordered extent for off 65536 len 1769472 and that
> ordered extent proceeds as above but successfully, and we get the
> second file extent.
>
> I'm not yet sure how to fix this, but have a couple ideas/questions:
> 1. Is there any way we can split off a partial ordered extent and finish
> it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
> overlaps with our new one on the second write of the partial write
> logic?
>
> I'll play around and see if I can hack together a fix..

The following patch causes the problem to stop reproducing by splitting
the large ordered extent in the case of a short write and leaving it
alone otherwise. I haven't thoroughly tested it, or even thought it
through that well yet (e.g. I have no clue where that extract function
comes from!), but it's a start. I have to sign off for the evening, so
I will leave my investigation here for now.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index d8b90f95b157..016b1a77af71 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -647,6 +647,11 @@ static bool btrfs_submit_chunk(struct bio *bio, int mirror_num)
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+
+		ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+		if (ret)
+			goto fail_put_bio;
+
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
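For illustration only, the bytes_left accounting described in the quoted list can be modelled in userspace. This is a minimal sketch, not kernel code: `struct toy_ordered_extent` and every `toy_*` helper are hypothetical names invented here, and the model assumes an ordered extent "finishes" (and would get its file extent item) only once bytes_left reaches zero. It uses the offsets from the trace above: a write at offset 8192 of length 1826816 that is cut short after 57344 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a btrfs ordered extent. */
struct toy_ordered_extent {
	uint64_t file_offset; /* start of the file range covered */
	uint64_t num_bytes;   /* length of the file range covered */
	uint64_t bytes_left;  /* bytes whose IO has not completed yet */
	bool finished;        /* set when a file extent item would be created */
};

static struct toy_ordered_extent toy_oe_create(uint64_t off, uint64_t len)
{
	struct toy_ordered_extent oe = { off, len, len, false };
	return oe;
}

/* Account a completed IO; the extent finishes only when nothing is left. */
static void toy_oe_io_done(struct toy_ordered_extent *oe, uint64_t len)
{
	assert(len <= oe->bytes_left);
	oe->bytes_left -= len;
	if (oe->bytes_left == 0)
		oe->finished = true;
}

/* Split the first len bytes into their own extent; oe keeps the remainder. */
static struct toy_ordered_extent
toy_oe_split_front(struct toy_ordered_extent *oe, uint64_t len)
{
	assert(len <= oe->num_bytes && oe->bytes_left == oe->num_bytes);
	struct toy_ordered_extent front = toy_oe_create(oe->file_offset, len);

	oe->file_offset += len;
	oe->num_bytes -= len;
	oe->bytes_left -= len;
	return front;
}

/*
 * Model the short write from the trace. Returns true if the 57344 bytes that
 * actually hit the disk end up covered by a finished extent. *remaining is
 * set to whatever is still outstanding afterwards.
 */
static bool toy_short_write(bool split_on_submit,
			    struct toy_ordered_extent *remaining)
{
	const uint64_t partial = 57344;
	struct toy_ordered_extent oe = toy_oe_create(8192, 1826816);
	bool covered;

	if (split_on_submit) {
		/* Assumed behaviour of the fix: match the extent to the bio. */
		struct toy_ordered_extent front = toy_oe_split_front(&oe, partial);

		toy_oe_io_done(&front, partial);
		covered = front.finished;
	} else {
		/* Unpatched behaviour: endio hits the full-sized extent. */
		toy_oe_io_done(&oe, partial);
		covered = oe.finished;
	}
	*remaining = oe;
	return covered;
}
```

Calling `toy_short_write(false, &rest)` leaves the written range without a finished extent, while `toy_short_write(true, &rest)` finishes it and leaves a remainder starting at 65536 (8192 + 57344). What the real code should do with that remainder is not modelled here.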
>
> >
> > Boris
> >
> > >
> > > --
> > > Chris Murphy
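The missing extents show up to readers as zeros between bytes 8192 and 65536. A rough way to locate such ranges in a copied file's contents is to scan for runs of all-zero blocks. The sketch below assumes a 4096-byte block size and works on an in-memory buffer; `toy_find_zero_run` and its types are hypothetical helpers, not part of LMDB or btrfs, and a legitimately zero-filled region of real data would be reported as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE 4096u /* assumed block size for the scan */

/* Half-open byte range [start, end) of consecutive all-zero blocks. */
struct toy_zero_run {
	size_t start;
	size_t end;
};

/* Return true if every byte in p[0..len) is zero. */
static bool toy_block_is_zero(const unsigned char *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (p[i] != 0)
			return false;
	return true;
}

/* Length of the block starting at off, clamped to the end of the buffer. */
static size_t toy_block_len(size_t len, size_t off)
{
	return len - off < TOY_BLOCK_SIZE ? len - off : TOY_BLOCK_SIZE;
}

/*
 * Find the first run of all-zero blocks at or after the block-aligned offset
 * from. Returns false if every remaining block contains non-zero data.
 */
static bool toy_find_zero_run(const unsigned char *buf, size_t len,
			      size_t from, struct toy_zero_run *out)
{
	size_t off = from;

	assert(from % TOY_BLOCK_SIZE == 0);

	/* Skip blocks that contain at least one non-zero byte. */
	while (off < len && !toy_block_is_zero(buf + off, toy_block_len(len, off)))
		off += toy_block_len(len, off);
	if (off >= len)
		return false;

	out->start = off;
	/* Extend the run for as long as the blocks stay all-zero. */
	while (off < len && toy_block_is_zero(buf + off, toy_block_len(len, off)))
		off += toy_block_len(len, off);
	out->end = off;
	return true;
}
```

To use it, read the suspect file into a buffer (or mmap it), call `toy_find_zero_run(buf, len, 0, &run)`, and compare the reported range against the same range in the known-good copy.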