Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4

From: Boris Burkov <boris@bur.io>
To: Chris Murphy <chris@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
Date: Wed, 15 Feb 2023 17:46:48 -0800	[thread overview]
Message-ID: <Y+2LCFrD4Qxff89Y@zen> (raw)
In-Reply-To: <Y+16BVPQiwf8e1J3@zen>

On Wed, Feb 15, 2023 at 04:34:13PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > > 
> > > 
> > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > >
> > > > I speculated that maybe it's similar to the issue we have with VM's 
> > > > when O_DIRECT is used, but it seems that's not the case here.
> > > 
> > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > > 
> > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > > 
> > 
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> > 
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> > 
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> > 
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
> > 
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
> 
> I believe the issue is indeed caused by faults reading the mapped region
> during direct io. Roughly what is happening is:
> 
> - we start the dio write (offset 8192 len 1826816)
> - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
>   creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
>   bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
>   resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
>   the full ordered extent, but the write that finished is partial, so
>   the check for entry->bytes_left fails, and we don't call
>   finish_ordered_fn and thus don't create a file extent item for this
>   endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
>   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
>   creates a new ordered extent for off 65536 len 1769472 and that
>   ordered extent proceeds as above but successfully, and we get the
>   second file extent.
> 
> I'm not yet sure how to fix this, but have a couple ideas/questions:
> 1. Is there anyway we can split off a partial ordered extent and finish
> it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
> overlaps with our new one on the second write of the partial write
> logic?
> 
> I'll play around and see if I can hack together a fix..

The following patch causes the problem to stop reproducing by splitting
the large ordered extent in the case of a short write and leaving it
alone otherwise. I haven't thoroughly tested it, or even thought it
through that well yet (e.g. I have no clue where that extract function
comes from!), but it's a start. I have to sign off for the evening, so
I will leave my investigation here for now.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index d8b90f95b157..016b1a77af71 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -647,6 +647,11 @@ static bool btrfs_submit_chunk(struct bio *bio, int mirror_num)
        }

        if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+
+               ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+               if (ret)
+                       goto fail_put_bio;
+
                if (use_append) {
                        bio->bi_opf &= ~REQ_OP_WRITE;
                        bio->bi_opf |= REQ_OP_ZONE_APPEND;

> 
> > 
> > Boris
> > 
> > > 
> > > -- 
> > > Chris Murphy