* LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Chris Murphy @ 2023-02-15 20:04 UTC
To: Btrfs BTRFS

Downstream bug report, reproducer test file, and gdb session transcript:
https://bugzilla.redhat.com/show_bug.cgi?id=2169947

I speculated that maybe it's similar to the issue we have with VMs when
O_DIRECT is used, but it seems that's not the case here.

--
Chris Murphy

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Chris Murphy @ 2023-02-15 20:16 UTC
To: Btrfs BTRFS

On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> Downstream bug report, reproducer test file, and gdb session transcript:
> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>
> I speculated that maybe it's similar to the issue we have with VMs
> when O_DIRECT is used, but it seems that's not the case here.

I can reproduce the mismatching checksums whether the test files are
datacow or nodatacow (using chattr +C). There are no kernel messages
during the tests.

Kernel 6.2-rc7 in my case; the bug report reproduces the problem on the
6.1, 6.0, and 5.17 kernel series as well.

--
Chris Murphy
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Filipe Manana @ 2023-02-15 21:41 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 8:30 PM Chris Murphy <chris@colorremedies.com> wrote:
> I can reproduce the mismatching checksums whether the test files are
> datacow or nodatacow (using chattr +C). There are no kernel messages
> during the tests.
>
> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1,
> 6.0, and 5.17 reproduce the problem.

I'm able to reproduce it too. I'll have a look at it, thanks.
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Boris Burkov @ 2023-02-15 23:21 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> I can reproduce the mismatching checksums whether the test files are
> datacow or nodatacow (using chattr +C). There are no kernel messages
> during the tests.
>
> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1,
> 6.0, and 5.17 reproduce the problem.

I was also able to reproduce on the current misc-next. However, when I
hacked the kernel to always fall back from DIO to buffered IO, it no
longer reproduced. So that's one hint.

The next observation comes from comparing the happy vs. unhappy file
extents on disk:
happy: https://pastebin.com/k4EPFKhc
unhappy: https://pastebin.com/hNSBR0yv

The broken DIO case is missing file extents between bytes 8192 and
65536, which corresponds to the observed zeros.

Next, at Josef's suggestion, I looked at IOMAP_DIO_PARTIAL and
instrumented that codepath. I observed a single successful write up to
8192 bytes, then a second write which first does a partial write from
8192 to 65536 and then faults in the rest of the iov_iter and finishes
the write.

I'm now trying to figure out how these partial writes might lead us to
not create all the EXTENT_DATA items for the file extents.

Boris
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Boris Burkov @ 2023-02-16  0:34 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> I was also able to reproduce on the current misc-next. However, when I
> hacked the kernel to always fall back from DIO to buffered IO, it no
> longer reproduced. So that's one hint.
>
> [...]
>
> I'm now trying to figure out how these partial writes might lead us to
> not create all the EXTENT_DATA items for the file extents.

I believe the issue is indeed caused by faults reading the mapped region
during direct IO. Roughly what is happening is:

- we start the dio write (offset 8192, len 1826816)
- __iomap_dio_rw calls iomap_iter, which calls btrfs_dio_iomap_begin,
  which creates an ordered extent for the full write.
- iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
  bytes and breaks out early, but submits the partial bio.
- the partial bio completes and calls the various endio callbacks,
  resulting in a call to btrfs_mark_ordered_io_finished.
- btrfs_mark_ordered_io_finished looks up the ordered extent and finds
  the full ordered extent, but the write that finished is partial, so
  the check for entry->bytes_left fails, and we don't call
  finish_ordered_fn and thus don't create a file extent item for this
  endio.
- the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
  (8192 + 57344), but we fully exit and re-enter __iomap_dio_rw, which
  creates a new ordered extent for offset 65536, len 1769472, and that
  ordered extent proceeds as above but successfully, so we get the
  second file extent.

I'm not yet sure how to fix this, but I have a couple of ideas/questions:
1. Is there any way we can split off a partial ordered extent and finish
   it when we get the partial write done?
2. Can we detect that there is an unfinished ordered extent that
   overlaps with our new one on the second write of the partial write
   logic?

I'll play around and see if I can hack together a fix.

Boris
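The accounting failure in the steps above can be sketched as a toy model. All names here (OrderedExtent, complete_io) are hypothetical and only illustrate the bytes_left check; the real structures and helpers live in fs/btrfs/ordered-data.c:

```python
# Toy model of the ordered extent accounting (hypothetical names; the
# real kernel code in fs/btrfs/ordered-data.c differs).

class OrderedExtent:
    def __init__(self, offset, length):
        self.offset = offset
        self.length = length
        self.bytes_left = length   # decremented as completed IO is recorded
        self.finished = False      # set only by the finish_ordered_fn path

    def complete_io(self, nbytes):
        """Model btrfs_mark_ordered_io_finished for one completed bio."""
        self.bytes_left -= nbytes
        if self.bytes_left == 0:
            # Only here would a file extent item be created for the range.
            self.finished = True
        return self.finished

# One ordered extent is created for the whole write (offset 8192,
# len 1826816), but the first bio faults after 57344 bytes and
# completes early:
oe = OrderedExtent(offset=8192, length=1826816)
assert oe.complete_io(57344) is False    # bytes_left > 0: no file extent item
assert oe.bytes_left == 1826816 - 57344  # remainder never submitted to this oe
# The retry creates a *new* ordered extent for the remainder instead of
# finishing this one, so [8192, 65536) never gets a file extent item.
```

The model only shows why the bytes_left check can never fire for the first ordered extent once the retry path abandons it.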
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Boris Burkov @ 2023-02-16  1:46 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 04:34:13PM -0800, Boris Burkov wrote:
> I believe the issue is indeed caused by faults reading the mapped region
> during direct IO.
>
> [...]
>
> I'm not yet sure how to fix this, but I have a couple of ideas/questions:
> 1. Is there any way we can split off a partial ordered extent and finish
>    it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
>    overlaps with our new one on the second write of the partial write
>    logic?
>
> I'll play around and see if I can hack together a fix.

The following patch causes the problem to stop reproducing by splitting
the large ordered extent in the case of a short write and leaving it
alone otherwise. I haven't thoroughly tested it, or even thought it
through that well yet (e.g. I have no clue where that extract function
comes from!), but it's a start. I have to sign off for the evening, so I
will leave my investigation here for now.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index d8b90f95b157..016b1a77af71 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -647,6 +647,11 @@ static bool btrfs_submit_chunk(struct bio *bio, int mirror_num)
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+
+		ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+		if (ret)
+			goto fail_put_bio;
+
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Christoph Hellwig @ 2023-02-16  5:58 UTC
To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Feb 15, 2023 at 05:46:48PM -0800, Boris Burkov wrote:
> The following patch causes the problem to stop reproducing by splitting
> the large ordered extent in the case of a short write and leaving it
> alone otherwise. I haven't thoroughly tested it, or even thought it
> through that well yet (e.g. I have no clue where that extract function
> comes from!), but it's a start. I have to sign off for the evening, so
> I will leave my investigation here for now.

The extract function is used for writes to zoned devices, which need to
record a physical location on a per-write basis. So what you've done
definitely works, as we already do it a little later for a subset of
writes.

I also think it fundamentally is the right thing to do, and I was
planning to do something similar for completely different reasons a
little down the road.

The downside right now is that you pay the price of an extra ordered
extent lookup for every submitted bio, and that an extra queue_work is
required for the completion. In my WIP work the former will eventually
go away, and I've not found any benchmark impact from the latter.
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Christoph Hellwig @ 2023-02-16  9:30 UTC
To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Feb 15, 2023 at 09:58:54PM -0800, Christoph Hellwig wrote:
> The extract function is used for writes to zoned devices, which need to
> record a physical location on a per-write basis. So what you've done
> definitely works, as we already do it a little later for a subset of
> writes.
>
> [...]

Thinking out loud a bit more:

- if we go down this route, calc_bio_boundaries also needs to set
  len_to_oe_boundary for non-zoned filesystems to ensure bios don't
  span ordered extents. They usually don't in my testing, and I have
  that change in my WIP tree for other reasons.

- btrfs_extract_ordered_extent as-is is actually overkill for non-zoned
  devices. We'll need a version of it that doesn't call split_zoned_em,
  as that is only needed for the zoned case.
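Christoph's len_to_oe_boundary point amounts to clamping each bio so it never crosses the end of its ordered extent. A minimal sketch, with a hypothetical helper name (the real enforcement sits in the btrfs bio submission path, not in a standalone function like this):

```python
def clamp_to_oe_boundary(bio_start, bio_len, oe_start, oe_len):
    """Hypothetical helper: limit a bio's length so it ends at or before
    the end of the ordered extent it was issued against, mirroring what
    len_to_oe_boundary enforces for zoned filesystems."""
    oe_end = oe_start + oe_len
    assert oe_start <= bio_start < oe_end, "bio must start inside the oe"
    return min(bio_len, oe_end - bio_start)

# A bio that would span past the ordered extent boundary is truncated:
assert clamp_to_oe_boundary(8192, 1 << 20, 8192, 57344) == 57344
# A bio fully inside the ordered extent is left alone:
assert clamp_to_oe_boundary(8192, 4096, 8192, 57344) == 4096
```

With the boundary enforced, each completed bio belongs to exactly one ordered extent, which is what makes per-bio extraction cheap to reason about.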
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Filipe Manana @ 2023-02-16 11:57 UTC
To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 12:42 AM Boris Burkov <boris@bur.io> wrote:
> I believe the issue is indeed caused by faults reading the mapped region
> during direct IO. Roughly what is happening is:
>
> - we start the dio write (offset 8192, len 1826816)
> - __iomap_dio_rw calls iomap_iter, which calls btrfs_dio_iomap_begin,
>   which creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
>   bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
>   resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
>   the full ordered extent, but the write that finished is partial, so
>   the check for entry->bytes_left fails, and we don't call
>   finish_ordered_fn and thus don't create a file extent item for this
>   endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
>   (8192 + 57344), but we fully exit and re-enter __iomap_dio_rw, which
>   creates a new ordered extent for offset 65536, len 1769472, and that
>   ordered extent proceeds as above but successfully, so we get the
>   second file extent.

So one thing doesn't add up here.

The first write attempt creates the ordered extent for the range
[8192, 1835008], and we then submit a bio only for the range
[8192, 65535], due to the page fault at 64K.

That means the ordered extent cannot be completed, as you said before,
since ->bytes_left will be > 0 even after the submitted bio completes.

However, the second time we enter btrfs_dio_iomap_begin(), for the
range [65536, 1835008], when we call lock_extent_direct() we should
find the previous ordered extent, for the range [8192, 1835008], which
intersects this second range - and that should make us wait for the
previous ordered extent until it completes, by calling
btrfs_start_ordered_extent() at lock_extent_direct(). This is clearly
not happening, as it would cause a deadlock/hang.

So something is missing in your analysis, and that's important to
figure out the best solution to the problem.

Thanks.
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Boris Burkov @ 2023-02-16 17:14 UTC
To: Filipe Manana; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 11:57:37AM +0000, Filipe Manana wrote:
> So one thing doesn't add up here.
>
> [...]
>
> However, the second time we enter btrfs_dio_iomap_begin(), for the
> range [65536, 1835008], when we call lock_extent_direct() we should
> find the previous ordered extent, for the range [8192, 1835008], which
> intersects this second range - and that should make us wait for the
> previous ordered extent until it completes, by calling
> btrfs_start_ordered_extent() at lock_extent_direct(). This is clearly
> not happening, as it would cause a deadlock/hang.
>
> So something is missing in your analysis, and that's important to
> figure out the best solution to the problem.

Good point. I was confused about the extra ordered extents floating
around as well. I believe I do have an explanation now. I think the
following annotated callchain should explain it decently, but the tl;dr
is that iomap_iter calls btrfs_dio_iomap_end, which marks the ordered
extent finished and clears the extent bits, but hits an error, so it
does not do much other endio work. (The error comes from
btrfs_mark_ordered_io_finished getting called without uptodate set.)

btrfs_direct_write
  __iomap_dio_rw
    iomap_iter
      btrfs_dio_iomap_begin   # creates the oe 8192 1835008
    iomap_dio_iter            # submits the partial bio 8192 57344 (CB1)
    btrfs_dio_iomap_end
      btrfs_mark_ordered_io_finished 65536 1769472
      # ^ finds and munges the oe, decides bytes_left is 0 and queues
      #   the full finish work (CB2)
  # partial write return
  __iomap_dio_rw              # btrfs_direct_write tries again
    iomap_iter
      btrfs_dio_iomap_begin   # makes the next oe
    iomap_dio_iter            # submits the remaining write 65536 1769472 (CB3)

btrfs_dio_end_io 8192 57344 (CB1)
  # finds the full oe and bails, as discussed above

btrfs_finish_ordered_io 65536 1769472 (CB2)
  # sees BTRFS_ORDERED_IOERR, exits and clears the extent

btrfs_dio_end_io 65536 1769472 (CB3)
  # succeeds on the new oe for that range and queues the full finish
  # work, which writes an extent

In my opinion, the part of this that feels like it is not behaving
properly is the interaction between the iomap_iter logic of
__iomap_dio_rw and the implementation of btrfs_dio_iomap_begin/end.
Something about the begin/end calls the generic loop is doing is getting
us to believe that the large extent was "finished", but with an error
set. If that is an appropriate way to handle the partial write, then
maybe the splitting I proposed earlier works as a fix. Otherwise we may
need to handle this case more explicitly in
iomap_iter/btrfs_dio_iomap_end.
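Boris's earlier idea 1 (splitting off the partial ordered extent once the short write is detected) reduces to simple interval arithmetic. A sketch with a hypothetical function name, using the offsets from this thread:

```python
def split_ordered_extent(offset, length, done_bytes):
    """Hypothetical split: carve a short write's completed prefix off an
    ordered extent, returning (finished head, pending tail) as
    (offset, length) pairs."""
    assert 0 < done_bytes < length, "split only makes sense mid-extent"
    head = (offset, done_bytes)
    tail = (offset + done_bytes, length - done_bytes)
    return head, tail

# The failing write: oe at offset 8192, len 1826816; short write of 57344.
head, tail = split_ordered_extent(8192, 1826816, 57344)
assert head == (8192, 57344)     # can be finished now: extent item written
assert tail == (65536, 1769472)  # matches the oe the retry actually creates
```

The tail matching the ordered extent the retry path creates anyway is what makes the split approach line up with the observed behavior.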
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
From: Filipe Manana @ 2023-02-16 18:00 UTC
To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 5:15 PM Boris Burkov <boris@bur.io> wrote:
> Good point. I was confused about the extra ordered extents floating
> around as well. I believe I do have an explanation now. I think the
> following annotated callchain should explain it decently, but the tl;dr
> is that iomap_iter calls btrfs_dio_iomap_end, which marks the ordered
> extent finished and clears the extent bits, but hits an error, so it
> does not do much other endio work. (The error comes from
> btrfs_mark_ordered_io_finished getting called without uptodate set.)
>
> [...]
>
> In my opinion, the part of this that feels like it is not behaving
> properly is the interaction between the iomap_iter logic of
> __iomap_dio_rw and the implementation of btrfs_dio_iomap_begin/end.
> Something about the begin/end calls the generic loop is doing is getting
> us to believe that the large extent was "finished", but with an error
> set. If that is an appropriate way to handle the partial write, then
> maybe the splitting I proposed earlier works as a fix. Otherwise we may
> need to handle this case more explicitly in
> iomap_iter/btrfs_dio_iomap_end.

Ok, so the problem is that btrfs_dio_iomap_end() detects the submitted
amount is less than expected, so it marks the ordered extent as not up
to date, setting the BTRFS_ORDERED_IOERR bit on it. That results in
having an unexpected hole for the range [8192, 65535], and no error
returned to btrfs_direct_write().

My initial thought was to truncate the ordered extent at
btrfs_dio_iomap_end(), similar to what we do at
btrfs_invalidate_folio(). I think that should work; however, we would
end up with a bookend extent (but so does your proposed fix), and I
don't see an easy way to get around that.

Thanks.
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 18:00 ` Filipe Manana @ 2023-02-16 18:49 ` Christoph Hellwig 2023-02-16 21:43 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Christoph Hellwig @ 2023-02-16 18:49 UTC (permalink / raw) To: Filipe Manana; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote: > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted > amount is less than expected, so it marks the ordered extents as not > up to date, setting the BTRFS_ORDERED_IOERR bit on it. > That results in having an unexpected hole for the range [8192, 65535], > and no error returned to btrfs_direct_write(). > > My initial thought was to truncate the ordered extent at > btrfs_dio_iomap_end(), similar to what we do at > btrfs_invalidate_folio(). > I think that should work, however we would end up with a bookend > extent (but so does your proposed fix), but I don't see an easy way to > get around that. Wouldn't a better way to handle this be to cache the ordered_extent in the btrfs_dio_data, and just reuse it on the next iteration if present and covering the range? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 18:49 ` Christoph Hellwig @ 2023-02-16 21:43 ` Filipe Manana 2023-02-16 22:45 ` Boris Burkov 0 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2023-02-16 21:43 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote: > > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted > > amount is less than expected, so it marks the ordered extents as not > > up to date, setting the BTRFS_ORDERED_IOERR bit on it. > > That results in having an unexpected hole for the range [8192, 65535], > > and no error returned to btrfs_direct_write(). > > > > My initial thought was to truncate the ordered extent at > > btrfs_dio_iomap_end(), similar to what we do at > > btrfs_invalidate_folio(). > > I think that should work, however we would end up with a bookend > > extent (but so does your proposed fix), but I don't see an easy way to > > get around that. > > Wouldn't a better way to handle this be to cache the ordered_extent in > the btrfs_dio_data, and just reuse it on the next iteration if present > and covering the range? That may work too, yes. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 21:43 ` Filipe Manana @ 2023-02-16 22:45 ` Boris Burkov 2023-02-17 11:19 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Boris Burkov @ 2023-02-16 22:45 UTC (permalink / raw) To: Filipe Manana; +Cc: Christoph Hellwig, Chris Murphy, Btrfs BTRFS On Thu, Feb 16, 2023 at 09:43:03PM +0000, Filipe Manana wrote: > On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote: > > > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted > > > amount is less than expected, so it marks the ordered extents as not > > > up to date, setting the BTRFS_ORDERED_IOERR bit on it. > > > That results in having an unexpected hole for the range [8192, 65535], > > > and no error returned to btrfs_direct_write(). > > > > > > My initial thought was to truncate the ordered extent at > > > btrfs_dio_iomap_end(), similar to what we do at > > > btrfs_invalidate_folio(). > > > I think that should work, however we would end up with a bookend > > > extent (but so does your proposed fix), but I don't see an easy way to > > > get around that. > > > > Wouldn't a better way to handle this be to cache the ordered_extent in > > the btrfs_dio_data, and just reuse it on the next iteration if present > > and covering the range? > > That may work too, yes. Quick update, I just got a preliminary version of this proposal working: - reuse btrfs_dio_data across calls to __iomap_dio_rw - store the dio ordered_extent when we create it in btrfs_dio_iomap_begin - modify btrfs_dio_iomap_end to not mark the unfinished ios done in the incomplete case. 
(and to drop the ordered extent on done or error) - modify btrfs_dio_iomap_begin to short-circuit when it has a cached ordered_extent The resulting behavior on this workload is: - write 8192 - finish OE, write file extent - write 57344 (no extent, cached OE) - re-enter __iomap_dio_rw with a live OE - skip locking extent, reserving space, etc. - write 1769472 - finish OE, write file extent and the file looks as if there were no partial write. I think this is a good structure for a fix to this bug, and plan to polish it up and send it soon, unless someone objects and thinks we should go a different way. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 22:45 ` Boris Burkov @ 2023-02-17 11:19 ` Filipe Manana 0 siblings, 0 replies; 26+ messages in thread From: Filipe Manana @ 2023-02-17 11:19 UTC (permalink / raw) To: Boris Burkov; +Cc: Christoph Hellwig, Chris Murphy, Btrfs BTRFS On Thu, Feb 16, 2023 at 10:45 PM Boris Burkov <boris@bur.io> wrote: > > On Thu, Feb 16, 2023 at 09:43:03PM +0000, Filipe Manana wrote: > > On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote: > > > > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted > > > > amount is less than expected, so it marks the ordered extents as not > > > > up to date, setting the BTRFS_ORDERED_IOERR bit on it. > > > > That results in having an unexpected hole for the range [8192, 65535], > > > > and no error returned to btrfs_direct_write(). > > > > > > > > My initial thought was to truncate the ordered extent at > > > > btrfs_dio_iomap_end(), similar to what we do at > > > > btrfs_invalidate_folio(). > > > > I think that should work, however we would end up with a bookend > > > > extent (but so does your proposed fix), but I don't see an easy way to > > > > get around that. > > > > > > Wouldn't a better way to handle this be to cache the ordered_extent in > > > the btrfs_dio_data, and just reuse it on the next iteration if present > > > and covering the range? > > > > That may work too, yes. > > Quick update, I just got a preliminary version of this proposal working: > - reuse btrfs_dio_data across calls to __iomap_dio_rw > - store the dio ordered_extent when we create it in btrfs_dio_iomap_begin > - modify btrfs_dio_iomap_end to not mark the unfinished ios done in the > incomplete case. 
(and to drop the ordered extent on done or error) > - modify btrfs_dio_iomap_begin to short-circuit when it has a cached > ordered_extent > > The resulting behavior on this workload is: > - write 8192 > - finish OE, write file extent > - write 57344 (no extent, cached OE) > - re-enter __iomap_dio_rw with a live OE > - skip locking extent, reserving space, etc. > - write 1769472 > - finish OE, write file extent > > and the file looks as if there were no partial write. I think this is a > good structure for a fix to this bug, and plan to polish it up and send > it soon, unless someone objects and thinks we should go a different way. Sounds good to me. Thanks. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-15 23:21 ` Boris Burkov 2023-02-16 0:34 ` Boris Burkov @ 2023-02-16 10:05 ` Qu Wenruo 2023-02-16 12:01 ` Filipe Manana 1 sibling, 1 reply; 26+ messages in thread From: Qu Wenruo @ 2023-02-16 10:05 UTC (permalink / raw) To: Boris Burkov, Chris Murphy; +Cc: Btrfs BTRFS On 2023/2/16 07:21, Boris Burkov wrote: > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote: >> >> >> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote: >>> Downstream bug report, reproducer test file, and gdb session transcript >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947 >>> >>> I speculated that maybe it's similar to the issue we have with VM's >>> when O_DIRECT is used, but it seems that's not the case here. >> >> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests. >> >> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem. >> > > I was also able to reproduce on the current misc-next. However, when I > hacked the kernel to always fall back from DIO to buffered IO, it no > longer reproduced. So that's one hint.. > > The next observation comes from comparing the happy vs unhappy file > extents on disk: > happy: https://pastebin.com/k4EPFKhc > unhappy: https://pastebin.com/hNSBR0yv > > The broken DIO case is missing file extents between bytes 8192 and 65536 > which corresponds to the observed zeros. > > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and > instrumented that codepath. I observed a single successful write to 8192 > bytes, then a second write which first does a partial write from 8192 to > 65536 and then faults in the rest of the iov_iter and finishes the > write. A little off-topic, as I'm not familiar with Direct IO at all. That fault (I guess means page fault?) 
means the Direct IO source (memory of the user space program) can not be accessible right now? (being swapped?) If that's the case, any idea why we can not wait for the page fault to fill the user space memory? I guess it's some deadlock but is not for sure. Thanks, Qu > > I'm now trying to figure out how these partial writes might lead us to > not create all the EXTENT_DATA items for the file extents. > > Boris > >> >> -- >> Chris Murphy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 10:05 ` Qu Wenruo @ 2023-02-16 12:01 ` Filipe Manana 2023-02-17 0:15 ` Qu Wenruo 0 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2023-02-16 12:01 UTC (permalink / raw) To: Qu Wenruo; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2023/2/16 07:21, Boris Burkov wrote: > > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote: > >> > >> > >> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote: > >>> Downstream bug report, reproducer test file, and gdb session transcript > >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > >>> > >>> I speculated that maybe it's similar to the issue we have with VM's > >>> when O_DIRECT is used, but it seems that's not the case here. > >> > >> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests. > >> > >> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem. > >> > > > > I was also able to reproduce on the current misc-next. However, when I > > hacked the kernel to always fall back from DIO to buffered IO, it no > > longer reproduced. So that's one hint.. > > > > The next observation comes from comparing the happy vs unhappy file > > extents on disk: > > happy: https://pastebin.com/k4EPFKhc > > unhappy: https://pastebin.com/hNSBR0yv > > > > The broken DIO case is missing file extents between bytes 8192 and 65536 > > which corresponds to the observed zeros. > > > > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and > > instrumented that codepath. I observed a single successful write to 8192 > > bytes, then a second write which first does a partial write from 8192 to > > 65536 and then faults in the rest of the iov_iter and finishes the > > write. 
> > A little off-topic, as I'm not familiar with Direct IO at all. > > That fault (I guess means page fault?) means the Direct IO source > (memory of the user space program) can not be accessible right now? > (being swapped?) > > If that's the case, any idea why we can not wait for the page fault to > fill the user space memory? Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter before starting the actual direct IO, and that will work and not result in any problems most of the time. However: 1) After that and before starting the write, all pages or some pages may have been evicted from memory due to memory pressure for e.g.; 2) If the iov_iter refers to a very large buffer, it's inefficient to fault in all pages. > I guess it's some deadlock but is not for sure. > > Thanks, > Qu > > > > I'm now trying to figure out how these partial writes might lead us to > > not create all the EXTENT_DATA items for the file extents. > > > > Boris > > > >> > >> -- > >> Chris Murphy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-16 12:01 ` Filipe Manana @ 2023-02-17 0:15 ` Qu Wenruo 2023-02-17 11:38 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Qu Wenruo @ 2023-02-17 0:15 UTC (permalink / raw) To: Filipe Manana; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS On 2023/2/16 20:01, Filipe Manana wrote: > On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2023/2/16 07:21, Boris Burkov wrote: >>> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote: >>>> >>>> >>>> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote: >>>>> Downstream bug report, reproducer test file, and gdb session transcript >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947 >>>>> >>>>> I speculated that maybe it's similar to the issue we have with VM's >>>>> when O_DIRECT is used, but it seems that's not the case here. >>>> >>>> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests. >>>> >>>> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem. >>>> >>> >>> I was also able to reproduce on the current misc-next. However, when I >>> hacked the kernel to always fall back from DIO to buffered IO, it no >>> longer reproduced. So that's one hint.. >>> >>> The next observation comes from comparing the happy vs unhappy file >>> extents on disk: >>> happy: https://pastebin.com/k4EPFKhc >>> unhappy: https://pastebin.com/hNSBR0yv >>> >>> The broken DIO case is missing file extents between bytes 8192 and 65536 >>> which corresponds to the observed zeros. >>> >>> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and >>> instrumented that codepath. 
I observed a single successful write to 8192 >>> bytes, then a second write which first does a partial write from 8192 to >>> 65536 and then faults in the rest of the iov_iter and finishes the >>> write. >> >> A little off-topic, as I'm not familiar with Direct IO at all. >> >> That fault (I guess means page fault?) means the Direct IO source >> (memory of the user space program) can not be accessible right now? >> (being swapped?) >> >> If that's the case, any idea why we can not wait for the page fault to >> fill the user space memory? > > Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter > before starting the actual direct IO, > and that will work and not result in any problems most of the time. > > However: > > 1) After that and before starting the write, all pages or some pages > may have been evicted from memory due to memory pressure for e.g.; > > 2) If the iov_iter refers to a very large buffer, it's inefficient to > fault in all pages. Indeed. As even we load in all the pages, there is no guarantee unless the user space program pins the pages in advance. Another question is, can we fall back to buffered IO half way? E.g. if we hit page fault for certain range, then fall back to buffered IO for the remaining part? Thanks, Qu > >> I guess it's some deadlock but is not for sure. >> >> Thanks, >> Qu >>> >>> I'm now trying to figure out how these partial writes might lead us to >>> not create all the EXTENT_DATA items for the file extents. >>> >>> Boris >>> >>>> >>>> -- >>>> Chris Murphy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-17 0:15 ` Qu Wenruo @ 2023-02-17 11:38 ` Filipe Manana 0 siblings, 0 replies; 26+ messages in thread From: Filipe Manana @ 2023-02-17 11:38 UTC (permalink / raw) To: Qu Wenruo; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS On Fri, Feb 17, 2023 at 12:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2023/2/16 20:01, Filipe Manana wrote: > > On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> > >> > >> > >> On 2023/2/16 07:21, Boris Burkov wrote: > >>> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote: > >>>> > >>>> > >>>> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote: > >>>>> Downstream bug report, reproducer test file, and gdb session transcript > >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > >>>>> > >>>>> I speculated that maybe it's similar to the issue we have with VM's > >>>>> when O_DIRECT is used, but it seems that's not the case here. > >>>> > >>>> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests. > >>>> > >>>> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem. > >>>> > >>> > >>> I was also able to reproduce on the current misc-next. However, when I > >>> hacked the kernel to always fall back from DIO to buffered IO, it no > >>> longer reproduced. So that's one hint.. > >>> > >>> The next observation comes from comparing the happy vs unhappy file > >>> extents on disk: > >>> happy: https://pastebin.com/k4EPFKhc > >>> unhappy: https://pastebin.com/hNSBR0yv > >>> > >>> The broken DIO case is missing file extents between bytes 8192 and 65536 > >>> which corresponds to the observed zeros. > >>> > >>> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and > >>> instrumented that codepath. 
I observed a single successful write to 8192 > >>> bytes, then a second write which first does a partial write from 8192 to > >>> 65536 and then faults in the rest of the iov_iter and finishes the > >>> write. > >> > >> A little off-topic, as I'm not familiar with Direct IO at all. > >> > >> That fault (I guess means page fault?) means the Direct IO source > >> (memory of the user space program) can not be accessible right now? > >> (being swapped?) > >> > >> If that's the case, any idea why we can not wait for the page fault to > >> fill the user space memory? > > > > Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter > > before starting the actual direct IO, > > and that will work and not result in any problems most of the time. > > > > However: > > > > 1) After that and before starting the write, all pages or some pages > > may have been evicted from memory due to memory pressure for e.g.; > > > > 2) If the iov_iter refers to a very large buffer, it's inefficient to > > fault in all pages. > > Indeed. As even we load in all the pages, there is no guarantee unless > the user space program pins the pages in advance. > > Another question is, can we fall back to buffered IO half way? > E.g. if we hit page fault for certain range, then fall back to buffered > IO for the remaining part? That doesn't solve any problem in this case. It would still result in the partial ordered extent getting marked with an IO error and result in the hole for the partially completed range. > > Thanks, > Qu > > > > >> I guess it's some deadlock but is not for sure. > >> > >> Thanks, > >> Qu > >>> > >>> I'm now trying to figure out how these partial writes might lead us to > >>> not create all the EXTENT_DATA items for the file extents. > >>> > >>> Boris > >>> > >>>> > >>>> -- > >>>> Chris Murphy ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy 2023-02-15 20:16 ` Chris Murphy @ 2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-06 15:47 ` David Sterba 1 sibling, 1 reply; 26+ messages in thread From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-04-05 13:07 UTC (permalink / raw) To: Btrfs BTRFS, Linux kernel regressions list [TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.] On 15.02.23 21:04, Chris Murphy wrote: > Downstream bug report, reproducer test file, and gdb session transcript > https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > > I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here. To properly track this, let me add this report as well to the tracking (I already track another report not mentioned in the commit log of the proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 ) #regzbot ^introduced 51bd9563b678 #regzbot dup-of: https://lore.kernel.org/all/efc853aa-0592-e43d-3ad1-c42d33f0ab6b@leemhuis.info/ #regzbot title btrfs: mdb_copy produces a corrupt database on btrfs #regzbot monitor https://lore.kernel.org/all/b7c72ffeb725c2a359965655df9827bdbeebfb4e.1679512207.git.boris@bur.io/ #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-04-06 15:47 ` David Sterba 2023-04-06 22:40 ` Neal Gompa 2023-04-07 6:10 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 2 replies; 26+ messages in thread From: David Sterba @ 2023-04-06 15:47 UTC (permalink / raw) To: Linux regressions mailing list; +Cc: Btrfs BTRFS, boris On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > [TLDR: I'm adding this report to the list of tracked Linux kernel > regressions; the text you find below is based on a few templates > paragraphs you might have encountered already in similar form. > See link in footer if these mails annoy you.] > > On 15.02.23 21:04, Chris Murphy wrote: > > Downstream bug report, reproducer test file, and gdb session transcript > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > > > > I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here. > > To properly track this, let me add this report as well to the tracking > (I already track another report not mentioned in the commit log of the > proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 ) There were several iterations of the fix and the final version is actually 11 patches (below), and it does not apply cleanly to current master because of other cleanups. Given that it's fixing a corruption it should be merged and backported (at least to 6.1), but we may need to rework it again and minimize/drop the cleanups. 
a8e793f97686 btrfs: add function to create and return an ordered extent b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent 87606cb305ca btrfs: split partial dio bios before submit ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 2023-04-06 15:47 ` David Sterba @ 2023-04-06 22:40 ` Neal Gompa 2023-04-07 6:10 ` Linux regression tracking (Thorsten Leemhuis) 1 sibling, 0 replies; 26+ messages in thread From: Neal Gompa @ 2023-04-06 22:40 UTC (permalink / raw) To: dsterba; +Cc: Linux regressions mailing list, Btrfs BTRFS, boris On Thu, Apr 6, 2023 at 11:52 AM David Sterba <dsterba@suse.cz> wrote: > > On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > > [TLDR: I'm adding this report to the list of tracked Linux kernel > > regressions; the text you find below is based on a few templates > > paragraphs you might have encountered already in similar form. > > See link in footer if these mails annoy you.] > > > > On 15.02.23 21:04, Chris Murphy wrote: > > > Downstream bug report, reproducer test file, and gdb session transcript > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > > > > > > I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here. > > > > To properly track this, let me add this report as well to the tracking > > (I already track another report not mentioned in the commit log of the > > proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 ) > > There were several iterations of the fix and the final version is > actually 11 patches (below), and it does not apply cleanly to current > master because of other cleanups. > > Given that it's fixing a corruption it should be merged and backported > (at least to 6.1), but we may need to rework it again and minimize/drop > the cleanups. 
> > a8e793f97686 btrfs: add function to create and return an ordered extent > b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent > c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio > d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent > 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent > 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent > 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent > 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em > 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent > df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent > 87606cb305ca btrfs: split partial dio bios before submit At least from Fedora's perspective, we've shifted to the 6.2 kernel series now. If it's reasonable to get backported to there ASAP, that'll help for Fedora (where this was discovered). -- 真実はいつも一つ!/ Always, there's only one truth! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-06 15:47 ` David Sterba
  2023-04-06 22:40 ` Neal Gompa
@ 2023-04-07  6:10 ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-08  0:08 ` Boris Burkov
  2023-04-11 19:27 ` David Sterba
  1 sibling, 2 replies; 26+ messages in thread

From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-07  6:10 UTC (permalink / raw)
To: dsterba, Linux regressions mailing list; +Cc: Btrfs BTRFS, boris

On 06.04.23 17:47, David Sterba wrote:
> On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>> regressions; the text you find below is based on a few templates
>> paragraphs you might have encountered already in similar form.
>> See link in footer if these mails annoy you.]
>>
>> On 15.02.23 21:04, Chris Murphy wrote:
>>> Downstream bug report, reproducer test file, and gdb session transcript
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>
>>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
>>
>> To properly track this, let me add this report as well to the tracking
>> (I already track another report not mentioned in the commit log of the
>> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
>
> There were several iterations of the fix and the final version is
> actually 11 patches (below), and it does not apply cleanly to current
> master because of other cleanups.
>
> Given that it's fixing a corruption it should be merged and backported
> (at least to 6.1), but we may need to rework it again and minimize/drop
> the cleanups.
>
> a8e793f97686 btrfs: add function to create and return an ordered extent
> b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> 87606cb305ca btrfs: split partial dio bios before submit

David, many thx for the update; and thx Boris for all your work on this.

I kept a loose eye on this and noticed that fixing this turned out to be
quite complicated and thus required quite a bit of time. Well, that's a
bit unfortunate, but how it is sometimes, so nothing to worry about.
Makes me wonder if "revert the culprit temporarily, to get this fixed
quickly" was really properly evaluated in this case (if it was, sorry, I
might have missed it or forgotten). But whatever, for this particular
regression that's afaics not worth discussing anymore at this point.

Again, thx to everyone involved helping to get this fixed.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

P.S.: let me use this opportunity to update the regzbot status

#regzbot fix: btrfs: split partial dio bios before submit

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-07  6:10 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-08  0:08 ` Boris Burkov
  2023-04-11 19:27 ` David Sterba
  1 sibling, 0 replies; 26+ messages in thread

From: Boris Burkov @ 2023-04-08  0:08 UTC (permalink / raw)
To: Linux regressions mailing list; +Cc: dsterba, Btrfs BTRFS

On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 06.04.23 17:47, David Sterba wrote:
> > On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> >> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >> regressions; the text you find below is based on a few templates
> >> paragraphs you might have encountered already in similar form.
> >> See link in footer if these mails annoy you.]
> >>
> >> On 15.02.23 21:04, Chris Murphy wrote:
> >>> Downstream bug report, reproducer test file, and gdb session transcript
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>
> >>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> >>
> >> To properly track this, let me add this report as well to the tracking
> >> (I already track another report not mentioned in the commit log of the
> >> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
> >
> > There were several iterations of the fix and the final version is
> > actually 11 patches (below), and it does not apply cleanly to current
> > master because of other cleanups.
> >
> > Given that it's fixing a corruption it should be merged and backported
> > (at least to 6.1), but we may need to rework it again and minimize/drop
> > the cleanups.
> >
> > a8e793f97686 btrfs: add function to create and return an ordered extent
> > b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> > c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> > d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> > 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> > 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> > 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> > 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> > 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> > df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> > 87606cb305ca btrfs: split partial dio bios before submit
>
> David, many thx for the update; and thx Boris for all your work on this.
>
> I kept a loose eye on this and noticed that fixing this turned out to be
> quite complicated and thus required quite a bit of time. Well, that's a
> bit unfortunate, but how it is sometimes, so nothing to worry about.
> Makes me wonder if "revert the culprit temporarily, to get this fixed
> quickly" was really properly evaluated in this case (if it was, sorry, I
> might have missed it or forgotten). But whatever, for this particular
> regression that's afaics not worth discussing anymore at this point.

In this case, the broken patch was also a serious fix for a deadlock, so
I worried reverting it would be "out of the frying pan and into the
fire", and it never really entered my mind as an option. I'll try to
keep that option top of mind next time, though, as it did take a while
to iterate towards a (hopefully) successful fix.

Also thanks for linking that other report. I had no clue that someone
had already found the previous patch on 2/15. I wasn't sure of it till
the next day, when I finished root causing the bug, IIRC.

Thanks,
Boris

>
> Again, thx to everyone involved helping to get this fixed.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> P.S.: let me use this opportunity to update the regzbot status
>
> #regzbot fix: btrfs: split partial dio bios before submit

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-07  6:10 ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-08  0:08 ` Boris Burkov
@ 2023-04-11 19:27 ` David Sterba
  2023-04-12  9:57 ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 26+ messages in thread

From: David Sterba @ 2023-04-11 19:27 UTC (permalink / raw)
To: Linux regressions mailing list; +Cc: dsterba, Btrfs BTRFS, boris

On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 06.04.23 17:47, David Sterba wrote:
> > On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> >> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >> regressions; the text you find below is based on a few templates
> >> paragraphs you might have encountered already in similar form.
> >> See link in footer if these mails annoy you.]
> >>
> >> On 15.02.23 21:04, Chris Murphy wrote:
> >>> Downstream bug report, reproducer test file, and gdb session transcript
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>
> >>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> >>
> >> To properly track this, let me add this report as well to the tracking
> >> (I already track another report not mentioned in the commit log of the
> >> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
> >
> > There were several iterations of the fix and the final version is
> > actually 11 patches (below), and it does not apply cleanly to current
> > master because of other cleanups.
> >
> > Given that it's fixing a corruption it should be merged and backported
> > (at least to 6.1), but we may need to rework it again and minimize/drop
> > the cleanups.
> >
> > a8e793f97686 btrfs: add function to create and return an ordered extent
> > b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> > c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> > d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> > 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> > 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> > 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> > 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> > 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> > df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> > 87606cb305ca btrfs: split partial dio bios before submit
>
> David, many thx for the update; and thx Boris for all your work on this.
>
> I kept a loose eye on this and noticed that fixing this turned out to be
> quite complicated and thus required quite a bit of time. Well, that's a
> bit unfortunate, but how it is sometimes, so nothing to worry about.
> Makes me wonder if "revert the culprit temporarily, to get this fixed
> quickly" was really properly evaluated in this case (if it was, sorry, I
> might have missed it or forgotten).

I haven't evaluated a revert before; the patch being fixed actually
fixes a deadlock, and an fstests case tests it. A clean revert works on
5.16 but not on the most recent stable version (6.1).

I have a backport on top of 6.2, so once the patches land in Linus' tree
they can be sent for stable backport, but I don't have an ETA. It's hard
to plan things when we're that close to release; one possibility is to
send the patches now and see.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-11 19:27 ` David Sterba
@ 2023-04-12  9:57 ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 26+ messages in thread

From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-12  9:57 UTC (permalink / raw)
To: dsterba, Linux regressions mailing list; +Cc: Btrfs BTRFS, boris

On 11.04.23 21:27, David Sterba wrote:
> On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 06.04.23 17:47, David Sterba wrote:
>>> On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> On 15.02.23 21:04, Chris Murphy wrote:
>>>>> Downstream bug report, reproducer test file, and gdb session transcript
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>>>
>>>>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
>>>>
>>>> To properly track this, let me add this report as well to the tracking
>>>> (I already track another report not mentioned in the commit log of the
>>>> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
>>>
>>> There were several iterations of the fix and the final version is
>>> actually 11 patches (below), and it does not apply cleanly to current
>>> master because of other cleanups.
>>>
>>> Given that it's fixing a corruption it should be merged and backported
>>> (at least to 6.1), but we may need to rework it again and minimize/drop
>>> the cleanups.
>>>
>>> a8e793f97686 btrfs: add function to create and return an ordered extent
>>> b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
>>> c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
>>> d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
>>> 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
>>> 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
>>> 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
>>> 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
>>> 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
>>> df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
>>> 87606cb305ca btrfs: split partial dio bios before submit
>>
>> David, many thx for the update; and thx Boris for all your work on this.
>>
>> I kept a loose eye on this and noticed that fixing this turned out to be
>> quite complicated and thus required quite a bit of time. Well, that's a
>> bit unfortunate, but how it is sometimes, so nothing to worry about.
>> Makes me wonder if "revert the culprit temporarily, to get this fixed
>> quickly" was really properly evaluated in this case (if it was, sorry, I
>> might have missed it or forgotten).
>
> I haven't evaluated a revert before, [...]

No worries, I just try to make that option something that developers
consider and use more often when dealing with regressions -- especially
when the problem, like in this case, made it to stable trees (either
through a backport or a new mainline release).

> I have a backport on top of 6.2, so once the patches land in Linus' tree
> they can be sent for stable backport, but I don't have an ETA. It's hard
> to plan things when we're that close to release, one possibility is to
> send the patches now and see.

Thx for your work. And don't feel rushed due to my tracking. There are
risks involved in this and experts like you are the best to judge when
the right time to mainline and backport fixes is.

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2023-04-12  9:57 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
2023-02-15 20:16 ` Chris Murphy
2023-02-15 21:41 ` Filipe Manana
2023-02-15 23:21 ` Boris Burkov
2023-02-16  0:34 ` Boris Burkov
2023-02-16  1:46 ` Boris Burkov
2023-02-16  5:58 ` Christoph Hellwig
2023-02-16  9:30 ` Christoph Hellwig
2023-02-16 11:57 ` Filipe Manana
2023-02-16 17:14 ` Boris Burkov
2023-02-16 18:00 ` Filipe Manana
2023-02-16 18:49 ` Christoph Hellwig
2023-02-16 21:43 ` Filipe Manana
2023-02-16 22:45 ` Boris Burkov
2023-02-17 11:19 ` Filipe Manana
2023-02-16 10:05 ` Qu Wenruo
2023-02-16 12:01 ` Filipe Manana
2023-02-17  0:15 ` Qu Wenruo
2023-02-17 11:38 ` Filipe Manana
2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-06 15:47 ` David Sterba
2023-04-06 22:40 ` Neal Gompa
2023-04-07  6:10 ` Linux regression tracking (Thorsten Leemhuis)
2023-04-08  0:08 ` Boris Burkov
2023-04-11 19:27 ` David Sterba
2023-04-12  9:57 ` Linux regression tracking (Thorsten Leemhuis)