* About the md-bitmap behavior @ 2022-06-20 7:29 Qu Wenruo 2022-06-20 7:48 ` Wols Lists 0 siblings, 1 reply; 14+ messages in thread From: Qu Wenruo @ 2022-06-20 7:29 UTC (permalink / raw) To: linux-raid, linux-block Hi, Recently I'm trying to implement a write-intent bitmap for btrfs to address its write-hole problems for RAID56. (In theory, btrfs only needs to know where a partial stripe write didn't finish properly, and do a mandatory scrub for those stripes before mount to address it.) My initial assumption for a write-intent bitmap is that, before any write can be submitted, the corresponding bit(s) must be set in the bitmap, and the bitmap must be flushed to disk; only then can the bio really be submitted. Thus functions like md_bitmap_startwrite() should not only set the bits, but also submit and flush the bitmap update (with some bio plugging to optimize). But to my surprise, md_bitmap_startwrite() really just sets the bits; there is no obvious submit/flush path. Is my assumption about write-intent bitmaps completely wrong, or is there some special handling for the md write-intent bitmap? Thanks, Qu ^ permalink raw reply [flat|nested] 14+ messages in thread
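The ordering Qu assumes can be sketched as a toy model (Python for illustration; all class and function names here are invented, not actual md or btrfs APIs). The safety invariant is that a stripe's bit must be set and persisted before any data write to that stripe is submitted:

```python
# Toy model of the assumed write-intent ordering: a stripe's bit must be
# set *and* flushed to disk before a data write to that stripe may be
# submitted. Names are illustrative, not md/btrfs APIs.

class WriteIntentBitmap:
    def __init__(self, nr_stripes):
        self.in_memory = [0] * nr_stripes   # dirty bits, RAM copy
        self.on_disk = [0] * nr_stripes     # what has been flushed

    def startwrite(self, stripe):
        """Set the bit and flush it before the caller may submit I/O."""
        self.in_memory[stripe] = 1
        self.flush()

    def flush(self):
        self.on_disk = list(self.in_memory)

    def endwrite(self, stripe):
        """Clearing bits can be lazy and async; no flush is needed
        for correctness, only to shrink the later resync window."""
        self.in_memory[stripe] = 0

def submit_data_write(bitmap, stripe):
    # The invariant: the *persisted* bitmap must already cover us.
    assert bitmap.on_disk[stripe] == 1, "bit not flushed before write!"
    return "submitted"

bm = WriteIntentBitmap(16)
bm.startwrite(5)
print(submit_data_write(bm, 5))   # prints: submitted
```

Submitting to a stripe whose bit was never flushed trips the assertion, which is the failure mode the flush-before-write ordering exists to prevent.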
* Re: About the md-bitmap behavior 2022-06-20 7:29 About the md-bitmap behavior Qu Wenruo @ 2022-06-20 7:48 ` Wols Lists 2022-06-20 7:56 ` Qu Wenruo 0 siblings, 1 reply; 14+ messages in thread From: Wols Lists @ 2022-06-20 7:48 UTC (permalink / raw) To: Qu Wenruo, linux-raid, linux-block On 20/06/2022 08:29, Qu Wenruo wrote: > Hi, > > Recently I'm trying to implement a write-intent bitmap for btrfs to > address its write-hole problems for RAID56. Is there any reason you want a bit-map? Not a journal? The write-hole has been addressed with journaling already, and this will be adding a new and not-needed feature - not saying it wouldn't be nice to have, but do we need another way to skin this cat? Cheers, Wol ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: About the md-bitmap behavior 2022-06-20 7:48 ` Wols Lists @ 2022-06-20 7:56 ` Qu Wenruo 2022-06-20 9:56 ` Wols Lists 0 siblings, 1 reply; 14+ messages in thread From: Qu Wenruo @ 2022-06-20 7:56 UTC (permalink / raw) To: Wols Lists, linux-raid, linux-block On 2022/6/20 15:48, Wols Lists wrote: > On 20/06/2022 08:29, Qu Wenruo wrote: >> Hi, >> >> Recently I'm trying to implement a write-intent bitmap for btrfs to >> address its write-hole problems for RAID56. > > Is there any reason you want a bit-map? Not a journal? For btrfs, it's a tradeoff. A bitmap is a little easier, and way less data to write back. And since btrfs already has all of its metadata, and quite some of its data, protected by COW (and checksum), a btrfs write-intent bitmap is enough to close the write-hole already. Although we may want to implement a journal later, mostly to be able to address combined cases, like a power loss followed by a missing device at recovery time. > > The write-hole has been addressed with journaling already, and this will > be adding a new and not-needed feature - not saying it wouldn't be nice > to have, but do we need another way to skin this cat? I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a completely different thing. Here I'm just trying to understand how the md-bitmap works, so that I can do a proper bitmap for btrfs RAID56. Thanks, Qu > > Cheers, > Wol ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: About the md-bitmap behavior 2022-06-20 7:56 ` Qu Wenruo @ 2022-06-20 9:56 ` Wols Lists 2022-06-22 2:15 ` Doug Ledford 0 siblings, 1 reply; 14+ messages in thread From: Wols Lists @ 2022-06-20 9:56 UTC (permalink / raw) To: Qu Wenruo, linux-raid, linux-block On 20/06/2022 08:56, Qu Wenruo wrote: >> The write-hole has been addressed with journaling already, and this will >> be adding a new and not-needed feature - not saying it wouldn't be nice >> to have, but do we need another way to skin this cat? > > I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a > completely different thing. > > Here I'm just trying to understand how the md-bitmap works, so that I > can do a proper bitmap for btrfs RAID56. Ah. Okay. Neil Brown is likely to be the best help here as I believe he wrote a lot of the code, although I don't think he's much involved with md-raid any more. Cheers, Wol ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: About the md-bitmap behavior 2022-06-20 9:56 ` Wols Lists @ 2022-06-22 2:15 ` Doug Ledford 2022-06-22 2:37 ` Qu Wenruo 0 siblings, 1 reply; 14+ messages in thread From: Doug Ledford @ 2022-06-22 2:15 UTC (permalink / raw) To: Wols Lists, Qu Wenruo, linux-raid, linux-block On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: > On 20/06/2022 08:56, Qu Wenruo wrote: > > > The write-hole has been addressed with journaling already, and > > > this will > > > be adding a new and not-needed feature - not saying it wouldn't be > > > nice > > > to have, but do we need another way to skin this cat? > > > > I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a > > completely different thing. > > > > Here I'm just trying to understand how the md-bitmap works, so that > > I > > can do a proper bitmap for btrfs RAID56. > > Ah. Okay. > > Neil Brown is likely to be the best help here as I believe he wrote a > lot of the code, although I don't think he's much involved with md- > raid > any more. I can't speak to how it is today, but I know it was *designed* to be sync flush of the dirty bit setting, then lazy, async write out of the clear bits. But, yes, in order for the design to be reliable, you must flush out the dirty bits before you put writes in flight. One thing I'm not sure about though, is that MD RAID5/6 uses fixed stripes. I thought btrfs, since it was an allocation filesystem, didn't have to use full stripes? Am I wrong about that? Because it would seem that if your data isn't necessarily in full stripes, then a bitmap might not work so well since it just marks a range of full stripes as "possibly dirty, we were writing to them, do a parity resync to make sure". In any case, Wols is right, probably want to ping Neil on this. Might need to ping him directly though. Not sure he'll see it just on the list. 
-- Doug Ledford <dledford@redhat.com> GPG KeyID: B826A3330E572FDD Fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD ^ permalink raw reply [flat|nested] 14+ messages in thread
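Doug's point that the bitmap marks "a range of full stripes" as possibly dirty can be made concrete with a toy granularity calculation (Python; the 64 MiB region per bit is an arbitrary example value, not md's default, whose bitmap chunk size is configurable):

```python
# One bitmap bit covers a whole region of the array. Any write inside
# the region dirties that single bit, and a post-crash resync must then
# recheck the entire region. The region size is an arbitrary example.

SECTOR = 512                              # bytes per sector
BITMAP_CHUNK = 64 * 1024 * 1024           # bytes covered per bit

def bit_for(sector):
    """Which bitmap bit covers this array sector?"""
    return (sector * SECTOR) // BITMAP_CHUNK

def sectors_covered(bit):
    """The full sector range one dirty bit forces a resync to recheck."""
    per_bit = BITMAP_CHUNK // SECTOR
    return range(bit * per_bit, (bit + 1) * per_bit)

# Two writes 1 MiB apart dirty the same bit...
assert bit_for(0) == bit_for(2048)
# ...and clearing it later requires rechecking the whole 64 MiB region.
assert len(sectors_covered(0)) == BITMAP_CHUNK // SECTOR
```

This coarseness is the design tradeoff: a small, cheaply flushed bitmap in exchange for resyncing more than was strictly written.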
* Re: About the md-bitmap behavior 2022-06-22 2:15 ` Doug Ledford @ 2022-06-22 2:37 ` Qu Wenruo 2022-06-22 22:32 ` NeilBrown 0 siblings, 1 reply; 14+ messages in thread From: Qu Wenruo @ 2022-06-22 2:37 UTC (permalink / raw) To: Doug Ledford, Wols Lists, linux-raid, linux-block, NeilBrown On 2022/6/22 10:15, Doug Ledford wrote: > On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: >> On 20/06/2022 08:56, Qu Wenruo wrote: >>>> The write-hole has been addressed with journaling already, and >>>> this will >>>> be adding a new and not-needed feature - not saying it wouldn't be >>>> nice >>>> to have, but do we need another way to skin this cat? >>> >>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a >>> completely different thing. >>> >>> Here I'm just trying to understand how the md-bitmap works, so that >>> I >>> can do a proper bitmap for btrfs RAID56. >> >> Ah. Okay. >> >> Neil Brown is likely to be the best help here as I believe he wrote a >> lot of the code, although I don't think he's much involved with md- >> raid >> any more. > > I can't speak to how it is today, but I know it was *designed* to be > sync flush of the dirty bit setting, then lazy, async write out of the > clear bits. But, yes, in order for the design to be reliable, you must > flush out the dirty bits before you put writes in flight. Thank you very much for confirming my concern. So maybe it was me not reading the md-bitmap code carefully enough to see the full picture. > > One thing I'm not sure about though, is that MD RAID5/6 uses fixed > stripes. I thought btrfs, since it was an allocation filesystem, didn't > have to use full stripes? Am I wrong about that? Unfortunately, allocation only happens at the level of RAID56 chunks. Inside a RAID56 chunk, the underlying devices still need to follow the regular RAID56 full-stripe scheme. Thus btrfs RAID56 is still the same regular RAID56 inside one btrfs RAID56 chunk, but without a bitmap/journal. 
> Because it would seem > that if your data isn't necessarily in full stripes, then a bitmap might > not work so well since it just marks a range of full stripes as > "possibly dirty, we were writing to them, do a parity resync to make > sure". The resync part is where btrfs shines: the extra csum (for the untouched part) and metadata COW ensure we only see the old, untouched data, and with the extra csum we can safely rebuild the full stripe. Thus as long as no device is missing, a write-intent bitmap is enough to address the write hole in btrfs (at least for COW-protected data and all metadata). > > In any case, Wols is right, probably want to ping Neil on this. Might > need to ping him directly though. Not sure he'll see it just on the > list. > Adding Neil to this thread. Any clue on the existing md_bitmap_startwrite() behavior? Thanks, Qu ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: About the md-bitmap behavior 2022-06-22 2:37 ` Qu Wenruo @ 2022-06-22 22:32 ` NeilBrown 2022-06-22 23:00 ` Song Liu 2022-06-23 0:39 ` Qu Wenruo 0 siblings, 2 replies; 14+ messages in thread From: NeilBrown @ 2022-06-22 22:32 UTC (permalink / raw) To: Qu Wenruo; +Cc: Doug Ledford, Wols Lists, linux-raid, linux-block On Wed, 22 Jun 2022, Qu Wenruo wrote: > > On 2022/6/22 10:15, Doug Ledford wrote: > > On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: > >> On 20/06/2022 08:56, Qu Wenruo wrote: > >>>> The write-hole has been addressed with journaling already, and > >>>> this will > >>>> be adding a new and not-needed feature - not saying it wouldn't be > >>>> nice > >>>> to have, but do we need another way to skin this cat? > >>> > >>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a > >>> completely different thing. > >>> > >>> Here I'm just trying to understand how the md-bitmap works, so that > >>> I > >>> can do a proper bitmap for btrfs RAID56. > >> > >> Ah. Okay. > >> > >> Neil Brown is likely to be the best help here as I believe he wrote a > >> lot of the code, although I don't think he's much involved with md- > >> raid > >> any more. > > > > I can't speak to how it is today, but I know it was *designed* to be > > sync flush of the dirty bit setting, then lazy, async write out of the > > clear bits. But, yes, in order for the design to be reliable, you must > > flush out the dirty bits before you put writes in flight. > > Thank you very much confirming my concern. > > So maybe it's me not checking the md-bitmap code carefully enough to > expose the full picture. > > > > > One thing I'm not sure about though, is that MD RAID5/6 uses fixed > > stripes. I thought btrfs, since it was an allocation filesystem, didn't > > have to use full stripes? Am I wrong about that? > > Unfortunately, we only go allocation for the RAID56 chunks. In side a > RAID56 the underlying devices still need to go the regular RAID56 full > stripe scheme. 
> > Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs > RAID56 chunk, but without bitmap/journal. > > > Because it would seem > > that if your data isn't necessarily in full stripes, then a bitmap might > > not work so well since it just marks a range of full stripes as > > "possibly dirty, we were writing to them, do a parity resync to make > > sure". > > For the resync part is where btrfs shines, as the extra csum (for the > untouched part) and metadata COW ensures us only see the old untouched > data, and with the extra csum, we can safely rebuild the full stripe. > > Thus as long as no device is missing, a write-intent-bitmap is enough to > address the write hole in btrfs (at least for COW protected data and all > metadata). > > > > > In any case, Wols is right, probably want to ping Neil on this. Might > > need to ping him directly though. Not sure he'll see it just on the > > list. > > > > Adding Neil into this thread. Any clue on the existing > md_bitmap_startwrite() behavior? md_bitmap_startwrite() is used to tell the bitmap code that the raid module is about to start writing at a location. This may result in md_bitmap_file_set_bit() being called to set a bit in the in-memory copy of the bitmap, and to mark that page of the bitmap as BITMAP_PAGE_DIRTY. Before raid actually submits the writes to the device it will call md_bitmap_unplug() which will submit the writes and wait for them to complete. There is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING" which says a few things about how raid5 ensures things happen in the right order. However I don't think any sort of bitmap can solve the write-hole problem for RAID5 - even in btrfs. The problem is that if the host crashes while the array is degraded and while some write requests were in-flight, then you might have lost data. i.e. to update a block you must write both that block and the parity block. If you actually wrote neither or both, everything is fine. 
If you wrote one but not the other then you CANNOT recover the data that was on the missing device (there must be a missing device as the array is degraded). Even having checksums of everything is not enough to recover that missing block. You must either: 1/ have a safe duplicate of the blocks being written, so they can be recovered and re-written after a crash. This is what journalling does. Or 2/ Only write to locations which don't contain valid data. i.e. always write full stripes to locations which are unused on each device. This way you cannot lose existing data. Worst case: that whole stripe is ignored. This is how I would handle RAID5 in a copy-on-write filesystem. However, I see you wrote: > Thus as long as no device is missing, a write-intent-bitmap is enough to > address the write hole in btrfs (at least for COW protected data and all > metadata). That doesn't make sense. If no device is missing, then there is no write hole. If no device is missing, all you need to do is recalculate the parity blocks on any stripe that was recently written. In md we use the write-intent-bitmap. In btrfs I would expect that you would already have some way of knowing where recent writes happened, so you can validate the various checksums. That should be sufficient to recalculate the parity. I'd be very surprised if btrfs doesn't already do this. So I'm somewhat confused as to what your real goal is. NeilBrown ^ permalink raw reply [flat|nested] 14+ messages in thread
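Neil's degraded-array scenario can be demonstrated with a few bytes of XOR arithmetic (a toy sketch; 4-byte "blocks" stand in for real chunks). With D2's device missing, D2 exists only as D1 XOR P, so a torn write that lands the new D1 but not the matching parity silently corrupts the reconstruction:

```python
# Why a bitmap alone can't close the write hole on a *degraded* array:
# once D1 and P disagree, D1 ^ P no longer equals the old D2, and no
# checksum can conjure the lost bytes back.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_d1, old_d2 = b"AAAA", b"BBBB"
old_p = xor(old_d1, old_d2)            # RAID5 parity of the old stripe

# Degraded: D2's device is gone; its content is only implied by D1 ^ P.
assert xor(old_d1, old_p) == old_d2    # reconstruction works... so far

# Crash mid-update: the new D1 hit the disk, the new parity did not.
new_d1 = b"CCCC"
on_disk_d1, on_disk_p = new_d1, old_p  # torn write

reconstructed_d2 = xor(on_disk_d1, on_disk_p)
assert reconstructed_d2 != old_d2      # the write hole: D2 is lost
```

A dirty bit can tell you this stripe is suspect, but with the real D2 unreadable there is nothing consistent left to rebuild it from; only a journaled duplicate (option 1) or writing full stripes to unused locations (option 2) avoids the loss.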
* Re: About the md-bitmap behavior 2022-06-22 22:32 ` NeilBrown @ 2022-06-22 23:00 ` Song Liu 2022-06-23 0:53 ` Qu Wenruo 2022-06-23 0:39 ` Qu Wenruo 1 sibling, 1 reply; 14+ messages in thread From: Song Liu @ 2022-06-22 23:00 UTC (permalink / raw) To: NeilBrown; +Cc: Qu Wenruo, Doug Ledford, Wols Lists, linux-raid, linux-block On Wed, Jun 22, 2022 at 3:33 PM NeilBrown <neilb@suse.de> wrote: > > On Wed, 22 Jun 2022, Qu Wenruo wrote: > > > > On 2022/6/22 10:15, Doug Ledford wrote: > > > On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: > > >> On 20/06/2022 08:56, Qu Wenruo wrote: > > >>>> The write-hole has been addressed with journaling already, and > > >>>> this will > > >>>> be adding a new and not-needed feature - not saying it wouldn't be > > >>>> nice > > >>>> to have, but do we need another way to skin this cat? > > >>> > > >>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a > > >>> completely different thing. > > >>> > > >>> Here I'm just trying to understand how the md-bitmap works, so that > > >>> I > > >>> can do a proper bitmap for btrfs RAID56. > > >> > > >> Ah. Okay. > > >> > > >> Neil Brown is likely to be the best help here as I believe he wrote a > > >> lot of the code, although I don't think he's much involved with md- > > >> raid > > >> any more. > > > > > > I can't speak to how it is today, but I know it was *designed* to be > > > sync flush of the dirty bit setting, then lazy, async write out of the > > > clear bits. But, yes, in order for the design to be reliable, you must > > > flush out the dirty bits before you put writes in flight. > > > > Thank you very much confirming my concern. > > > > So maybe it's me not checking the md-bitmap code carefully enough to > > expose the full picture. > > > > > > > > One thing I'm not sure about though, is that MD RAID5/6 uses fixed > > > stripes. I thought btrfs, since it was an allocation filesystem, didn't > > > have to use full stripes? Am I wrong about that? 
> > > > Unfortunately, we only go allocation for the RAID56 chunks. In side a > > RAID56 the underlying devices still need to go the regular RAID56 full > > stripe scheme. > > > > Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs > > RAID56 chunk, but without bitmap/journal. > > > > > Because it would seem > > > that if your data isn't necessarily in full stripes, then a bitmap might > > > not work so well since it just marks a range of full stripes as > > > "possibly dirty, we were writing to them, do a parity resync to make > > > sure". > > > > For the resync part is where btrfs shines, as the extra csum (for the > > untouched part) and metadata COW ensures us only see the old untouched > > data, and with the extra csum, we can safely rebuild the full stripe. > > > > Thus as long as no device is missing, a write-intent-bitmap is enough to > > address the write hole in btrfs (at least for COW protected data and all > > metadata). > > > > > > > > In any case, Wols is right, probably want to ping Neil on this. Might > > > need to ping him directly though. Not sure he'll see it just on the > > > list. > > > > > > > Adding Neil into this thread. Any clue on the existing > > md_bitmap_startwrite() behavior? > > md_bitmap_startwrite() is used to tell the bitmap code that the raid > module is about to start writing at a location. This may result in > md_bitmap_file_set_bit() being called to set a bit in the in-memory copy > of the bitmap, and to make that page of the bitmap as BITMAP_PAGE_DIRTY. > > Before raid actually submits the writes to the device it will call > md_bitmap_unplug() which will submit the writes and wait for them to > complete. > > The is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING" > which says a few things about how raid5 ensure things happen in the > right order. > > However I don't think if any sort of bitmap can solve the write-hole > problem for RAID5 - even in btrfs. 
> > The problem is that if the host crashes while the array is degraded and > while some write requests were in-flight, then you might have lost data. > i.e. to update a block you must write both that block and the parity > block. If you actually wrote neither or both, everything is fine. If > you wrote one but not the other then you CANNOT recover the data that > was on the missing device (there must be a missing device as the array > is degraded). Even having checksums of everything is not enough to > recover that missing block. > > You must either: > 1/ have a safe duplicate of the blocks being written, so they can be > recovered and re-written after a crash. This is what journalling > does. Or > 2/ Only write to location which don't contain valid data. i.e. always > write full stripes to locations which are unused on each device. > This way you cannot lose existing data. Worst case: that whole > stripe is ignored. This is how I would handle RAID5 in a > copy-on-write filesystem. Thanks Neil for explaining this. I was about to say the same idea, but couldn't phrase it well. md raid5 suffers from the write hole because the mapping from array-LBA to component-LBA is fixed. As a result, we have to update the data in place. btrfs already has a file-to-LBA mapping, so it shouldn't be too expensive to make btrfs free of the write hole (no need to maintain an extra mapping or add journaling). Thanks, Song > > However, I see you wrote: > > Thus as long as no device is missing, a write-intent-bitmap is enough to > > address the write hole in btrfs (at least for COW protected data and all > > metadata). > > That doesn't make sense. If no device is missing, then there is no > write hole. > If no device is missing, all you need to do is recalculate the parity > blocks on any stripe that was recently written. In md with use the > write-intent-bitmap. 
In btrfs I would expect that you would already > have some way of knowing where recent writes happened, so you can > validiate the various checksums. That should be sufficient to > recalculate the parity. I've be very surprised if btrfs doesn't already > do this. > > So I'm somewhat confuses as to what your real goal is. > > NeilBrown ^ permalink raw reply [flat|nested] 14+ messages in thread
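Song's "fixed mapping from array-LBA to component-LBA" is pure arithmetic, which is exactly why md must update data in place. A sketch of such a mapping (Python; this rotation scheme is illustrative only and deliberately simplified, not md's default left-symmetric layout):

```python
# A fixed RAID5-style chunk mapping: the logical chunk number alone
# determines which device and stripe hold the data and which device
# holds parity. No allocator is involved, so updates are in-place.
# The rotation used here is an illustrative simplification.

def raid5_map(logical_chunk, nr_disks):
    """Map a logical chunk number to (data_disk, stripe_nr, parity_disk)."""
    data_per_stripe = nr_disks - 1
    stripe = logical_chunk // data_per_stripe
    idx = logical_chunk % data_per_stripe
    # Parity rotates across devices from stripe to stripe.
    parity_disk = (nr_disks - 1) - (stripe % nr_disks)
    # Data chunks fill the non-parity slots in order.
    data_disk = idx if idx < parity_disk else idx + 1
    return data_disk, stripe, parity_disk
```

Because the function is deterministic, rewriting logical chunk N always lands on the same device blocks as before, and that in-place overwrite of data plus parity is the window the write hole lives in; a CoW filesystem's extra indirection is what would let new stripes land elsewhere.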
* Re: About the md-bitmap behavior 2022-06-22 23:00 ` Song Liu @ 2022-06-23 0:53 ` Qu Wenruo 0 siblings, 0 replies; 14+ messages in thread From: Qu Wenruo @ 2022-06-23 0:53 UTC (permalink / raw) To: Song Liu, NeilBrown; +Cc: Doug Ledford, Wols Lists, linux-raid, linux-block On 2022/6/23 07:00, Song Liu wrote: > On Wed, Jun 22, 2022 at 3:33 PM NeilBrown <neilb@suse.de> wrote: >> >> On Wed, 22 Jun 2022, Qu Wenruo wrote: >>> >>> On 2022/6/22 10:15, Doug Ledford wrote: >>>> On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: >>>>> On 20/06/2022 08:56, Qu Wenruo wrote: >>>>>>> The write-hole has been addressed with journaling already, and >>>>>>> this will >>>>>>> be adding a new and not-needed feature - not saying it wouldn't be >>>>>>> nice >>>>>>> to have, but do we need another way to skin this cat? >>>>>> >>>>>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a >>>>>> completely different thing. >>>>>> >>>>>> Here I'm just trying to understand how the md-bitmap works, so that >>>>>> I >>>>>> can do a proper bitmap for btrfs RAID56. >>>>> >>>>> Ah. Okay. >>>>> >>>>> Neil Brown is likely to be the best help here as I believe he wrote a >>>>> lot of the code, although I don't think he's much involved with md- >>>>> raid >>>>> any more. >>>> >>>> I can't speak to how it is today, but I know it was *designed* to be >>>> sync flush of the dirty bit setting, then lazy, async write out of the >>>> clear bits. But, yes, in order for the design to be reliable, you must >>>> flush out the dirty bits before you put writes in flight. >>> >>> Thank you very much confirming my concern. >>> >>> So maybe it's me not checking the md-bitmap code carefully enough to >>> expose the full picture. >>> >>>> >>>> One thing I'm not sure about though, is that MD RAID5/6 uses fixed >>>> stripes. I thought btrfs, since it was an allocation filesystem, didn't >>>> have to use full stripes? Am I wrong about that? 
>>> >>> Unfortunately, we only go allocation for the RAID56 chunks. In side a >>> RAID56 the underlying devices still need to go the regular RAID56 full >>> stripe scheme. >>> >>> Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs >>> RAID56 chunk, but without bitmap/journal. >>> >>>> Because it would seem >>>> that if your data isn't necessarily in full stripes, then a bitmap might >>>> not work so well since it just marks a range of full stripes as >>>> "possibly dirty, we were writing to them, do a parity resync to make >>>> sure". >>> >>> For the resync part is where btrfs shines, as the extra csum (for the >>> untouched part) and metadata COW ensures us only see the old untouched >>> data, and with the extra csum, we can safely rebuild the full stripe. >>> >>> Thus as long as no device is missing, a write-intent-bitmap is enough to >>> address the write hole in btrfs (at least for COW protected data and all >>> metadata). >>> >>>> >>>> In any case, Wols is right, probably want to ping Neil on this. Might >>>> need to ping him directly though. Not sure he'll see it just on the >>>> list. >>>> >>> >>> Adding Neil into this thread. Any clue on the existing >>> md_bitmap_startwrite() behavior? >> >> md_bitmap_startwrite() is used to tell the bitmap code that the raid >> module is about to start writing at a location. This may result in >> md_bitmap_file_set_bit() being called to set a bit in the in-memory copy >> of the bitmap, and to make that page of the bitmap as BITMAP_PAGE_DIRTY. >> >> Before raid actually submits the writes to the device it will call >> md_bitmap_unplug() which will submit the writes and wait for them to >> complete. >> >> The is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING" >> which says a few things about how raid5 ensure things happen in the >> right order. >> >> However I don't think if any sort of bitmap can solve the write-hole >> problem for RAID5 - even in btrfs. 
>> >> The problem is that if the host crashes while the array is degraded and >> while some write requests were in-flight, then you might have lost data. >> i.e. to update a block you must write both that block and the parity >> block. If you actually wrote neither or both, everything is fine. If >> you wrote one but not the other then you CANNOT recover the data that >> was on the missing device (there must be a missing device as the array >> is degraded). Even having checksums of everything is not enough to >> recover that missing block. >> >> You must either: >> 1/ have a safe duplicate of the blocks being written, so they can be >> recovered and re-written after a crash. This is what journalling >> does. Or >> 2/ Only write to location which don't contain valid data. i.e. always >> write full stripes to locations which are unused on each device. >> This way you cannot lose existing data. Worst case: that whole >> stripe is ignored. This is how I would handle RAID5 in a >> copy-on-write filesystem. > > Thanks Neil for explaining this. I was about to say the same idea, but > couldn't phrase it well. > > md raid5 suffers from write hole because the mapping from array-LBA to > component-LBA is fixed. In fact, inside one btrfs RAID56 chunk, it's the same fixed logical->physical mapping. Thus we still have the problem. > As a result, we have to update the data in place. > btrfs already has file-to-LBA mapping, so it shouldn't be too expensive to > make btrfs free of write hole. (no need for maintain extra mapping, or > add journaling). Unfortunately, btrfs is not that flexible yet. In fact, btrfs just does its mapping at a much smaller granularity. So in btrfs we have the following mapping scheme:

                              1G      2G      3G      4G
  Btrfs logical address space: | RAID1 | RAID5 | EMPTY | ...
And logical address range [1G, 2G) is mapped using RAID1, using some physical ranges from 2 devices in the pool. Logical address range [2G, 3G) is mapped using RAID5, using some physical ranges from several devices in the pool. Logical address range [3G, 4G) is not mapped; reading/writing that range would directly lead to -EIO. By this, you can see, btrfs is not as flexible as you think. Yes, we have a file -> logical address mapping, but inside each mapped logical address range, everything is still a fixed mapping. If we really want to make the extent allocator (which currently works at the logical address level, not caring about the underlying mapping at all) avoid partial stripe writes, it's a lot of cross-layer work. In fact, Johannes is working on an extra layer of mapping for RAID56; with that it may be possible to do extra mapping to avoid partial writes. But that requires a lot of work, and may not even work for metadata. Thus I'm still exploring the tried-and-true methods like a write-intent bitmap and a journal for btrfs RAID56. Thanks, Qu > > Thanks, > Song > >> >> However, I see you wrote: >>> Thus as long as no device is missing, a write-intent-bitmap is enough to >>> address the write hole in btrfs (at least for COW protected data and all >>> metadata). >> >> That doesn't make sense. If no device is missing, then there is no >> write hole. >> If no device is missing, all you need to do is recalculate the parity >> blocks on any stripe that was recently written. In md with use the >> write-intent-bitmap. In btrfs I would expect that you would already >> have some way of knowing where recent writes happened, so you can >> validiate the various checksums. That should be sufficient to >> recalculate the parity. I've be very surprised if btrfs doesn't already >> do this. >> >> So I'm somewhat confuses as to what your real goal is. >> >> NeilBrown ^ permalink raw reply [flat|nested] 14+ messages in thread
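The chunk scheme Qu describes is essentially a range lookup over the logical address space, with unmapped ranges failing outright. A toy model (Python; in real btrfs the chunk items live in the chunk tree and map to per-device physical extents, which this sketch collapses into a profile name and an in-chunk offset):

```python
# Toy model of the btrfs chunk mapping: the logical address space is
# covered by chunks, each with its own RAID profile; a lookup outside
# any chunk fails, mirroring the -EIO case above.

GiB = 1024 ** 3

# (logical_start, logical_end, profile) -- ranges from the example above
chunk_map = [
    (1 * GiB, 2 * GiB, "RAID1"),
    (2 * GiB, 3 * GiB, "RAID5"),
]

def map_logical(addr):
    for start, end, profile in chunk_map:
        if start <= addr < end:
            return profile, addr - start   # offset inside the chunk
    raise IOError("unmapped logical address")  # the -EIO case

print(map_logical(2 * GiB + 42))   # prints: ('RAID5', 42)
```

Inside each chunk the offset-to-device arithmetic is then as fixed as md's, which is Qu's point: the allocator's flexibility stops at chunk granularity.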
* Re: About the md-bitmap behavior 2022-06-22 22:32 ` NeilBrown 2022-06-22 23:00 ` Song Liu @ 2022-06-23 0:39 ` Qu Wenruo 2022-06-23 3:32 ` Song Liu 1 sibling, 1 reply; 14+ messages in thread From: Qu Wenruo @ 2022-06-23 0:39 UTC (permalink / raw) To: NeilBrown; +Cc: Doug Ledford, Wols Lists, linux-raid, linux-block On 2022/6/23 06:32, NeilBrown wrote: > On Wed, 22 Jun 2022, Qu Wenruo wrote: >> >> On 2022/6/22 10:15, Doug Ledford wrote: >>> On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote: >>>> On 20/06/2022 08:56, Qu Wenruo wrote: >>>>>> The write-hole has been addressed with journaling already, and >>>>>> this will >>>>>> be adding a new and not-needed feature - not saying it wouldn't be >>>>>> nice >>>>>> to have, but do we need another way to skin this cat? >>>>> >>>>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a >>>>> completely different thing. >>>>> >>>>> Here I'm just trying to understand how the md-bitmap works, so that >>>>> I >>>>> can do a proper bitmap for btrfs RAID56. >>>> >>>> Ah. Okay. >>>> >>>> Neil Brown is likely to be the best help here as I believe he wrote a >>>> lot of the code, although I don't think he's much involved with md- >>>> raid >>>> any more. >>> >>> I can't speak to how it is today, but I know it was *designed* to be >>> sync flush of the dirty bit setting, then lazy, async write out of the >>> clear bits. But, yes, in order for the design to be reliable, you must >>> flush out the dirty bits before you put writes in flight. >> >> Thank you very much confirming my concern. >> >> So maybe it's me not checking the md-bitmap code carefully enough to >> expose the full picture. >> >>> >>> One thing I'm not sure about though, is that MD RAID5/6 uses fixed >>> stripes. I thought btrfs, since it was an allocation filesystem, didn't >>> have to use full stripes? Am I wrong about that? >> >> Unfortunately, we only go allocation for the RAID56 chunks. 
In side a >> RAID56 the underlying devices still need to go the regular RAID56 full >> stripe scheme. >> >> Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs >> RAID56 chunk, but without bitmap/journal. >> >>> Because it would seem >>> that if your data isn't necessarily in full stripes, then a bitmap might >>> not work so well since it just marks a range of full stripes as >>> "possibly dirty, we were writing to them, do a parity resync to make >>> sure". >> >> For the resync part is where btrfs shines, as the extra csum (for the >> untouched part) and metadata COW ensures us only see the old untouched >> data, and with the extra csum, we can safely rebuild the full stripe. >> >> Thus as long as no device is missing, a write-intent-bitmap is enough to >> address the write hole in btrfs (at least for COW protected data and all >> metadata). >> >>> >>> In any case, Wols is right, probably want to ping Neil on this. Might >>> need to ping him directly though. Not sure he'll see it just on the >>> list. >>> >> >> Adding Neil into this thread. Any clue on the existing >> md_bitmap_startwrite() behavior? > > md_bitmap_startwrite() is used to tell the bitmap code that the raid > module is about to start writing at a location. This may result in > md_bitmap_file_set_bit() being called to set a bit in the in-memory copy > of the bitmap, and to make that page of the bitmap as BITMAP_PAGE_DIRTY. > > Before raid actually submits the writes to the device it will call > md_bitmap_unplug() which will submit the writes and wait for them to > complete. Ah, that's the missing piece, thank you very much for pointing this out. Looks like I'm not familiar with that unplug part at all. Great to learn something new. > > The is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING" > which says a few things about how raid5 ensure things happen in the > right order. 
>
> However I don't think any sort of bitmap can solve the write-hole
> problem for RAID5 - even in btrfs.
>
> The problem is that if the host crashes while the array is degraded
> and while some write requests were in-flight, then you might have lost
> data. i.e. to update a block you must write both that block and the
> parity block. If you actually wrote neither or both, everything is
> fine. If you wrote one but not the other then you CANNOT recover the
> data that was on the missing device (there must be a missing device as
> the array is degraded). Even having checksums of everything is not
> enough to recover that missing block.

However btrfs also has COW, which ensures that after a crash we will
only try to read the old data (aka, the untouched part).

E.g. btrfs uses 64KiB as the stripe size.
O = Old data
N = New writes

    0      32K     64K
D1  |OOOOOOO|NNNNNNN|
D2  |NNNNNNN|OOOOOOO|
P   |NNNNNNN|NNNNNNN|

In the above case, no matter whether the new writes reach the disks, as
long as the crash happens before we update all the metadata and the
superblock (which implies a flush for all involved devices), the fs
will only try to read the old data.

So at this point, our read of the old data is still correct.
But the parity no longer matches, thus degrading our ability to
tolerate device loss.

With a write-intent bitmap, we know this full stripe has something out
of sync, so we can re-calculate the parity.

Although, all the above needs two things:

- The new write is CoWed.
  It's mandatory for btrfs metadata, so no problem. But for btrfs data
  we can have NODATACOW (which also implies NODATASUM), and in that
  case corruption will be unavoidable.

- The old data should never be changed.
  This means the device can not disappear during the recovery.
  If powerloss + device missing happens, this will not work at all.

>
> You must either:
> 1/ have a safe duplicate of the blocks being written, so they can be
>    recovered and re-written after a crash. This is what journalling
>    does.
>    Or

Yes, a journal would be the next step, to handle the NODATACOW case and
the device-missing case.

> 2/ Only write to locations which don't contain valid data. i.e.
>    always write full stripes to locations which are unused on each
>    device. This way you cannot lose existing data. Worst case: that
>    whole stripe is ignored. This is how I would handle RAID5 in a
>    copy-on-write filesystem.

That is something we considered in the past, but considering even now
we still have space reservation problems sometimes, I suspect such a
change would cause even more problems than it solves.

>
> However, I see you wrote:
>> Thus as long as no device is missing, a write-intent-bitmap is
>> enough to address the write hole in btrfs (at least for COW
>> protected data and all metadata).
>
> That doesn't make sense. If no device is missing, then there is no
> write hole.
> If no device is missing, all you need to do is recalculate the parity
> blocks on any stripe that was recently written.

That's exactly what we need and want to do.

> In md we use the write-intent-bitmap. In btrfs I would expect that
> you would already have some way of knowing where recent writes
> happened, so you can validate the various checksums.
> That should be sufficient to recalculate the parity. I'd be very
> surprised if btrfs doesn't already do this.

That's the problem: we previously relied completely on COW, thus there
is no facility like a write-intent-bitmap at all.

After a powerloss, btrfs knows nothing about the previous crash, and
relies completely on csum + COW + duplication to correct any error at
read time.

That's completely fine for RAID1-based profiles, but not for RAID56.

>
> So I'm somewhat confused as to what your real goal is.

Yep, the btrfs RAID56 is missing something very basic, thus I guess
that's causing the confusion.

Thanks,
Qu

>
> NeilBrown
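The set-bit-first ordering discussed above (md_bitmap_startwrite() dirties the in-memory bitmap, md_bitmap_unplug() persists it, and only then do the data writes go out) can be sketched as a toy model. The class and method names below are illustrative stand-ins, not the real md API:

```python
# Toy model of the md write-intent ordering: set the bit in memory,
# flush the dirty bitmap page to disk, and only then submit the data
# write.  Bits are cleared lazily after the write completes.

class WriteIntentBitmap:
    def __init__(self):
        self.mem_bits = set()    # in-memory bitmap (dirty pages)
        self.disk_bits = set()   # what has actually reached disk
        self.log = []            # ordered record of I/O events

    def startwrite(self, stripe):
        # Analogue of md_bitmap_startwrite(): memory only, no I/O yet.
        self.mem_bits.add(stripe)

    def unplug(self):
        # Analogue of md_bitmap_unplug(): write out dirty bitmap pages
        # and wait for completion before any data write is submitted.
        self.log.append(("bitmap_flush", frozenset(self.mem_bits)))
        self.disk_bits = set(self.mem_bits)

    def submit_data_write(self, stripe):
        # Safe only once the intent bit is persisted.
        assert stripe in self.disk_bits, "bit must be on disk first"
        self.log.append(("data_write", stripe))

    def endwrite(self, stripe):
        # Lazy, asynchronous clearing of the bit after the write.
        self.mem_bits.discard(stripe)

bm = WriteIntentBitmap()
bm.startwrite(7)
bm.unplug()              # persist the intent *before* the data write
bm.submit_data_write(7)
bm.endwrite(7)
```

After a crash, any stripe whose bit is still set on disk is treated as possibly out of sync and gets its parity recomputed.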
* Re: About the md-bitmap behavior
  2022-06-23  0:39 ` Qu Wenruo
@ 2022-06-23  3:32   ` Song Liu
  2022-06-23  4:52     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread

From: Song Liu @ 2022-06-23 3:32 UTC (permalink / raw)
To: Qu Wenruo; +Cc: NeilBrown, Doug Ledford, Wols Lists, linux-raid, linux-block

On Wed, Jun 22, 2022 at 5:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
[...]
> E.g. btrfs uses 64KiB as the stripe size.
> O = Old data
> N = New writes
>
>     0      32K     64K
> D1  |OOOOOOO|NNNNNNN|
> D2  |NNNNNNN|OOOOOOO|
> P   |NNNNNNN|NNNNNNN|
>
> In the above case, no matter whether the new writes reach the disks,
> as long as the crash happens before we update all the metadata and
> the superblock (which implies a flush for all involved devices), the
> fs will only try to read the old data.

I guess we are using "write hole" for different scenarios. I use "write
hole" for the case where we corrupt data that is not being written to.
This happens with the combination of a failed drive and a power loss.
For example, we have raid5 with 3 drives. Each stripe has two data and
one parity. When D1 has failed, a read of D1 is calculated based on D2
and P; and a write to D1 requires updating D2 and P at the same time.
Now imagine we lose power (or crash) while writing to D2 (and P). When
the system comes back after reboot, D2 and P are out of sync. Now we
have lost both D2 and D1. Note that D1 was not being written to before
the power loss.

For btrfs, maybe we can avoid the write hole by NOT writing to D2 when
D1 contains valid data (and the drive has failed). Instead, we can
write a new version of D1 and D2 to a different stripe. If we lose
power during the write, the old data is not corrupted. Does this make
sense? I am not sure whether it is practical in btrfs though.

> So at this point, our read of the old data is still correct.
> But the parity no longer matches, thus degrading our ability to
> tolerate device loss.
>
> With a write-intent bitmap, we know this full stripe has something
> out of sync, so we can re-calculate the parity.
>
> Although, all the above needs two things:
>
> - The new write is CoWed.
>   It's mandatory for btrfs metadata, so no problem. But for btrfs
>   data we can have NODATACOW (which also implies NODATASUM), and in
>   that case corruption will be unavoidable.
>
> - The old data should never be changed.
>   This means the device can not disappear during the recovery.
>   If powerloss + device missing happens, this will not work at all.
>
>>
>> You must either:
>> 1/ have a safe duplicate of the blocks being written, so they can
>>    be recovered and re-written after a crash. This is what
>>    journalling does. Or
>
> Yes, a journal would be the next step, to handle the NODATACOW case
> and the device-missing case.
>
>> 2/ Only write to locations which don't contain valid data. i.e.
>>    always write full stripes to locations which are unused on each
>>    device. This way you cannot lose existing data. Worst case:
>>    that whole stripe is ignored. This is how I would handle RAID5
>>    in a copy-on-write filesystem.
>
> That is something we considered in the past, but considering even now
> we still have space reservation problems sometimes, I suspect such a
> change would cause even more problems than it solves.
>
>>
>> However, I see you wrote:
>>> Thus as long as no device is missing, a write-intent-bitmap is
>>> enough to address the write hole in btrfs (at least for COW
>>> protected data and all metadata).
>>
>> That doesn't make sense. If no device is missing, then there is no
>> write hole.
>> If no device is missing, all you need to do is recalculate the
>> parity blocks on any stripe that was recently written.
>
> That's exactly what we need and want to do.

I guess the goal is to find some files after a crash/power loss. Can we
achieve this with the file mtime? (Sorry if this is a stupid
question...)
Thanks,
Song
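The degraded-array scenario Song describes reduces to simple XOR arithmetic, which a few lines can demonstrate. This is an illustrative sketch using small integers as stand-ins for disk blocks:

```python
# With D1 failed, its contents are only recoverable as D2 ^ P.  If a
# crash leaves the new D2 on disk but the matching parity update lost,
# the reconstruction of D1 is garbage, even though D1 itself was never
# being written.

d1_old = 0b1010
d2_old = 0b0110
p = d1_old ^ d2_old          # parity of the healthy stripe

# D1 fails; reads of D1 are served by reconstruction, which still works:
assert d2_old ^ p == d1_old

# Power loss while rewriting D2: new D2 reaches disk, new P does not.
d2_new = 0b1111
d1_reconstructed = d2_new ^ p   # new data combined with stale parity
assert d1_reconstructed != d1_old   # D1 is silently lost
```

This is why the combination of a missing device and a torn stripe update cannot be fixed by checksums alone: the information needed to rebuild D1 simply no longer exists on the remaining disks.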
* Re: About the md-bitmap behavior
  2022-06-23  3:32 ` Song Liu
@ 2022-06-23  4:52   ` Qu Wenruo
  2022-06-24  0:55     ` Jani Partanen
  0 siblings, 1 reply; 14+ messages in thread

From: Qu Wenruo @ 2022-06-23 4:52 UTC (permalink / raw)
To: Song Liu
Cc: NeilBrown, Doug Ledford, Wols Lists, linux-raid, linux-block

On 2022/6/23 11:32, Song Liu wrote:
> On Wed, Jun 22, 2022 at 5:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
> [...]
>> E.g. btrfs uses 64KiB as the stripe size.
>> O = Old data
>> N = New writes
>>
>>     0      32K     64K
>> D1  |OOOOOOO|NNNNNNN|
>> D2  |NNNNNNN|OOOOOOO|
>> P   |NNNNNNN|NNNNNNN|
>>
>> In the above case, no matter whether the new writes reach the disks,
>> as long as the crash happens before we update all the metadata and
>> the superblock (which implies a flush for all involved devices), the
>> fs will only try to read the old data.
>
> I guess we are using "write hole" for different scenarios. I use
> "write hole" for the case where we corrupt data that is not being
> written to. This happens with the combination of a failed drive and a
> power loss. For example, we have raid5 with 3 drives. Each stripe has
> two data and one parity. When D1 has failed, a read of D1 is
> calculated based on D2 and P; and a write to D1 requires updating D2
> and P at the same time. Now imagine we lose power (or crash) while
> writing to D2 (and P). When the system comes back after reboot, D2
> and P are out of sync. Now we have lost both D2 and D1. Note that D1
> was not being written to before the power loss.

For that powerloss + device loss case, a journal is the only way to go,
unless we do extra work to avoid partial writes.

>
> For btrfs, maybe we can avoid the write hole by NOT writing to D2
> when D1 contains valid data (and the drive has failed). Instead, we
> can write a new version of D1 and D2 to a different stripe. If we
> lose power during the write, the old data is not corrupted. Does this
> make sense? I am not sure whether it is practical in btrfs though.
That makes sense, but it also means the extent allocator needs extra
info, not just which space is available.

And that would make ENOSPC handling even more challenging: what if we
have no space left but only partially written stripes?

There are some ideas, like an extra layer for RAID56 to do extra
mapping between logical and physical addresses, but I'm not yet
confident whether we will see new (and even more complex) challenges
going down that path.

>
>>
>> So at this point, our read of the old data is still correct.
>> But the parity no longer matches, thus degrading our ability to
>> tolerate device loss.
>>
>> With a write-intent bitmap, we know this full stripe has something
>> out of sync, so we can re-calculate the parity.
>>
>> Although, all the above needs two things:
>>
>> - The new write is CoWed.
>>   It's mandatory for btrfs metadata, so no problem. But for btrfs
>>   data we can have NODATACOW (which also implies NODATASUM), and in
>>   that case corruption will be unavoidable.
>>
>> - The old data should never be changed.
>>   This means the device can not disappear during the recovery.
>>   If powerloss + device missing happens, this will not work at all.
>>
>>>
>>> You must either:
>>> 1/ have a safe duplicate of the blocks being written, so they can
>>>    be recovered and re-written after a crash. This is what
>>>    journalling does. Or
>>
>> Yes, a journal would be the next step, to handle the NODATACOW case
>> and the device-missing case.
>>
>>> 2/ Only write to locations which don't contain valid data. i.e.
>>>    always write full stripes to locations which are unused on each
>>>    device. This way you cannot lose existing data. Worst case:
>>>    that whole stripe is ignored. This is how I would handle RAID5
>>>    in a copy-on-write filesystem.
>>
>> That is something we considered in the past, but considering even
>> now we still have space reservation problems sometimes, I suspect
>> such a change would cause even more problems than it solves.
>>
>>>
>>> However, I see you wrote:
>>>> Thus as long as no device is missing, a write-intent-bitmap is
>>>> enough to address the write hole in btrfs (at least for COW
>>>> protected data and all metadata).
>>>
>>> That doesn't make sense. If no device is missing, then there is no
>>> write hole.
>>> If no device is missing, all you need to do is recalculate the
>>> parity blocks on any stripe that was recently written.
>>
>> That's exactly what we need and want to do.
>
> I guess the goal is to find some files after a crash/power loss. Can
> we achieve this with the file mtime? (Sorry if this is a stupid
> question...)

There are two problems here:

1. After a power loss, we won't see the mtime update at all.
   As the mtime update is protected by metadata CoW, and the powerloss
   happens while the current transaction is not yet committed, we will
   only see the old metadata after recovery.

   Thus at the next mount, btrfs can not see the new mtime at all.
   AKA, everything CoWed is updated atomically; we can only see the
   last committed transaction.

   Although there is something special like the log tree for fsync(),
   it has its own limitations (it can not work across a transaction
   boundary), thus it is still not suitable for things like random
   data writes.

2. It will not work for metadata, unless we scrub the whole metadata
   at recovery time.

The core problem here is granularity. The target file can be TiB in
size, or it can be the whole metadata. Scrubbing such a large range
before allowing the user to do any writes would lead to super unhappy
end users.

So for now, as a (kinda) quick solution, I'd like to go with a
write-intent bitmap first, then a journal, just like md-raid.

Thanks,
Qu

>
> Thanks,
> Song
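The transaction-commit behavior behind point 1 above can be sketched as a tiny copy-on-write model. All names here are hypothetical illustrations, not btrfs code; the point is only that updates land in new blocks and become visible solely through an atomic superblock flip at commit:

```python
# Minimal CoW model: an update writes a *new* block and only becomes
# visible when the superblock is atomically pointed at the new root
# during transaction commit.  A crash before commit leaves readers on
# the old tree, so the new mtime is never observable.

class CowFs:
    def __init__(self):
        self.blocks = {0: {"mtime": 100}}   # block id -> tree content
        self.superblock = 0                 # root of the committed tree
        self.pending = 0

    def update_mtime(self, t):
        # CoW: allocate a fresh block; the committed tree is untouched.
        new_id = max(self.blocks) + 1
        self.blocks[new_id] = {"mtime": t}
        self.pending = new_id

    def commit(self):
        # Flush everything, then atomically flip the superblock.
        self.superblock = self.pending

    def visible_after_crash(self):
        # A crash discards anything not reachable from the superblock.
        return self.blocks[self.superblock]

fs = CowFs()
fs.update_mtime(200)
# Power loss here (before commit): the new mtime is invisible.
assert fs.visible_after_crash()["mtime"] == 100
fs.commit()
assert fs.visible_after_crash()["mtime"] == 200
```

This is why an mtime-based scan cannot find the stripes touched by an uncommitted transaction: the very writes that need re-syncing are exactly the ones the committed metadata does not yet describe.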
* Re: About the md-bitmap behavior
  2022-06-23  4:52 ` Qu Wenruo
@ 2022-06-24  0:55   ` Jani Partanen
  2022-06-24  1:35     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread

From: Jani Partanen @ 2022-06-24 0:55 UTC (permalink / raw)
To: Qu Wenruo, Song Liu
Cc: NeilBrown, Doug Ledford, Wols Lists, linux-raid, linux-block

On 23/06/2022 7.52, Qu Wenruo wrote:
> That makes sense, but it also means the extent allocator needs extra
> info, not just which space is available.
>
> And that would make ENOSPC handling even more challenging: what if we
> have no space left but only partially written stripes?
>
> There are some ideas, like an extra layer for RAID56 to do extra
> mapping between logical and physical addresses, but I'm not yet
> confident whether we will see new (and even more complex) challenges
> going down that path.

Isn't there already a system in place in btrfs for the ENOSPC
situation? You just add some space temporarily? That's what I remember
from when I was playing around with different situations with btrfs.

For me the bigger issue with btrfs raid56 is the fact that scrub is
very, very slow. That should be one of the top priorities to solve.

// JiiPee
* Re: About the md-bitmap behavior
  2022-06-24  0:55 ` Jani Partanen
@ 2022-06-24  1:35   ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread

From: Qu Wenruo @ 2022-06-24 1:35 UTC (permalink / raw)
To: Jani Partanen, Song Liu
Cc: NeilBrown, Doug Ledford, Wols Lists, linux-raid, linux-block

On 2022/6/24 08:55, Jani Partanen wrote:
>
>
> On 23/06/2022 7.52, Qu Wenruo wrote:
>> That makes sense, but it also means the extent allocator needs extra
>> info, not just which space is available.
>>
>> And that would make ENOSPC handling even more challenging: what if
>> we have no space left but only partially written stripes?
>>
>> There are some ideas, like an extra layer for RAID56 to do extra
>> mapping between logical and physical addresses, but I'm not yet
>> confident whether we will see new (and even more complex) challenges
>> going down that path.
>
> Isn't there already a system in place in btrfs for the ENOSPC
> situation?

It's not perfect; we still get reports of ENOSPC at the most
inconvenient times.

> You just add some space temporarily?

That is not a valid solution, not even a valid workaround.

> That's what I remember from when I was playing around with different
> situations with btrfs.

The situation has improved a lot recently, but it's still far from the
level of a write-in-place fs.

> For me the bigger issue with btrfs raid56 is the fact that scrub is
> very, very slow. That should be one of the top priorities to solve.

In fact, that's caused by the way we do scrub.

We start a scrub for each device; this is mostly fine for regular
profiles, but a big no-no for RAID56.

Since scrubbing a full stripe will read all the data/P/Q from disk, it
makes no sense to scrub the same full stripe multiple times.

That would be a target during the write-intent bitmap work, as we will
rely on scrub to re-sync the data at recovery time.

After that, I'll try to create a better, RAID56-friendly interface for
the scrub ioctl.
Thanks,
Qu

>
> // JiiPee
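The read amplification of per-device scrub on RAID56 can be illustrated with a back-of-the-envelope count. This is a rough model under the assumption Qu describes (each device's scrub independently reads the entire full stripe to verify its own strip), with made-up function names:

```python
# Rough model: if each of n_devices runs its own scrub, and verifying
# one strip requires reading the whole full stripe (data + P + Q),
# every stripe is read n_devices times over.  A stripe-oriented scrub
# reads each strip exactly once.

def per_device_scrub_reads(n_devices, n_stripes):
    # n_devices scrubs, each reading n_devices strips per stripe
    return n_devices * n_devices * n_stripes

def full_stripe_scrub_reads(n_devices, n_stripes):
    return n_devices * n_stripes

# A 3-device RAID5 with 10 full stripes:
assert per_device_scrub_reads(3, 10) == 90    # 3x the necessary I/O
assert full_stripe_scrub_reads(3, 10) == 30
```

The amplification factor equals the device count, which matches the observation that per-device scrub gets disproportionately slow on wide RAID56 arrays.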
end of thread, other threads: [~2022-06-24  1:36 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-20  7:29 About the md-bitmap behavior Qu Wenruo
2022-06-20  7:48 ` Wols Lists
2022-06-20  7:56 ` Qu Wenruo
2022-06-20  9:56 ` Wols Lists
2022-06-22  2:15 ` Doug Ledford
2022-06-22  2:37 ` Qu Wenruo
2022-06-22 22:32 ` NeilBrown
2022-06-22 23:00 ` Song Liu
2022-06-23  0:53 ` Qu Wenruo
2022-06-23  0:39 ` Qu Wenruo
2022-06-23  3:32 ` Song Liu
2022-06-23  4:52 ` Qu Wenruo
2022-06-24  0:55 ` Jani Partanen
2022-06-24  1:35 ` Qu Wenruo