On Wed, 13 Jul 2022 14:01:32 +0000 Johannes Thumshirn wrote:

> On 13.07.22 15:47, Qu Wenruo wrote:
> >
> >
> > On 2022/7/13 20:42, Johannes Thumshirn wrote:
> >> On 13.07.22 14:01, Qu Wenruo wrote:
> >>>
> >>>
> >>> On 2022/7/13 19:43, Johannes Thumshirn wrote:
> >>>> On 13.07.22 12:54, Qu Wenruo wrote:
> >>>>>
> >>>>>
> >>>>> On 2022/5/16 22:31, Johannes Thumshirn wrote:
> >>>>>> Introduce a raid-stripe-tree to record writes in a RAID environment.
> >>>>>>
> >>>>>> In essence this adds another address translation layer between the logical
> >>>>>> and the physical addresses in btrfs and is designed to close two gaps. The
> >>>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
> >>>>>> second one is the inability of doing RAID with zoned block devices due to the
> >>>>>> constraints we have with REQ_OP_ZONE_APPEND writes.
> >>>>>
> >>>>> Here I want to discuss about something related to RAID56 and RST.
> >>>>>
> >>>>> One of my long existing concern is, P/Q stripes have a higher update
> >>>>> frequency, thus with certain transaction commit/data writeback timing,
> >>>>> wouldn't it cause the device storing P/Q stripes go out of space before
> >>>>> the data stripe devices?
> >>>>
> >>>> P/Q stripes on a dedicated drive would be RAID4, which we don't have.
> >>>
> >>> I'm just using one block group as an example.
> >>>
> >>> Sure, the next bg can definitely go somewhere else.
> >>>
> >>> But inside one bg, we are still using one zone for the bg, right?
> >>
> >> Ok maybe I'm not understanding the code in volumes.c correctly, but
> >> doesn't __btrfs_map_block() calculate a rotation per stripe-set?
> >>
> >> I'm looking at this code:
> >>
> >> 	/* Build raid_map */
> >> 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
> >> 	    (need_full_stripe(op) || mirror_num > 1)) {
> >> 		u64 tmp;
> >> 		unsigned rot;
> >>
> >> 		/* Work out the disk rotation on this stripe-set */
> >> 		div_u64_rem(stripe_nr, num_stripes, &rot);
> >>
> >> 		/* Fill in the logical address of each stripe */
> >> 		tmp = stripe_nr * data_stripes;
> >> 		for (i = 0; i < data_stripes; i++)
> >> 			bioc->raid_map[(i + rot) % num_stripes] =
> >> 				em->start + (tmp + i) * map->stripe_len;
> >>
> >> 		bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
> >> 		if (map->type & BTRFS_BLOCK_GROUP_RAID6)
> >> 			bioc->raid_map[(i + rot + 1) % num_stripes] =
> >> 				RAID6_Q_STRIPE;
> >>
> >> 		sort_parity_stripes(bioc, num_stripes);
> >> 	}
> >
> > That's per full-stripe. AKA, the rotation only kicks in after a full stripe.
> >
> > In my example, we're inside one full stripe, no rotation, until next
> > full stripe.
> >
>
> Ah ok, my apologies. For sub-stripe size writes, my idea was to 0-pad up to
> stripe size. Then we can do full CoW of stripes. If we have an older generation
> of a stripe, we can just override it on regular btrfs. On zoned btrfs this
> just accounts for more zone_unusable bytes and waits for the GC to kick in.

Have you considered variable stripe size? I believe ZFS does this.
Should be easy for raid5 since it's just xor, not sure for raid6.

PS: ZFS seems to do variable-_width_ stripes:
https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

Regards,
Lukas Straub

--
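
For reference, a small standalone sketch of the rotation math quoted above
from __btrfs_map_block(): rot is derived from the full-stripe number only,
so every write that lands inside the same full stripe sees the same P (and Q)
device, which is Qu's point. This is not btrfs code; the device counts below
are made up for illustration.

	/*
	 * Illustration only: show how the parity device index changes
	 * per full stripe, not per sub-stripe write.
	 */
	#include <stdio.h>

	int main(void)
	{
		/* Made-up layout: 4 devices, RAID5, i.e. 3 data stripes + P. */
		const unsigned int num_stripes = 4;
		const unsigned int data_stripes = 3;
		unsigned long long stripe_nr;

		for (stripe_nr = 0; stripe_nr < 4; stripe_nr++) {
			/* Same math as div_u64_rem(stripe_nr, num_stripes, &rot). */
			unsigned int rot = stripe_nr % num_stripes;
			/* After the data loop i == data_stripes, so P lands here. */
			unsigned int p_index = (data_stripes + rot) % num_stripes;

			printf("full stripe %llu: rot=%u, P on device index %u\n",
			       stripe_nr, rot, p_index);
		}
		return 0;
	}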
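
Likewise, a minimal sketch of why variable stripe width is cheap for raid5:
the P stripe is just the xor of however many data stripes the write actually
used, so a stripe can be closed at any width. compute_p_stripe() is a
hypothetical helper operating on plain in-memory buffers, not an existing
btrfs or ZFS function.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/*
	 * Sketch only: xor parity over a variable number of data stripes.
	 * Reconstructing a missing data stripe is the same xor over the
	 * surviving members; only raid6's second parity needs more math.
	 */
	static void compute_p_stripe(uint8_t *p, uint8_t * const *data,
				     size_t ndata, size_t stripe_len)
	{
		size_t i, j;

		/* Seed parity with the first data stripe ... */
		memcpy(p, data[0], stripe_len);

		/* ... then fold in the remaining data stripes with xor. */
		for (i = 1; i < ndata; i++)
			for (j = 0; j < stripe_len; j++)
				p[j] ^= data[i][j];
	}

E.g. a write that fills only two data stripes would call
compute_p_stripe(p_buf, data_bufs, 2, stripe_len) and close the stripe there.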