On Wed, 13 Jul 2022 14:01:32 +0000 Johannes Thumshirn wrote:

> On 13.07.22 15:47, Qu Wenruo wrote:
> >
> >
> > On 2022/7/13 20:42, Johannes Thumshirn wrote:
> >> On 13.07.22 14:01, Qu Wenruo wrote:
> >>>
> >>>
> >>> On 2022/7/13 19:43, Johannes Thumshirn wrote:
> >>>> On 13.07.22 12:54, Qu Wenruo wrote:
> >>>>>
> >>>>>
> >>>>> On 2022/5/16 22:31, Johannes Thumshirn wrote:
> >>>>>> Introduce a raid-stripe-tree to record writes in a RAID environment.
> >>>>>>
> >>>>>> In essence this adds another address translation layer between the logical
> >>>>>> and the physical addresses in btrfs and is designed to close two gaps. The
> >>>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
> >>>>>> second one is the inability of doing RAID with zoned block devices due to the
> >>>>>> constraints we have with REQ_OP_ZONE_APPEND writes.
> >>>>>
> >>>>> Here I want to discuss about something related to RAID56 and RST.
> >>>>>
> >>>>> One of my long existing concern is, P/Q stripes have a higher update
> >>>>> frequency, thus with certain transaction commit/data writeback timing,
> >>>>> wouldn't it cause the device storing P/Q stripes go out of space before
> >>>>> the data stripe devices?
> >>>>
> >>>> P/Q stripes on a dedicated drive would be RAID4, which we don't have.
> >>>
> >>> I'm just using one block group as an example.
> >>>
> >>> Sure, the next bg can definitely go somewhere else.
> >>>
> >>> But inside one bg, we are still using one zone for the bg, right?
> >>
> >> Ok maybe I'm not understanding the code in volumes.c correctly, but
> >> doesn't __btrfs_map_block() calculate a rotation per stripe-set?
> >>
> >> I'm looking at this code:
> >>
> >> 	/* Build raid_map */
> >> 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
> >> 	    (need_full_stripe(op) || mirror_num > 1)) {
> >> 		u64 tmp;
> >> 		unsigned rot;
> >>
> >> 		/* Work out the disk rotation on this stripe-set */
> >> 		div_u64_rem(stripe_nr, num_stripes, &rot);
> >>
> >> 		/* Fill in the logical address of each stripe */
> >> 		tmp = stripe_nr * data_stripes;
> >> 		for (i = 0; i < data_stripes; i++)
> >> 			bioc->raid_map[(i + rot) % num_stripes] =
> >> 				em->start + (tmp + i) * map->stripe_len;
> >>
> >> 		bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
> >> 		if (map->type & BTRFS_BLOCK_GROUP_RAID6)
> >> 			bioc->raid_map[(i + rot + 1) % num_stripes] =
> >> 				RAID6_Q_STRIPE;
> >>
> >> 		sort_parity_stripes(bioc, num_stripes);
> >> 	}
> >
> > That's per full-stripe. AKA, the rotation only kicks in after a full stripe.
> >
> > In my example, we're inside one full stripe, no rotation, until next
> > full stripe.
> >
>
> Ah ok, my apologies. For sub-stripe size writes, my idea was to 0-pad up to
> stripe size. Then we can do full CoW of stripes. If we have an older generation
> of a stripe, we can just override it on regular btrfs. On zoned btrfs this
> just accounts for more zone_unusable bytes and waits for the GC to kick in.

Have you considered variable stripe size? I believe ZFS does this.
Should be easy for raid5 since it's just xor, not sure for raid6.

PS: ZFS seems to do variable-_width_ stripes:
https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

Regards,
Lukas Straub

--
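
For reference, a small standalone sketch of the rotation math quoted above
from __btrfs_map_block(): rot is derived from the full-stripe number only,
so every write that lands inside the same full stripe sees the same P (and Q)
device, which is Qu's point. This is not btrfs code; the device counts below
are made up for illustration.

	/*
	 * Illustration only: show how the parity device index changes
	 * per full stripe, not per sub-stripe write.
	 */
	#include <stdio.h>

	int main(void)
	{
		/* Made-up layout: 4 devices, RAID5, i.e. 3 data stripes + P. */
		const unsigned int num_stripes = 4;
		const unsigned int data_stripes = 3;
		unsigned long long stripe_nr;

		for (stripe_nr = 0; stripe_nr < 4; stripe_nr++) {
			/* Same math as div_u64_rem(stripe_nr, num_stripes, &rot). */
			unsigned int rot = stripe_nr % num_stripes;
			/* After the data loop i == data_stripes, so P lands here. */
			unsigned int p_index = (data_stripes + rot) % num_stripes;

			printf("full stripe %llu: rot=%u, P on device index %u\n",
			       stripe_nr, rot, p_index);
		}
		return 0;
	}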
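
Likewise, a minimal sketch of why variable stripe width is cheap for raid5:
the P stripe is just the xor of however many data stripes the write actually
used, so a stripe can be closed at any width. compute_p_stripe() is a
hypothetical helper operating on plain in-memory buffers, not an existing
btrfs or ZFS function.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/*
	 * Sketch only: xor parity over a variable number of data stripes.
	 * Reconstructing a missing data stripe is the same xor over the
	 * surviving members; only raid6's second parity needs more math.
	 */
	static void compute_p_stripe(uint8_t *p, uint8_t * const *data,
				     size_t ndata, size_t stripe_len)
	{
		size_t i, j;

		/* Seed parity with the first data stripe ... */
		memcpy(p, data[0], stripe_len);

		/* ... then fold in the remaining data stripes with xor. */
		for (i = 1; i < ndata; i++)
			for (j = 0; j < stripe_len; j++)
				p[j] ^= data[i][j];
	}

E.g. a write that fills only two data stripes would call
compute_p_stripe(p_buf, data_bufs, 2, stripe_len) and close the stripe there.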