* Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-23  6:44 UTC
  To: Linux FS Devel, dm-devel; +Cc: linux-btrfs

Hi,

Although some out-of-date comments mention other bio_clone_*()
variants, there isn't really any bio clone variant other than
__bio_clone_fast(), which shares bi_io_vec with the source bio.

This limitation means we can't free the source bio before the cloned one.

Is there any bio_clone variant which does a deep clone, including bi_io_vec?



The background story is a little long, and it involves btrfs.

Unlike dm/md-raid, which use bio_split(), btrfs splits its I/O before
even creating a bio.
Every time btrfs wants to create a bio, it calculates the boundary
first, and thus only creates bios which never cross the various RAID
boundaries.

But this is not really the common practice; I'd like to at least change
the timing of the bio split so that it resembles dm/md RAID more closely.

That's why bio_clone is involved: there are still some corner cases
where we don't want to fail the whole large bio if only one stripe
failed (mostly for read bios, where we want to salvage as much data as
possible).

Thus the regular bio_split() + bio_chain() solution is not that good here.
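
(For reference, the pattern in question looks roughly like the
following sketch; error handling is omitted and the split-size
calculation is hypothetical:)

	/*
	 * Carve off the stripe-sized front of @bio, chain it to the
	 * remainder so the parent only completes once all children
	 * have completed, and submit the front part.  The split shares
	 * bi_io_vec with @bio, which is exactly the limitation above.
	 */
	struct bio *split;

	split = bio_split(bio, stripe_sectors, GFP_NOFS, &fs_bio_set);
	bio_chain(split, bio);
	submit_bio(split);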

Any idea why no bio_clone_slow() or bio_split_slow() is provided in the
block layer?

Or is bio_split() + bio_chain() really the only recommended solution?

Thanks,
Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Christoph Hellwig @ 2021-11-23  7:43 UTC
  To: Qu Wenruo; +Cc: Linux FS Devel, dm-devel, linux-btrfs

On Tue, Nov 23, 2021 at 02:44:32PM +0800, Qu Wenruo wrote:
> Hi,
> 
> Although some out-of-date comments mention other bio_clone_*()
> variants, there isn't really any bio clone variant other than
> __bio_clone_fast(), which shares bi_io_vec with the source bio.
> 
> This limitation means we can't free the source bio before the cloned one.
> 
> Is there any bio_clone variant which does a deep clone, including bi_io_vec?

There is no use case for that, unless the actual data changes like in
the bounce buffering code.

> That's why bio_clone is involved: there are still some corner cases
> where we don't want to fail the whole large bio if only one stripe
> failed (mostly for read bios, where we want to salvage as much data as
> possible).
> 
> Thus the regular bio_split() + bio_chain() solution is not that good here.
> 
> Any idea why no bio_clone_slow() or bio_split_slow() is provided in the
> block layer?
> 
> Or is bio_split() + bio_chain() really the only recommended solution?

You can use bio_split without bio_chain.  You just need your own
bi_end_io handler that first performs the action you want and then
contains code equivalent to __bio_chain_endio.  As a bonus you can
point bi_private to whatever you want; it does not have to be the parent
bio, just something that allows you to find it.
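
(A minimal sketch of such a handler; the btrfs_split_endio and
struct btrfs_io_ctx names are made up for illustration:)

	/* bi_private points at our own context instead of the parent bio */
	struct btrfs_io_ctx {
		struct bio *parent;
		/* per-split bookkeeping goes here */
	};

	static void btrfs_split_endio(struct bio *bio)
	{
		struct btrfs_io_ctx *ctx = bio->bi_private;
		struct bio *parent = ctx->parent;

		/* first perform the per-split action, e.g. read repair */

		/* then the equivalent of __bio_chain_endio() */
		if (bio->bi_status && !parent->bi_status)
			parent->bi_status = bio->bi_status;
		bio_put(bio);
		bio_endio(parent);
	}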

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-23  8:10 UTC
  To: Christoph Hellwig, Qu Wenruo; +Cc: Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/23 15:43, Christoph Hellwig wrote:
> On Tue, Nov 23, 2021 at 02:44:32PM +0800, Qu Wenruo wrote:
>> Hi,
>>
>> Although some out-of-date comments mention other bio_clone_*()
>> variants, there isn't really any bio clone variant other than
>> __bio_clone_fast(), which shares bi_io_vec with the source bio.
>>
>> This limitation means we can't free the source bio before the cloned one.
>>
>> Is there any bio_clone variant which does a deep clone, including bi_io_vec?
>
> There is no use case for that, unless the actual data changes like in
> the bounce buffering code.
>
>> That's why bio_clone is involved: there are still some corner cases
>> where we don't want to fail the whole large bio if only one stripe
>> failed (mostly for read bios, where we want to salvage as much data as
>> possible).
>>
>> Thus the regular bio_split() + bio_chain() solution is not that good here.
>>
>> Any idea why no bio_clone_slow() or bio_split_slow() is provided in the
>> block layer?
>>
>> Or is bio_split() + bio_chain() really the only recommended solution?
>
> You can use bio_split without bio_chain.  You just need your own
> bi_end_io handler that first performs the action you want and then
> contains code equivalent to __bio_chain_endio.  As a bonus you can
> point bi_private to whatever you want; it does not have to be the parent
> bio, just something that allows you to find it.
>
Going without bio_chain() sounds pretty good, as we can still utilize
bi_end_io and bi_private.

But this also means we're now responsible for not releasing the source
bio, since it has the real bi_io_vec.

Let me explore this and hopefully align btrfs with the dm/md code more.

Thanks,
Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Christoph Hellwig @ 2021-11-23  8:13 UTC
  To: Qu Wenruo; +Cc: Christoph Hellwig, Linux FS Devel, dm-devel, linux-btrfs

On Tue, Nov 23, 2021 at 04:10:35PM +0800, Qu Wenruo wrote:
> Going without bio_chain() sounds pretty good, as we can still utilize
> bi_end_io and bi_private.
> 
> But this also means we're now responsible for not releasing the source
> bio, since it has the real bi_io_vec.

Just call bio_inc_remaining before submitting the cloned bio, and then
call bio_endio on the root bio every time a clone completes.
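
(Roughly, as a sketch:)

	/* submission side: account one extra completion per clone */
	bio_inc_remaining(root_bio);
	submit_bio(clone);

	/* in the clone's bi_end_io handler */
	if (clone->bi_status && !root_bio->bi_status)
		root_bio->bi_status = clone->bi_status;
	bio_put(clone);
	bio_endio(root_bio);	/* the root completes once all clones did */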

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-23 11:09 UTC
  To: Christoph Hellwig; +Cc: Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/23 16:13, Christoph Hellwig wrote:
> On Tue, Nov 23, 2021 at 04:10:35PM +0800, Qu Wenruo wrote:
>> Without bio_chain() sounds pretty good, as we can still utilize
>> bi_end_io and bi_private.
>>
>> But this also means, we're now responsible not to release the source bio
>> since it has the real bi_io_vec.
>
> Just call bio_inc_remaining before submitting the cloned bio, and then
> call bio_endio on the root bio every time a clone completes.
>
Yeah, that sounds pretty good for regular usage.

But there is another very tricky case involved.

Btrfs supports zoned devices, thus we have special call sites to switch
between bio_add_page() and bio_add_zone_append_page().

But a zoned write can't be split, nor is there an easy way to directly
convert a regular bio into a bio with zone-append pages.

Currently, if we go the slow path (allocating a new bio, then adding
pages from the original bio and advancing the original bio), we're able
to do the conversion from a regular bio to a zone-append bio.
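
(The slow path is essentially the following sketch; the device and
byte-count variables are hypothetical, and error handling is omitted:)

	struct bio *new = bio_alloc(GFP_NOFS, bio_segments(bio));

	new->bi_opf = REQ_OP_ZONE_APPEND | (bio->bi_opf & ~REQ_OP_MASK);
	bio_set_dev(new, dev->bdev);	/* needed for the hw queue limits */

	while (size > 0) {
		struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);
		unsigned int len = min_t(u64, size, bv.bv_len);

		/* enforces max_zone_append_sectors via bio_add_hw_page() */
		if (bio_add_zone_append_page(new, bv.bv_page, len,
					     bv.bv_offset) != len)
			break;
		bio_advance(bio, len);	/* consume from the original bio */
		size -= len;
	}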

Any idea on this corner case?

Thanks,
Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Johannes Thumshirn @ 2021-11-23 11:39 UTC
  To: Qu Wenruo, hch; +Cc: Linux FS Devel, dm-devel, linux-btrfs

On 23/11/2021 12:09, Qu Wenruo wrote:
> 
> 
> On 2021/11/23 16:13, Christoph Hellwig wrote:
>> On Tue, Nov 23, 2021 at 04:10:35PM +0800, Qu Wenruo wrote:
>>> Going without bio_chain() sounds pretty good, as we can still utilize
>>> bi_end_io and bi_private.
>>>
>>> But this also means we're now responsible for not releasing the source
>>> bio, since it has the real bi_io_vec.
>>
>> Just call bio_inc_remaining before submitting the cloned bio, and then
>> call bio_endio on the root bio every time a clone completes.
>>
> Yeah, that sounds pretty good for regular usage.
> 
> But there is another very tricky case involved.
> 
> Btrfs supports zoned devices, thus we have special call sites to switch
> between bio_add_page() and bio_add_zone_append_page().
> 
> But a zoned write can't be split, nor is there an easy way to directly
> convert a regular bio into a bio with zone-append pages.
> 
> Currently, if we go the slow path (allocating a new bio, then adding
> pages from the original bio and advancing the original bio), we're able
> to do the conversion from a regular bio to a zone-append bio.
> 
> Any idea on this corner case?

I think we have to differentiate two cases here:
a "regular" REQ_OP_ZONE_APPEND bio and a RAID stripe REQ_OP_ZONE_APPEND
bio. The first one (i.e. the regular REQ_OP_ZONE_APPEND bio) can't be split,
because we cannot guarantee the order in which the device writes the data
to disk. The RAID stripe bio we can split into two (or more) parts that
will end up on _different_ devices. All we need to do is a) ensure each
part doesn't cross the device's zone-append limit and b) clamp each part's
bi_iter.bi_sector down to the start of the target zone, i.e. stick to
the rules of REQ_OP_ZONE_APPEND.
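
(As a rough sketch, with the zone bookkeeping variables left
hypothetical:)

	unsigned int max = queue_max_zone_append_sectors(
					bdev_get_queue(dev->bdev));
	struct bio *split;

	/* a) split while still a plain write (bio_split() refuses
	 * REQ_OP_ZONE_APPEND bios), staying below the device limit */
	split = bio_split(bio, min_t(unsigned int, len_sectors, max),
			  GFP_NOFS, &fs_bio_set);

	/* b) retarget the part as a zone append to the zone start; the
	 * device reports the actual write position at completion */
	split->bi_opf = REQ_OP_ZONE_APPEND | REQ_NOMERGE |
			(split->bi_opf & ~REQ_OP_MASK);
	split->bi_iter.bi_sector = zone_start_sector;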

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: hch @ 2021-11-23 14:28 UTC
  To: Johannes Thumshirn; +Cc: Qu Wenruo, hch, Linux FS Devel, dm-devel, linux-btrfs

On Tue, Nov 23, 2021 at 11:39:11AM +0000, Johannes Thumshirn wrote:
> I think we have to differentiate two cases here:
> a "regular" REQ_OP_ZONE_APPEND bio and a RAID stripe REQ_OP_ZONE_APPEND
> bio. The first one (i.e. the regular REQ_OP_ZONE_APPEND bio) can't be split,
> because we cannot guarantee the order in which the device writes the data
> to disk. The RAID stripe bio we can split into two (or more) parts that
> will end up on _different_ devices. All we need to do is a) ensure each
> part doesn't cross the device's zone-append limit and b) clamp each part's
> bi_iter.bi_sector down to the start of the target zone, i.e. stick to
> the rules of REQ_OP_ZONE_APPEND.

Exactly.  A stacking driver must never split a REQ_OP_ZONE_APPEND bio.
But the file system itself can of course split it as long as each
split-off bio has its own bi_end_io handler to record where it has been
written to.
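
(Such a handler could look roughly like the following sketch; the
ordered-extent plumbing is hypothetical:)

	static void zone_append_split_endio(struct bio *bio)
	{
		/* for REQ_OP_ZONE_APPEND the device returns the actual
		 * write position in bi_sector at completion time */
		u64 physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;

		/* record it for this split only, e.g. rewrite the
		 * matching ordered extent's logical address */
		...

		bio_endio(bio->bi_private);	/* complete the parent */
	}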

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-23 23:07 UTC
  To: hch, Johannes Thumshirn; +Cc: Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/23 22:28, hch@infradead.org wrote:
> On Tue, Nov 23, 2021 at 11:39:11AM +0000, Johannes Thumshirn wrote:
>> I think we have to differentiate two cases here:
>> a "regular" REQ_OP_ZONE_APPEND bio and a RAID stripe REQ_OP_ZONE_APPEND
>> bio. The first one (i.e. the regular REQ_OP_ZONE_APPEND bio) can't be split,
>> because we cannot guarantee the order in which the device writes the data
>> to disk.

That's correct.

But if we want to move all bio splitting into the chunk layer, we want
an initial bio without any limitation, and then use that bio to create
real REQ_OP_ZONE_APPEND bios with the proper size limitations.

>> The RAID stripe bio we can split into two (or more) parts that
>> will end up on _different_ devices. All we need to do is a) ensure each
>> part doesn't cross the device's zone-append limit and b) clamp each part's
>> bi_iter.bi_sector down to the start of the target zone, i.e. stick to
>> the rules of REQ_OP_ZONE_APPEND.
>
> Exactly.  A stacking driver must never split a REQ_OP_ZONE_APPEND bio.
> But the file system itself can of course split it as long as each
> split-off bio has its own bi_end_io handler to record where it has been
> written to.
>

This makes me wonder: can we really forget the zone thing for the
initial bio, so that we just create a plain bio without any special
limitation and let every split condition be handled in the lower layer?

Including the RAID stripe boundary, zone limitations, etc.

(Yeah, it's still not a pure stacking driver, but it's more
stacking-driver like.)

In that case, the missing piece seems to be a way to convert a split
plain bio into a REQ_OP_ZONE_APPEND bio.

Can this be done without slow bvec copying?

Thanks,
Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: hch @ 2021-11-24  6:09 UTC
  To: Qu Wenruo; +Cc: hch, Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs

On Wed, Nov 24, 2021 at 07:07:18AM +0800, Qu Wenruo wrote:
> In that case, the missing piece seems to be a way to convert a split
> plain bio into a REQ_OP_ZONE_APPEND bio.
> 
> Can this be done without slow bvec copying?

Yes.  I have a WIP stacking driver that converts writes to zone appends
and it does just that:

	sector_t orig_sector = bio->bi_iter.bi_sector;
	unsigned int bio_flags = bio->bi_opf & ~REQ_OP_MASK;

	...

	clone = bio_clone_fast(bio, GFP_NOIO, &bdev->write_bio_set);

	...

	clone->bi_opf = REQ_OP_ZONE_APPEND | REQ_NOMERGE | bio_flags;
	bio_set_dev(clone, dev->lower_bdev);
	clone->bi_iter.bi_sector = zone_sector;
	trace_block_bio_remap(clone, disk_devt(disk), orig_sector);

> 
> Thanks,
> Qu
---end quoted text---

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-24  6:18 UTC
  To: hch; +Cc: Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/24 14:09, hch@infradead.org wrote:
> On Wed, Nov 24, 2021 at 07:07:18AM +0800, Qu Wenruo wrote:
>> In that case, the missing piece seems to be a way to convert a split
>> plain bio into a REQ_OP_ZONE_APPEND bio.
>>
>> Can this be done without slow bvec copying?
>
> Yes.  I have a WIP stacking driver that converts writes to zone appends
> and it does just that:
>
> 	sector_t orig_sector = bio->bi_iter.bi_sector;
> 	unsigned int bio_flags = bio->bi_opf & ~REQ_OP_MASK;
>
> 	...
>
> 	clone = bio_clone_fast(bio, GFP_NOIO, &bdev->write_bio_set);
>
> 	...
>
> 	clone->bi_opf = REQ_OP_ZONE_APPEND | REQ_NOMERGE | bio_flags;

Just so simple? Then that's super awesome.

But I'm a little concerned about the bio_add_hw_page() call in
bio_add_zone_append_page().

It's not exactly the same as bio_add_page().

Does it mean that as long as our split bio doesn't exceed the zone limit,
we can do the conversion without any further problem?

Thanks,
Qu
> 	bio_set_dev(clone, dev->lower_bdev);
> 	clone->bi_iter.bi_sector = zone_sector;
> 	trace_block_bio_remap(clone, disk_devt(disk), orig_sector);
>
>>
>> Thanks,
>> Qu
> ---end quoted text---
>

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: hch @ 2021-11-24  7:02 UTC
  To: Qu Wenruo; +Cc: hch, Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs

On Wed, Nov 24, 2021 at 02:18:00PM +0800, Qu Wenruo wrote:
> Just so simple? Then that's super awesome.
> 
> But I'm a little concerned about the bio_add_hw_page() call in
> bio_add_zone_append_page().
> 
> It's not exactly the same as bio_add_page().
> 
> Does it mean that as long as our split bio doesn't exceed the zone limit,
> we can do the conversion without any further problem?

You need to look at the limits when splitting.  I have modified
blk_bio_segment_split and exported it to deal with that.  Let me split
those changes out cleanly and push out a branch.

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: hch @ 2021-11-24  7:22 UTC
  To: Qu Wenruo; +Cc: hch, Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs

On Tue, Nov 23, 2021 at 11:02:52PM -0800, hch@infradead.org wrote:
> > Does it mean that as long as our split bio doesn't exceed the zone limit,
> > we can do the conversion without any further problem?
> 
> You need to look at the limits when splitting.  I have modified
> blk_bio_segment_split and exported it to deal with that.  Let me split
> those changes out cleanly and push out a branch.

Here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/zone-append-split

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Naohiro Aota @ 2021-11-24  7:25 UTC
  To: Qu Wenruo; +Cc: hch, Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs

On Wed, Nov 24, 2021 at 07:07:18AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/11/23 22:28, hch@infradead.org wrote:
> > On Tue, Nov 23, 2021 at 11:39:11AM +0000, Johannes Thumshirn wrote:
> > > I think we have to differentiate two cases here:
> > > a "regular" REQ_OP_ZONE_APPEND bio and a RAID stripe REQ_OP_ZONE_APPEND
> > > bio. The first one (i.e. the regular REQ_OP_ZONE_APPEND bio) can't be split,
> > > because we cannot guarantee the order in which the device writes the data
> > > to disk.
> 
> That's correct.
> 
> But if we want to move all bio splitting into the chunk layer, we want
> an initial bio without any limitation, and then use that bio to create
> real REQ_OP_ZONE_APPEND bios with the proper size limitations.
> 
> > > The RAID stripe bio we can split into two (or more) parts that
> > > will end up on _different_ devices. All we need to do is a) ensure each
> > > part doesn't cross the device's zone-append limit and b) clamp each part's
> > > bi_iter.bi_sector down to the start of the target zone, i.e. stick to
> > > the rules of REQ_OP_ZONE_APPEND.
> > 
> > Exactly.  A stacking driver must never split a REQ_OP_ZONE_APPEND bio.
> > But the file system itself can of course split it as long as each
> > split-off bio has its own bi_end_io handler to record where it has been
> > written to.
> 
> This makes me wonder: can we really forget the zone thing for the
> initial bio, so that we just create a plain bio without any special
> limitation and let every split condition be handled in the lower layer?
> 
> Including the RAID stripe boundary, zone limitations, etc.

What really matters is to ensure the "one bio (for a real zoned device)
== one ordered extent" rule. When a device rewrites a ZONE_APPEND bio's
sector address, we rewrite the ordered extent's logical address
accordingly in the end_io process. To ensure the rewriting works, one
extent must be composed of one contiguous bio.

So, if we can split an ordered extent in the bio splitting process,
that will be fine. Or, it is also fine if we can split an ordered
extent in the end_bio process. But I think that is difficult, because
someone may already be waiting for the ordered extent, and splitting
it at that point will break some assumptions in the code.

> (Yeah, it's still not a pure stacking driver, but it's more
> stacking-driver like.)
> 
> In that case, the missing piece seems to be a way to convert a split
> plain bio into a REQ_OP_ZONE_APPEND bio.
> 
> Can this be done without slow bvec copying?
> 
> Thanks,
> Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-24  7:39 UTC
  To: Naohiro Aota
  Cc: hch, Johannes Thumshirn, Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/24 15:25, Naohiro Aota wrote:
> On Wed, Nov 24, 2021 at 07:07:18AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2021/11/23 22:28, hch@infradead.org wrote:
>>> On Tue, Nov 23, 2021 at 11:39:11AM +0000, Johannes Thumshirn wrote:
>>>> I think we have to differentiate two cases here:
>>>> a "regular" REQ_OP_ZONE_APPEND bio and a RAID stripe REQ_OP_ZONE_APPEND
>>>> bio. The first one (i.e. the regular REQ_OP_ZONE_APPEND bio) can't be split,
>>>> because we cannot guarantee the order in which the device writes the data
>>>> to disk.
>>
>> That's correct.
>>
>> But if we want to move all bio splitting into the chunk layer, we want
>> an initial bio without any limitation, and then use that bio to create
>> real REQ_OP_ZONE_APPEND bios with the proper size limitations.
>>
>>>> The RAID stripe bio we can split into two (or more) parts that
>>>> will end up on _different_ devices. All we need to do is a) ensure each
>>>> part doesn't cross the device's zone-append limit and b) clamp each part's
>>>> bi_iter.bi_sector down to the start of the target zone, i.e. stick to
>>>> the rules of REQ_OP_ZONE_APPEND.
>>>
>>> Exactly.  A stacking driver must never split a REQ_OP_ZONE_APPEND bio.
>>> But the file system itself can of course split it as long as each
>>> split-off bio has its own bi_end_io handler to record where it has been
>>> written to.
>>>
>>
>> This makes me wonder: can we really forget the zone thing for the
>> initial bio, so that we just create a plain bio without any special
>> limitation and let every split condition be handled in the lower layer?
>>
>> Including the RAID stripe boundary, zone limitations, etc.
>
> What really matters is to ensure the "one bio (for a real zoned device)
> == one ordered extent" rule. When a device rewrites a ZONE_APPEND bio's
> sector address, we rewrite the ordered extent's logical address
> accordingly in the end_io process. To ensure the rewriting works, one
> extent must be composed of one contiguous bio.
>
> So, if we can split an ordered extent in the bio splitting process,
> that will be fine. Or, it is also fine if we can split an ordered
> extent in the end_bio process. But I think that is difficult, because
> someone may already be waiting for the ordered extent, and splitting
> it at that point will break some assumptions in the code.

OK, I see the problem now.

It's extract_ordered_extent() relying on the zone-append bio to split
the ordered extents.

Not the opposite; thus it will still be more complex than I thought to
split bios in the chunk layer.

I'll leave the zoned part untouched for now until I have a better solution.

Thanks,
Qu
>
>> (Yeah, it's still not a pure stacking driver, but it's more
>> stacking-driver like.)
>>
>> In that case, the missing piece seems to be a way to convert a split
>> plain bio into a REQ_OP_ZONE_APPEND bio.
>>
>> Can this be done without slow bvec copying?
>>
>> Thanks,
>> Qu

* Re: Any bio_clone_slow() implementation which doesn't share bi_io_vec?
From: Qu Wenruo @ 2021-11-26 12:33 UTC
  To: Christoph Hellwig; +Cc: Linux FS Devel, dm-devel, linux-btrfs



On 2021/11/23 16:13, Christoph Hellwig wrote:
> On Tue, Nov 23, 2021 at 04:10:35PM +0800, Qu Wenruo wrote:
>> Going without bio_chain() sounds pretty good, as we can still utilize
>> bi_end_io and bi_private.
>>
>> But this also means we're now responsible for not releasing the source
>> bio, since it has the real bi_io_vec.
>
> Just call bio_inc_remaining before submitting the cloned bio, and then
> call bio_endio on the root bio every time a clone completes.
>

Does this also mean we need to save bi_iter before bio submission, so
that at endio time we can iterate over only the split part of the bio?

Or do we have to go back to the bio_chain() method and rely on the
parent bio to iterate all the bvecs?
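
(I.e., something like the following sketch:)

	/* snapshot the split's range before submission consumes it */
	struct bvec_iter saved = split->bi_iter;

	...

	/* in the endio handler: walk only the range this split covered */
	struct bio_vec bv;
	struct bvec_iter iter;

	__bio_for_each_segment(bv, split, iter, saved) {
		/* e.g. verify checksums for this split only */
	}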

Thanks,
Qu
