All of lore.kernel.org
 help / color / mirror / Atom feed
* "creative" bio usage in the RAID code
@ 2016-11-10 19:46 Christoph Hellwig
  2016-11-11 19:02 ` Shaohua Li
  2016-11-13 23:03 ` NeilBrown
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2016-11-10 19:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block

Hi Shaohua,

one of the major issues with Ming Lei's multipage biovec works
is that we can't easily enabled the MD RAID code for it.  I had
a quick chat on that with Chris and Jens and they suggested talking
to you about it.

It's mostly about the RAID1 and RAID10 code which does a lot of funny
things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
drivers don't touch.  One example is the r1buf_pool_alloc code,
which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
case, which would also take care of r1buf_pool_free.  I'm not sure
about all the others cases, as some bits don't fully make sense to me,
e.g. why we're trying to do single page I/O out of a bigger bio.

Maybe you have some better ideas what's going on there?

Another not quite as urgent issue is how the RAID5 code abuses
->bi_phys_segments as and outstanding I/O counter, and I have no
really good answer to that either.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-10 19:46 "creative" bio usage in the RAID code Christoph Hellwig
@ 2016-11-11 19:02 ` Shaohua Li
  2016-11-12 17:42   ` Christoph Hellwig
  2016-11-13 23:03 ` NeilBrown
  1 sibling, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2016-11-11 19:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb

On Thu, Nov 10, 2016 at 11:46:36AM -0800, Christoph Hellwig wrote:
> Hi Shaohua,
> 
> one of the major issues with Ming Lei's multipage biovec works
> is that we can't easily enabled the MD RAID code for it.  I had
> a quick chat on that with Chris and Jens and they suggested talking
> to you about it.
> 
> It's mostly about the RAID1 and RAID10 code which does a lot of funny
> things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
> drivers don't touch.  One example is the r1buf_pool_alloc code,
> which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> case, which would also take care of r1buf_pool_free.  I'm not sure
> about all the others cases, as some bits don't fully make sense to me,

The problem is we use the iov_vec to track the pages allocated. We will read
data to the pages and write out later for resync. If we add new fields to track
the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
avoid the tricky parts. This should work for both the resync and writebehind
cases.

> e.g. why we're trying to do single page I/O out of a bigger bio.

what's this one?

> Maybe you have some better ideas what's going on there?
> 
> Another not quite as urgent issue is how the RAID5 code abuses
> ->bi_phys_segments as and outstanding I/O counter, and I have no
> really good answer to that either.

I don't have good idea for this one either if we don't want to allocate extra
memory. The good side is we never dispatch the original bio to under layer
disks.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-11 19:02 ` Shaohua Li
@ 2016-11-12 17:42   ` Christoph Hellwig
  2016-11-13 22:53       ` NeilBrown
  2016-11-15  0:13     ` Shaohua Li
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2016-11-12 17:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block, neilb

On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
> > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
> > drivers don't touch.  One example is the r1buf_pool_alloc code,
> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> > case, which would also take care of r1buf_pool_free.  I'm not sure
> > about all the others cases, as some bits don't fully make sense to me,
> 
> The problem is we use the iov_vec to track the pages allocated. We will read
> data to the pages and write out later for resync. If we add new fields to track
> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
> avoid the tricky parts. This should work for both the resync and writebehind
> cases.

I don't think we need to track the pages specificly - if we clone
a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
we do one bio_kmalloc, then bio_alloc_pages then clone it for the
others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
bio_alloc_pages for each.

While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
confusing, and I'm not 100% sure it's correct.  After all we check it
in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
on these callbacks being done after the flag has been raise / cleared,
which makes me bit suspicious, and also question why we even need the
mempool.

> 
> > e.g. why we're trying to do single page I/O out of a bigger bio.
> 
> what's this one?

fix_sync_read_error

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-12 17:42   ` Christoph Hellwig
@ 2016-11-13 22:53       ` NeilBrown
  2016-11-15  0:13     ` Shaohua Li
  1 sibling, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-13 22:53 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

On Sun, Nov 13 2016, Christoph Hellwig wrote:

> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > about all the others cases, as some bits don't fully make sense to me,
>> 
>> The problem is we use the iov_vec to track the pages allocated. We will read
>> data to the pages and write out later for resync. If we add new fields to track
>> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> avoid the tricky parts. This should work for both the resync and writebehind
>> cases.
>
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Part of the reason for the oddities in this code is that I wanted a
collection of bios, one per device, which were all the same size.  As
different devices might impose different restrictions on the size of the
bios, I built them carefully, step by step.

Now that those restrictions are gone, we can - as you say - just
allocate a suitably sized bio and then clone it for each device.

>
> While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> confusing, and I'm not 100% sure it's correct.  After all we check it
> in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> on these callbacks being done after the flag has been raise / cleared,
> which makes me bit suspicious, and also question why we even need the
> mempool.

MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
races there.
The r1buf_pool mempool is created are the start of resync, so at that
time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
after the mempool is freed.

To perform a resync we need a pool of memory buffers.  We don't want to
have to cope with kmalloc failing, but are quite able to cope with
mempool_alloc() blocking.
We probably don't need nearly as many bufs as we allocate (4 is probably
plenty), but having a pool is certainly convenient.

>
>> 
>> > e.g. why we're trying to do single page I/O out of a bigger bio.
>> 
>> what's this one?
>
> fix_sync_read_error

The "bigger bio" might cover a large number of sectors.  If there are
media errors, there might be only one sector that is bad.  So we repeat
the read with finer granularity (pages in the current code, though
device block would be ideal) and only recovery bad blocks for individual
pages which are bad and cannot be fixed.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
@ 2016-11-13 22:53       ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-13 22:53 UTC (permalink / raw)
  To: Christoph Hellwig, Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

On Sun, Nov 13 2016, Christoph Hellwig wrote:

> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > about all the others cases, as some bits don't fully make sense to me,
>> 
>> The problem is we use the iov_vec to track the pages allocated. We will read
>> data to the pages and write out later for resync. If we add new fields to track
>> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> avoid the tricky parts. This should work for both the resync and writebehind
>> cases.
>
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Part of the reason for the oddities in this code is that I wanted a
collection of bios, one per device, which were all the same size.  As
different devices might impose different restrictions on the size of the
bios, I built them carefully, step by step.

Now that those restrictions are gone, we can - as you say - just
allocate a suitably sized bio and then clone it for each device.

>
> While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> confusing, and I'm not 100% sure it's correct.  After all we check it
> in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> on these callbacks being done after the flag has been raise / cleared,
> which makes me bit suspicious, and also question why we even need the
> mempool.

MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
races there.
The r1buf_pool mempool is created are the start of resync, so at that
time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
after the mempool is freed.

To perform a resync we need a pool of memory buffers.  We don't want to
have to cope with kmalloc failing, but are quite able to cope with
mempool_alloc() blocking.
We probably don't need nearly as many bufs as we allocate (4 is probably
plenty), but having a pool is certainly convenient.

>
>> 
>> > e.g. why we're trying to do single page I/O out of a bigger bio.
>> 
>> what's this one?
>
> fix_sync_read_error

The "bigger bio" might cover a large number of sectors.  If there are
media errors, there might be only one sector that is bad.  So we repeat
the read with finer granularity (pages in the current code, though
device block would be ideal) and only recovery bad blocks for individual
pages which are bad and cannot be fixed.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-10 19:46 "creative" bio usage in the RAID code Christoph Hellwig
  2016-11-11 19:02 ` Shaohua Li
@ 2016-11-13 23:03 ` NeilBrown
  2016-11-14  8:51   ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-11-13 23:03 UTC (permalink / raw)
  To: Christoph Hellwig, Shaohua Li; +Cc: linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 720 bytes --]

On Fri, Nov 11 2016, Christoph Hellwig wrote:
>
> Another not quite as urgent issue is how the RAID5 code abuses
> ->bi_phys_segments as and outstanding I/O counter, and I have no
> really good answer to that either.

I would suggest adding a "bi_dev_private" field to the bio which is for
use by the lowest-level driver (much as bi_private is for use by the
top-level initiator).
That could be in a union with any or all of:
	unsigned int		bi_phys_segments;
	unsigned int		bi_seg_front_size;
	unsigned int		bi_seg_back_size;

(any driver that needs those, would see a 'request' rather than a 'bio'
and so could use rq->special)

raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-13 23:03 ` NeilBrown
@ 2016-11-14  8:51   ` Christoph Hellwig
  2016-11-14  9:43       ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-11-14  8:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
> I would suggest adding a "bi_dev_private" field to the bio which is for
> use by the lowest-level driver (much as bi_private is for use by the
> top-level initiator).
> That could be in a union with any or all of:
> 	unsigned int		bi_phys_segments;
> 	unsigned int		bi_seg_front_size;
> 	unsigned int		bi_seg_back_size;
> 
> (any driver that needs those, would see a 'request' rather than a 'bio'
> and so could use rq->special)
> 
> raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).

All the three above fields are those that could go away with a full
implementation of the multipage bvec scheme.  So any field for driver
use would still be be overhead.  If it's just for raid5 it could
be a smaller 16 bit (or maybe even just 8 bit) one.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-13 22:53       ` NeilBrown
  (?)
@ 2016-11-14  8:57       ` Christoph Hellwig
  2016-11-14  9:51           ` NeilBrown
  -1 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-11-14  8:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> > confusing, and I'm not 100% sure it's correct.  After all we check it
> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> > on these callbacks being done after the flag has been raise / cleared,
> > which makes me bit suspicious, and also question why we even need the
> > mempool.
> 
> MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
> races there.
> The r1buf_pool mempool is created are the start of resync, so at that
> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
> after the mempool is freed.
> 
> To perform a resync we need a pool of memory buffers.  We don't want to
> have to cope with kmalloc failing, but are quite able to cope with
> mempool_alloc() blocking.
> We probably don't need nearly as many bufs as we allocate (4 is probably
> plenty), but having a pool is certainly convenient.

Would it be good to create/delete the pool explicitly through methods
to start/emd the sync?  Right now the behavior looks very, very
confusing.

> The "bigger bio" might cover a large number of sectors.  If there are
> media errors, there might be only one sector that is bad.  So we repeat
> the read with finer granularity (pages in the current code, though
> device block would be ideal) and only recovery bad blocks for individual
> pages which are bad and cannot be fixed.

i have no problems with the behavior - the point is that these days
this should be without poking into the bio internals, but by using
a bio iterator for just the range you want to re-read.  Potentially
using a bio clone if we can't reusing the existing bio, although I'm
not sure we even need that from looking at the code.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-14  8:51   ` Christoph Hellwig
@ 2016-11-14  9:43       ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-14  9:43 UTC (permalink / raw)
  Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
>> I would suggest adding a "bi_dev_private" field to the bio which is for
>> use by the lowest-level driver (much as bi_private is for use by the
>> top-level initiator).
>> That could be in a union with any or all of:
>> 	unsigned int		bi_phys_segments;
>> 	unsigned int		bi_seg_front_size;
>> 	unsigned int		bi_seg_back_size;
>> 
>> (any driver that needs those, would see a 'request' rather than a 'bio'
>> and so could use rq->special)
>> 
>> raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).
>
> All the three above fields are those that could go away with a full
> implementation of the multipage bvec scheme.  So any field for driver
> use would still be be overhead.  If it's just for raid5 it could
> be a smaller 16 bit (or maybe even just 8 bit) one.

We currently store 2 counters in that field, and before
commit 5b99c2ffa980528a197f26 one of the fields was only 8 bits,
and that caused problems

We could possibly use __bi_remaining in place of
raid5_X_bi_active_stripes().  It wouldn't be a completely
straightforward conversion, but I think it could be made to work.

We *might* be able to use bvec_iter_advance() in place of
raid5_bi_processed_stripes().  A careful audit of the code would be
needed to be certain.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
@ 2016-11-14  9:43       ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-14  9:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
>> I would suggest adding a "bi_dev_private" field to the bio which is for
>> use by the lowest-level driver (much as bi_private is for use by the
>> top-level initiator).
>> That could be in a union with any or all of:
>> 	unsigned int		bi_phys_segments;
>> 	unsigned int		bi_seg_front_size;
>> 	unsigned int		bi_seg_back_size;
>> 
>> (any driver that needs those, would see a 'request' rather than a 'bio'
>> and so could use rq->special)
>> 
>> raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).
>
> All the three above fields are those that could go away with a full
> implementation of the multipage bvec scheme.  So any field for driver
> use would still be be overhead.  If it's just for raid5 it could
> be a smaller 16 bit (or maybe even just 8 bit) one.

We currently store 2 counters in that field, and before
commit 5b99c2ffa980528a197f26 one of the fields was only 8 bits,
and that caused problems

We could possibly use __bi_remaining in place of
raid5_X_bi_active_stripes().  It wouldn't be a completely
straightforward conversion, but I think it could be made to work.

We *might* be able to use bvec_iter_advance() in place of
raid5_bi_processed_stripes().  A careful audit of the code would be
needed to be certain.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-14  8:57       ` Christoph Hellwig
@ 2016-11-14  9:51           ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-14  9:51 UTC (permalink / raw)
  Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
>> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
>> > confusing, and I'm not 100% sure it's correct.  After all we check it
>> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
>> > on these callbacks being done after the flag has been raise / cleared,
>> > which makes me bit suspicious, and also question why we even need the
>> > mempool.
>> 
>> MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
>> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
>> races there.
>> The r1buf_pool mempool is created are the start of resync, so at that
>> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
>> after the mempool is freed.
>> 
>> To perform a resync we need a pool of memory buffers.  We don't want to
>> have to cope with kmalloc failing, but are quite able to cope with
>> mempool_alloc() blocking.
>> We probably don't need nearly as many bufs as we allocate (4 is probably
>> plenty), but having a pool is certainly convenient.
>
> Would it be good to create/delete the pool explicitly through methods
> to start/emd the sync?  Right now the behavior looks very, very
> confusing.

Maybe.  It is created the first time ->sync_request is called,
and destroyed when it is called with a sector_nr at-or-beyond the end of
the device.  I guess some of that could be made a bit more obvious.
I'm not strongly against adding new methods for "start_sync" and "stop_sync"
but I don't see that it is really needed.

>
>> The "bigger bio" might cover a large number of sectors.  If there are
>> media errors, there might be only one sector that is bad.  So we repeat
>> the read with finer granularity (pages in the current code, though
>> device block would be ideal) and only recovery bad blocks for individual
>> pages which are bad and cannot be fixed.
>
> i have no problems with the behavior - the point is that these days
> this should be without poking into the bio internals, but by using
> a bio iterator for just the range you want to re-read.  Potentially
> using a bio clone if we can't reusing the existing bio, although I'm
> not sure we even need that from looking at the code.

Fair enough.  The code predates bio iterators and "if it ain't broke,
don't fix it".  If it is now causing problems, then maybe it is now
"broke" and should be "fixed".

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
@ 2016-11-14  9:51           ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-14  9:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
>> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
>> > confusing, and I'm not 100% sure it's correct.  After all we check it
>> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
>> > on these callbacks being done after the flag has been raise / cleared,
>> > which makes me bit suspicious, and also question why we even need the
>> > mempool.
>> 
>> MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
>> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
>> races there.
>> The r1buf_pool mempool is created are the start of resync, so at that
>> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
>> after the mempool is freed.
>> 
>> To perform a resync we need a pool of memory buffers.  We don't want to
>> have to cope with kmalloc failing, but are quite able to cope with
>> mempool_alloc() blocking.
>> We probably don't need nearly as many bufs as we allocate (4 is probably
>> plenty), but having a pool is certainly convenient.
>
> Would it be good to create/delete the pool explicitly through methods
> to start/emd the sync?  Right now the behavior looks very, very
> confusing.

Maybe.  It is created the first time ->sync_request is called,
and destroyed when it is called with a sector_nr at-or-beyond the end of
the device.  I guess some of that could be made a bit more obvious.
I'm not strongly against adding new methods for "start_sync" and "stop_sync"
but I don't see that it is really needed.

>
>> The "bigger bio" might cover a large number of sectors.  If there are
>> media errors, there might be only one sector that is bad.  So we repeat
>> the read with finer granularity (pages in the current code, though
>> device block would be ideal) and only recovery bad blocks for individual
>> pages which are bad and cannot be fixed.
>
> i have no problems with the behavior - the point is that these days
> this should be without poking into the bio internals, but by using
> a bio iterator for just the range you want to re-read.  Potentially
> using a bio clone if we can't reusing the existing bio, although I'm
> not sure we even need that from looking at the code.

Fair enough.  The code predates bio iterators and "if it ain't broke,
don't fix it".  If it is now causing problems, then maybe it is now
"broke" and should be "fixed".

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-12 17:42   ` Christoph Hellwig
  2016-11-13 22:53       ` NeilBrown
@ 2016-11-15  0:13     ` Shaohua Li
  2016-11-15  1:30         ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2016-11-15  0:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb

On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > > It's mostly about the RAID1 and RAID10 code which does a lot of funny
> > > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
> > > drivers don't touch.  One example is the r1buf_pool_alloc code,
> > > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> > > case, which would also take care of r1buf_pool_free.  I'm not sure
> > > about all the others cases, as some bits don't fully make sense to me,
> > 
> > The problem is we use the iov_vec to track the pages allocated. We will read
> > data to the pages and write out later for resync. If we add new fields to track
> > the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
> > avoid the tricky parts. This should work for both the resync and writebehind
> > cases.
> 
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Sure, for r1buf_pool_alloc, what you suggested should work well. There are a
lot of other places we are using bi_vcnt/bi_io_vec. I'm not sure if it's easy
to replace them with bio iterator. But having a separate data structue to track
the memory we read/rewite/sync and so on definitively will make things easier.
I'm not saying to add the extra data structure in bio but instead in r1bio.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
  2016-11-15  0:13     ` Shaohua Li
@ 2016-11-15  1:30         ` Ming Lei
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2016-11-15  1:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Hellwig,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	NeilBrown

On Tue, Nov 15, 2016 at 8:13 AM, Shaohua Li <shli@kernel.org> wrote:
> On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
>> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > > about all the others cases, as some bits don't fully make sense to me,
>> >
>> > The problem is we use the iov_vec to track the pages allocated. We will read
>> > data to the pages and write out later for resync. If we add new fields to track
>> > the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> > avoid the tricky parts. This should work for both the resync and writebehind
>> > cases.
>>
>> I don't think we need to track the pages specificly - if we clone
>> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
>> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
>> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
>> bio_alloc_pages for each.
>
> Sure, for r1buf_pool_alloc, what you suggested should work well. There are a
> lot of other places we are using bi_vcnt/bi_io_vec. I'm not sure if it's easy
> to replace them with bio iterator. But having a separate data structue to track
> the memory we read/rewite/sync and so on definitively will make things easier.
> I'm not saying to add the extra data structure in bio but instead in r1bio.

From view of multipage bvec, r1buf_pool_alloc() is fine because
the direct access to bi_vcnt/bi_io_vec just happens on a new allocated
bio. For other cases, if pages aren't added to one bio via bio_add_page(),
and the bio isn't cloned from somewhere,  it should be safe to keep current
usage about accessing to bi_vcnt/bi_io_vec.

But it is cleaner to use bio iterator helpers than direct access.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "creative" bio usage in the RAID code
@ 2016-11-15  1:30         ` Ming Lei
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2016-11-15  1:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Christoph Hellwig,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	NeilBrown

On Tue, Nov 15, 2016 at 8:13 AM, Shaohua Li <shli@kernel.org> wrote:
> On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
>> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > > about all the others cases, as some bits don't fully make sense to me,
>> >
>> > The problem is we use the iov_vec to track the pages allocated. We will read
>> > data to the pages and write out later for resync. If we add new fields to track
>> > the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> > avoid the tricky parts. This should work for both the resync and writebehind
>> > cases.
>>
>> I don't think we need to track the pages specificly - if we clone
>> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
>> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
>> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
>> bio_alloc_pages for each.
>
> Sure, for r1buf_pool_alloc, what you suggested should work well. There are a
> lot of other places we are using bi_vcnt/bi_io_vec. I'm not sure if it's easy
> to replace them with bio iterator. But having a separate data structue to track
> the memory we read/rewite/sync and so on definitively will make things easier.
> I'm not saying to add the extra data structure in bio but instead in r1bio.

>From view of multipage bvec, r1buf_pool_alloc() is fine because
the direct access to bi_vcnt/bi_io_vec just happens on a new allocated
bio. For other cases, if pages aren't added to one bio via bio_add_page(),
and the bio isn't cloned from somewhere,  it should be safe to keep current
usage about accessing to bi_vcnt/bi_io_vec.

But it is cleaner to use bio iterator helpers than direct access.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-11-15  1:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-10 19:46 "creative" bio usage in the RAID code Christoph Hellwig
2016-11-11 19:02 ` Shaohua Li
2016-11-12 17:42   ` Christoph Hellwig
2016-11-13 22:53     ` NeilBrown
2016-11-13 22:53       ` NeilBrown
2016-11-14  8:57       ` Christoph Hellwig
2016-11-14  9:51         ` NeilBrown
2016-11-14  9:51           ` NeilBrown
2016-11-15  0:13     ` Shaohua Li
2016-11-15  1:30       ` Ming Lei
2016-11-15  1:30         ` Ming Lei
2016-11-13 23:03 ` NeilBrown
2016-11-14  8:51   ` Christoph Hellwig
2016-11-14  9:43     ` NeilBrown
2016-11-14  9:43       ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.