* "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-10 19:46 UTC
To: Shaohua Li; +Cc: linux-raid, linux-block

Hi Shaohua,

one of the major issues with Ming Lei's multipage biovec work is that
we can't easily enable the MD RAID code for it.  I had a quick chat
about that with Chris and Jens and they suggested talking to you about
it.

It's mostly about the RAID1 and RAID10 code, which does a lot of funny
things with the bi_io_vec and bi_vcnt fields, which we'd prefer that
drivers don't touch.  One example is the r1buf_pool_alloc code, which
I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
case, which would also take care of r1buf_pool_free.  I'm not sure
about all the other cases, as some bits don't fully make sense to me,
e.g. why we're trying to do single-page I/O out of a bigger bio.
Maybe you have some better ideas about what's going on there?

Another, not quite as urgent, issue is how the RAID5 code abuses
->bi_phys_segments as an outstanding I/O counter, and I have no really
good answer to that either.
* Re: "creative" bio usage in the RAID code
From: Shaohua Li @ 2016-11-11 19:02 UTC
To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb

On Thu, Nov 10, 2016 at 11:46:36AM -0800, Christoph Hellwig wrote:
> Hi Shaohua,
>
> one of the major issues with Ming Lei's multipage biovec work is that
> we can't easily enable the MD RAID code for it.  I had a quick chat
> about that with Chris and Jens and they suggested talking to you
> about it.
>
> It's mostly about the RAID1 and RAID10 code, which does a lot of
> funny things with the bi_io_vec and bi_vcnt fields, which we'd
> prefer that drivers don't touch.  One example is the
> r1buf_pool_alloc code, which I think should simply use bio_clone for
> the MD_RECOVERY_REQUESTED case, which would also take care of
> r1buf_pool_free.  I'm not sure about all the other cases, as some
> bits don't fully make sense to me,

The problem is that we use bi_io_vec to track the pages allocated.  We
read data into the pages and write it out later for resync.  If we add
new fields to track the pages in the r1bio, we could use the standard
bio_kmalloc/bio_add_page API and avoid the tricky parts.  This should
work for both the resync and write-behind cases.

> e.g. why we're trying to do single-page I/O out of a bigger bio.

What's this one?

> Maybe you have some better ideas about what's going on there?
>
> Another, not quite as urgent, issue is how the RAID5 code abuses
> ->bi_phys_segments as an outstanding I/O counter, and I have no
> really good answer to that either.

I don't have a good idea for this one either, if we don't want to
allocate extra memory.  The good side is that we never dispatch the
original bio to the underlying disks.

Thanks,
Shaohua
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-12 17:42 UTC
To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block, neilb

On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > It's mostly about the RAID1 and RAID10 code, which does a lot of
> > funny things with the bi_io_vec and bi_vcnt fields, which we'd
> > prefer that drivers don't touch.  One example is the
> > r1buf_pool_alloc code, which I think should simply use bio_clone
> > for the MD_RECOVERY_REQUESTED case, which would also take care of
> > r1buf_pool_free.  I'm not sure about all the other cases, as some
> > bits don't fully make sense to me,
>
> The problem is that we use bi_io_vec to track the pages allocated.
> We read data into the pages and write it out later for resync.  If
> we add new fields to track the pages in the r1bio, we could use the
> standard bio_kmalloc/bio_add_page API and avoid the tricky parts.
> This should work for both the resync and write-behind cases.

I don't think we need to track the pages specifically - if we clone a
bio we share the bio_vec.  E.g. for the !MD_RECOVERY_REQUESTED case we
do one bio_kmalloc, then bio_alloc_pages, then clone it for the other
bios.  For MD_RECOVERY_REQUESTED we do a bio_kmalloc + bio_alloc_pages
for each.

While we're at it - I find the way MD_RECOVERY_REQUESTED is used
highly confusing, and I'm not 100% sure it's correct.  After all we
check it in r1buf_pool_alloc, which is a mempool alloc callback, so we
rely on these callbacks being done after the flag has been raised /
cleared.  That makes me a bit suspicious, and also makes me question
why we even need the mempool.

> > e.g. why we're trying to do single-page I/O out of a bigger bio.
>
> What's this one?

fix_sync_read_error
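Christoph's allocate-once-then-clone scheme is easiest to see outside the kernel. The sketch below is a hedged userspace model, not md code: `model_bio`, `model_alloc`, and `model_clone` are invented stand-ins for `struct bio`, `bio_kmalloc` + `bio_alloc_pages`, and `bio_clone`. The point it demonstrates is that clones reference the master's page vector instead of allocating pages of their own.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGES 4

/* Stand-in for struct bio: a vector of pages plus a shared refcount. */
struct model_bio {
    void *pages[MODEL_PAGES];   /* the "biovec" */
    int  *refcount;             /* shared between master and clones */
};

/* Model of bio_kmalloc + bio_alloc_pages: allocate the master bio
 * and its pages exactly once. */
static struct model_bio *model_alloc(void)
{
    struct model_bio *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    for (int i = 0; i < MODEL_PAGES; i++)
        b->pages[i] = calloc(1, 4096);
    b->refcount = malloc(sizeof(int));
    *b->refcount = 1;
    return b;
}

/* Model of bio_clone: the clone gets its own struct, but points at
 * the *same* pages -- no per-clone page allocation. */
static struct model_bio *model_clone(struct model_bio *master)
{
    struct model_bio *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    memcpy(c->pages, master->pages, sizeof(c->pages));
    c->refcount = master->refcount;
    (*c->refcount)++;
    return c;
}
```

For the MD_RECOVERY_REQUESTED case, where each device needs its own data, every bio would instead get its own `model_alloc()` call, matching Christoph's "bio_kmalloc + bio_alloc_pages for each".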
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 22:53 UTC
To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block

On Sun, Nov 13 2016, Christoph Hellwig wrote:
> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > It's mostly about the RAID1 and RAID10 code, which does a lot of
>> > funny things with the bi_io_vec and bi_vcnt fields, which we'd
>> > prefer that drivers don't touch.  One example is the
>> > r1buf_pool_alloc code, which I think should simply use bio_clone
>> > for the MD_RECOVERY_REQUESTED case, which would also take care of
>> > r1buf_pool_free.  I'm not sure about all the other cases, as some
>> > bits don't fully make sense to me,
>>
>> The problem is that we use bi_io_vec to track the pages allocated.
>> We read data into the pages and write it out later for resync.  If
>> we add new fields to track the pages in the r1bio, we could use the
>> standard bio_kmalloc/bio_add_page API and avoid the tricky parts.
>> This should work for both the resync and write-behind cases.
>
> I don't think we need to track the pages specifically - if we clone
> a bio we share the bio_vec.  E.g. for the !MD_RECOVERY_REQUESTED
> case we do one bio_kmalloc, then bio_alloc_pages, then clone it for
> the other bios.  For MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Part of the reason for the oddities in this code is that I wanted a
collection of bios, one per device, which were all the same size.  As
different devices might impose different restrictions on the size of
the bios, I built them carefully, step by step.  Now that those
restrictions are gone, we can - as you say - just allocate a suitably
sized bio and then clone it for each device.

> While we're at it - I find the way MD_RECOVERY_REQUESTED is used
> highly confusing, and I'm not 100% sure it's correct.  After all we
> check it in r1buf_pool_alloc, which is a mempool alloc callback, so
> we rely on these callbacks being done after the flag has been raised
> / cleared.  That makes me a bit suspicious, and also makes me
> question why we even need the mempool.

MD_RECOVERY_REQUESTED is only set or cleared when no recovery is
running.  The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure
there are no races there.  The r1buf_pool mempool is created at the
start of resync, so at that time MD_RECOVERY_REQUESTED will be stable,
and it will remain stable until after the mempool is freed.

To perform a resync we need a pool of memory buffers.  We don't want
to have to cope with kmalloc failing, but are quite able to cope with
mempool_alloc() blocking.  We probably don't need nearly as many bufs
as we allocate (4 is probably plenty), but having a pool is certainly
convenient.

>> > e.g. why we're trying to do single-page I/O out of a bigger bio.
>>
>> What's this one?
>
> fix_sync_read_error

The "bigger bio" might cover a large number of sectors.  If there are
media errors, there might be only one sector that is bad.  So we
repeat the read with finer granularity (pages in the current code,
though the device block size would be ideal) and only recover bad
blocks for individual pages which are bad and cannot be fixed.

NeilBrown
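The narrowing strategy Neil describes - retry a failed large read at page granularity to isolate the genuinely bad sectors - can be sketched in userspace C. This is a hedged illustration, not the md code: `read_sectors` and `example_read` are hypothetical device callbacks, and the function only shows the re-read-smaller pattern.

```c
#include <stdbool.h>
#include <stddef.h>

#define SECTORS_PER_PAGE 8   /* 512-byte sectors in a 4 KiB page */

/* Hypothetical device read: returns true on success. */
typedef bool (*read_fn)(size_t start_sector, size_t nr_sectors);

/* Try one large read; on failure, retry page by page and record
 * which pages are actually bad.  Returns the number of bad pages. */
static int narrow_read_error(read_fn read_sectors,
                             size_t start, size_t nr_pages,
                             bool *bad /* nr_pages entries */)
{
    if (read_sectors(start, nr_pages * SECTORS_PER_PAGE))
        return 0;               /* whole range fine, nothing to do */

    int nbad = 0;
    for (size_t p = 0; p < nr_pages; p++) {
        bad[p] = !read_sectors(start + p * SECTORS_PER_PAGE,
                               SECTORS_PER_PAGE);
        if (bad[p])
            nbad++;
    }
    return nbad;
}

/* Example device: every sector readable except sector 20, so any
 * read whose range covers sector 20 fails. */
static bool example_read(size_t start, size_t nr)
{
    return !(start <= 20 && 20 < start + nr);
}
```

With the example device, a 4-page read starting at sector 0 fails as a whole, but the page-by-page retry pins the error down to the single page containing sector 20.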
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14 8:57 UTC
To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used
> > highly confusing, and I'm not 100% sure it's correct.  After all
> > we check it in r1buf_pool_alloc, which is a mempool alloc
> > callback, so we rely on these callbacks being done after the flag
> > has been raised / cleared.  That makes me a bit suspicious, and
> > also makes me question why we even need the mempool.
>
> MD_RECOVERY_REQUESTED is only set or cleared when no recovery is
> running.  The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure
> there are no races there.  The r1buf_pool mempool is created at the
> start of resync, so at that time MD_RECOVERY_REQUESTED will be
> stable, and it will remain stable until after the mempool is freed.
>
> To perform a resync we need a pool of memory buffers.  We don't want
> to have to cope with kmalloc failing, but are quite able to cope
> with mempool_alloc() blocking.  We probably don't need nearly as
> many bufs as we allocate (4 is probably plenty), but having a pool
> is certainly convenient.

Would it be good to create/delete the pool explicitly through methods
to start/end the sync?  Right now the behavior looks very, very
confusing.

> The "bigger bio" might cover a large number of sectors.  If there
> are media errors, there might be only one sector that is bad.  So we
> repeat the read with finer granularity (pages in the current code,
> though the device block size would be ideal) and only recover bad
> blocks for individual pages which are bad and cannot be fixed.

I have no problems with the behavior - the point is that these days
this should be done without poking into the bio internals, but by
using a bio iterator for just the range you want to re-read.
Potentially using a bio clone if we can't reuse the existing bio,
although I'm not sure we even need that from looking at the code.
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14 9:51 UTC
To: Christoph Hellwig; +Cc: Shaohua Li, linux-raid, linux-block

On Mon, Nov 14 2016, Christoph Hellwig wrote:
> On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
>> MD_RECOVERY_REQUESTED is only set or cleared when no recovery is
>> running.  The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure
>> there are no races there.  The r1buf_pool mempool is created at the
>> start of resync, so at that time MD_RECOVERY_REQUESTED will be
>> stable, and it will remain stable until after the mempool is freed.
>>
>> To perform a resync we need a pool of memory buffers.  We don't
>> want to have to cope with kmalloc failing, but are quite able to
>> cope with mempool_alloc() blocking.  We probably don't need nearly
>> as many bufs as we allocate (4 is probably plenty), but having a
>> pool is certainly convenient.
>
> Would it be good to create/delete the pool explicitly through
> methods to start/end the sync?  Right now the behavior looks very,
> very confusing.

Maybe.  It is created the first time ->sync_request is called, and
destroyed when that is called with a sector_nr at or beyond the end of
the device.  I guess some of that could be made a bit more obvious.
I'm not strongly against adding new methods for "start_sync" and
"stop_sync", but I don't see that it is really needed.

>> The "bigger bio" might cover a large number of sectors.  If there
>> are media errors, there might be only one sector that is bad.  So
>> we repeat the read with finer granularity (pages in the current
>> code, though the device block size would be ideal) and only recover
>> bad blocks for individual pages which are bad and cannot be fixed.
>
> I have no problems with the behavior - the point is that these days
> this should be done without poking into the bio internals, but by
> using a bio iterator for just the range you want to re-read.
> Potentially using a bio clone if we can't reuse the existing bio,
> although I'm not sure we even need that from looking at the code.

Fair enough.  The code predates bio iterators, and "if it ain't broke,
don't fix it".  If it is now causing problems, then maybe it is now
"broke" and should be "fixed".

Thanks,
NeilBrown
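The r1buf_pool lifecycle under discussion - preallocate buffers when the resync starts, never depend on kmalloc afterwards, free the pool when the resync ends - can be modeled with a trivial fixed pool. This is a hedged userspace sketch, not the kernel mempool API: the names `buf_pool`, `pool_create`, `pool_alloc`, and `pool_free` are invented, and where the real `mempool_alloc` would block waiting for a buffer to be freed, this model simply returns NULL.

```c
#include <stdlib.h>

#define POOL_BUFS 4          /* per Neil: "4 is probably plenty" */
#define BUF_SIZE  (64 * 1024)

struct buf_pool {
    void *bufs[POOL_BUFS];
    int   nfree;
};

/* Model of creating the pool at the start of a resync: preallocate
 * all buffers up front so later allocations never depend on malloc. */
static struct buf_pool *pool_create(void)
{
    struct buf_pool *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->nfree = 0;
    for (int i = 0; i < POOL_BUFS; i++) {
        void *b = malloc(BUF_SIZE);
        if (b)
            p->bufs[p->nfree++] = b;
    }
    return p;
}

/* Model of mempool_alloc: hand out a preallocated buffer.  The real
 * mempool_alloc would sleep here instead of returning NULL. */
static void *pool_alloc(struct buf_pool *p)
{
    if (p->nfree == 0)
        return NULL;         /* kernel version blocks until pool_free */
    return p->bufs[--p->nfree];
}

static void pool_free(struct buf_pool *p, void *buf)
{
    p->bufs[p->nfree++] = buf;
}
```

Creating and destroying such a pool in explicit start/end-of-sync methods, as Christoph suggests, would make the lifecycle obvious; tying it to the first and last `->sync_request` calls, as the current code does, is equivalent but less visible.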
* Re: "creative" bio usage in the RAID code
From: Shaohua Li @ 2016-11-15 0:13 UTC
To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb

On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > > It's mostly about the RAID1 and RAID10 code, which does a lot of
> > > funny things with the bi_io_vec and bi_vcnt fields, which we'd
> > > prefer that drivers don't touch.  One example is the
> > > r1buf_pool_alloc code, which I think should simply use bio_clone
> > > for the MD_RECOVERY_REQUESTED case, which would also take care
> > > of r1buf_pool_free.  I'm not sure about all the other cases, as
> > > some bits don't fully make sense to me,
> >
> > The problem is that we use bi_io_vec to track the pages allocated.
> > We read data into the pages and write it out later for resync.  If
> > we add new fields to track the pages in the r1bio, we could use
> > the standard bio_kmalloc/bio_add_page API and avoid the tricky
> > parts.  This should work for both the resync and write-behind
> > cases.
>
> I don't think we need to track the pages specifically - if we clone
> a bio we share the bio_vec.  E.g. for the !MD_RECOVERY_REQUESTED
> case we do one bio_kmalloc, then bio_alloc_pages, then clone it for
> the other bios.  For MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Sure, for r1buf_pool_alloc what you suggested should work well.  There
are a lot of other places where we use bi_vcnt/bi_io_vec, and I'm not
sure it's easy to replace them all with bio iterators.  But having a
separate data structure to track the memory we read/rewrite/sync and
so on will definitely make things easier.  I'm not saying to add the
extra data structure to the bio, but to the r1bio instead.

Thanks,
Shaohua
* Re: "creative" bio usage in the RAID code
From: Ming Lei @ 2016-11-15 1:30 UTC
To: Shaohua Li; +Cc: Christoph Hellwig, open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block, NeilBrown

On Tue, Nov 15, 2016 at 8:13 AM, Shaohua Li <shli@kernel.org> wrote:
> On Sat, Nov 12, 2016 at 09:42:38AM -0800, Christoph Hellwig wrote:
>> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > > It's mostly about the RAID1 and RAID10 code, which does a lot
>> > > of funny things with the bi_io_vec and bi_vcnt fields, which
>> > > we'd prefer that drivers don't touch.  One example is the
>> > > r1buf_pool_alloc code, which I think should simply use
>> > > bio_clone for the MD_RECOVERY_REQUESTED case, which would also
>> > > take care of r1buf_pool_free.  I'm not sure about all the other
>> > > cases, as some bits don't fully make sense to me,
>> >
>> > The problem is that we use bi_io_vec to track the pages
>> > allocated.  We read data into the pages and write it out later
>> > for resync.  If we add new fields to track the pages in the
>> > r1bio, we could use the standard bio_kmalloc/bio_add_page API and
>> > avoid the tricky parts.  This should work for both the resync and
>> > write-behind cases.
>>
>> I don't think we need to track the pages specifically - if we clone
>> a bio we share the bio_vec.  E.g. for the !MD_RECOVERY_REQUESTED
>> case we do one bio_kmalloc, then bio_alloc_pages, then clone it for
>> the other bios.  For MD_RECOVERY_REQUESTED we do a bio_kmalloc +
>> bio_alloc_pages for each.
>
> Sure, for r1buf_pool_alloc what you suggested should work well.
> There are a lot of other places where we use bi_vcnt/bi_io_vec, and
> I'm not sure it's easy to replace them all with bio iterators.  But
> having a separate data structure to track the memory we
> read/rewrite/sync and so on will definitely make things easier.  I'm
> not saying to add the extra data structure to the bio, but to the
> r1bio instead.

From the viewpoint of multipage bvecs, r1buf_pool_alloc() is fine,
because the direct access to bi_vcnt/bi_io_vec only happens on a newly
allocated bio.

For the other cases, if pages aren't added to a bio via bio_add_page()
and the bio isn't cloned from somewhere, it should be safe to keep the
current direct access to bi_vcnt/bi_io_vec.  But it is cleaner to use
the bio iterator helpers than direct access.

Thanks,
Ming Lei
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 23:03 UTC
To: Christoph Hellwig, Shaohua Li; +Cc: linux-raid, linux-block

On Fri, Nov 11 2016, Christoph Hellwig wrote:
>
> Another, not quite as urgent, issue is how the RAID5 code abuses
> ->bi_phys_segments as an outstanding I/O counter, and I have no
> really good answer to that either.

I would suggest adding a "bi_dev_private" field to the bio which is
for use by the lowest-level driver (much as bi_private is for use by
the top-level initiator).  That could be in a union with any or all
of:
	unsigned int bi_phys_segments;
	unsigned int bi_seg_front_size;
	unsigned int bi_seg_back_size;

(Any driver that needs those would see a 'request' rather than a 'bio'
and so could use rq->special.)

raid5.c could then use bi_dev_private (or bi_special, or whatever it
is called).

NeilBrown
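Neil's suggestion amounts to letting the bottom driver reuse storage that only matters above the request boundary. A minimal sketch of the proposed layout follows; this is illustrative only - the union and its placement inside `struct bio` are hypothetical, and the three segment fields are simply copied from Neil's list.

```c
/* Hypothetical sketch of the proposal: the lowest-level driver reuses
 * the storage of the segment-accounting fields, which a driver that
 * sees requests (not bios) can keep in rq->special instead.  This is
 * not actual kernel code. */
union bio_driver_area {
    struct {
        unsigned int bi_phys_segments;
        unsigned int bi_seg_front_size;
        unsigned int bi_seg_back_size;
    } seg;
    void *bi_dev_private;   /* for the bottom driver, e.g. raid5 */
};
```

The union costs nothing on top of the existing three fields, which is the attraction; Christoph's counterpoint in the next message is that those fields may disappear entirely under the multipage bvec scheme, at which point any driver-private field becomes pure overhead.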
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14 8:51 UTC
To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block

On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
> I would suggest adding a "bi_dev_private" field to the bio which is
> for use by the lowest-level driver (much as bi_private is for use by
> the top-level initiator).  That could be in a union with any or all
> of:
>	unsigned int bi_phys_segments;
>	unsigned int bi_seg_front_size;
>	unsigned int bi_seg_back_size;
>
> (Any driver that needs those would see a 'request' rather than a
> 'bio' and so could use rq->special.)
>
> raid5.c could then use bi_dev_private (or bi_special, or whatever it
> is called).

All three of the above fields could go away with a full implementation
of the multipage bvec scheme, so any field for driver use would still
be overhead.  If it's just for raid5 it could be a smaller 16-bit (or
maybe even just 8-bit) one.
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14 9:43 UTC
To: Christoph Hellwig; +Cc: Shaohua Li, linux-raid, linux-block

On Mon, Nov 14 2016, Christoph Hellwig wrote:
> On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
>> I would suggest adding a "bi_dev_private" field to the bio which is
>> for use by the lowest-level driver (much as bi_private is for use
>> by the top-level initiator).  That could be in a union with any or
>> all of:
>>	unsigned int bi_phys_segments;
>>	unsigned int bi_seg_front_size;
>>	unsigned int bi_seg_back_size;
>>
>> (Any driver that needs those would see a 'request' rather than a
>> 'bio' and so could use rq->special.)
>>
>> raid5.c could then use bi_dev_private (or bi_special, or whatever
>> it is called).
>
> All three of the above fields could go away with a full
> implementation of the multipage bvec scheme, so any field for driver
> use would still be overhead.  If it's just for raid5 it could be a
> smaller 16-bit (or maybe even just 8-bit) one.

We currently store two counters in that field, and before commit
5b99c2ffa980528a197f26 one of the fields was only 8 bits, and that
caused problems.

We could possibly use __bi_remaining in place of
raid5_X_bi_active_stripes().  It wouldn't be a completely
straightforward conversion, but I think it could be made to work.

We *might* be able to use bvec_iter_advance() in place of
raid5_bi_processed_stripes().  A careful audit of the code would be
needed to be certain.

NeilBrown
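The "two counters in one field" arrangement Neil refers to can be shown concretely. This is a hedged userspace sketch of the general pattern, not the raid5 code: the helper names are invented, and which 16-bit half raid5 actually uses for which counter is not something this example tries to reproduce. It only shows why 16 bits per counter (rather than the 8 that once caused problems) and atomic updates are needed when two counters share a single 32-bit word.

```c
#include <stdatomic.h>

/* Two 16-bit counters packed into one 32-bit word, in the spirit of
 * raid5's use of ->bi_phys_segments: an active-I/O count in one half
 * and a processed-stripe count in the other. */
typedef _Atomic unsigned int packed_counters;

static unsigned int active_count(packed_counters *c)
{
    return atomic_load(c) & 0xffff;
}

static unsigned int processed_count(packed_counters *c)
{
    return (atomic_load(c) >> 16) & 0xffff;
}

static void inc_active(packed_counters *c)
{
    atomic_fetch_add(c, 1);
}

/* Returns the active count after the decrement, so the caller can
 * complete the bio when it drops to zero. */
static unsigned int dec_active(packed_counters *c)
{
    return (atomic_fetch_sub(c, 1) - 1) & 0xffff;
}

static void inc_processed(packed_counters *c)
{
    atomic_fetch_add(c, 1u << 16);
}
```

A single atomic word keeps both counters consistent without a lock, but caps each at 65535, which is exactly the kind of limit that bit commit 5b99c2ff's 8-bit predecessor.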