* scsi-mq - tag# and can_queue, performance.
@ 2017-04-03  6:37 Arun Easi
  2017-04-03  7:29 ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread

From: Arun Easi @ 2017-04-03 6:37 UTC (permalink / raw)
To: linux-scsi, Jens Axboe

Hi Folks,

I would like to seek your input on a few topics on SCSI / block multi-queue.

1. Tag# generation.

The context is with SCSI MQ on. My question is: what should a LLD do to get request tag values in the range 0 through can_queue - 1 across *all* of the queues? In our QLogic 41XXX series of adapters, we have a per-session submit queue, a shared task memory (shared across all queues) and N completion queues (separate MSI-X vectors). We report N as the nr_hw_queues. I would like to, if possible, use the block layer tags to index into the above shared task memory area.

From looking at the scsi/block source, it appears that when a LLD reports a value, say #C, in can_queue (via scsi_host_template), that value is used as the max depth when the corresponding block layer queues are created. So, while SCSI restricts the number of commands to the LLD at #C, the request tag generated on any of the queues can range from 0..#C-1. Please correct me if I got this wrong.

If the above is true, then for a LLD to get tag#s within its max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and, in the I/O path, use the tag and hwq# to arrive at an index# to use. This, though, leads to poor use of tag resources -- a queue can reach its capacity while the LLD could still take more commands.

blk_mq_unique_tag() would not work here, because it just puts the hwq# in the upper 16 bits, which need not fall in the max-tasks range.

Perhaps the current MQ model is to cater to a queue-pair (submit/completion) kind of hardware model; nevertheless I would like to know how other hardware variants can make use of it.

2. mq vs non-mq performance gain.

This is more like a poll, I guess. I was wondering what performance gains folks are observing with SCSI MQ on.
I saw Christoph H.'s slide deck that has one slide showing a 200k IOPS gain.

From my testing, though, I was not lucky enough to observe that big of a change. In fact, the difference was not even noticeable(*). For e.g., a 512-byte random read test in both cases gave me in the vicinity of 2M IOPS. When I say both cases, I mean one run with scsi_mod's use_blk_mq set to 0 and another with it set to 1 (the LLD is reloaded in between). I only used one NUMA node for this run. The test was run on an x86_64 setup.

* See item 3 for a special handling.

3. add_random slowness.

One thing I observed with MQ on and off was the block layer tunable add_random, which as I understand it is there to tune the disk's entropy contribution. With non-MQ, it is turned on, and with MQ, it is turned off by default.

This got noticed because, when I was running multi-port testing, there was a big difference in IOPS with and without MQ (~200K IOPS vs 1M+ IOPS, whether the test ran on the same NUMA node or across NUMA nodes).

Just wondering why we have it ON in one setting and OFF in another.

Sorry for the rather long e-mail, but your comments/thoughts are much appreciated.

Regards,
-Arun
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03  6:37 scsi-mq - tag# and can_queue, performance Arun Easi
@ 2017-04-03  7:29 ` Hannes Reinecke
  2017-04-03 15:20   ` Bart Van Assche
  2017-04-03 16:31   ` Arun Easi
  0 siblings, 2 replies; 7+ messages in thread

From: Hannes Reinecke @ 2017-04-03 7:29 UTC (permalink / raw)
To: Arun Easi, linux-scsi, Jens Axboe

On 04/03/2017 08:37 AM, Arun Easi wrote:
> Hi Folks,
>
> I would like to seek your input on a few topics on SCSI / block multi-queue.
>
> 1. Tag# generation.
>
> The context is with SCSI MQ on. My question is, what should a LLD do to get request tag values in the range 0 through can_queue - 1 across *all* of the queues. In our QLogic 41XXX series of adapters, we have a per session submit queue, a shared task memory (shared across all queues) and N completion queues (separate MSI-X vectors). We report N as the nr_hw_queues. I would like to, if possible, use the block layer tags to index into the above shared task memory area.
>
> From looking at the scsi/block source, it appears that when a LLD reports a value say #C, in can_queue (via scsi_host_template), that value is used as the max depth when corresponding block layer queues are created. So, while SCSI restricts the number of commands to LLD at #C, the request tag generated across any of the queues can range from 0..#C-1. Please correct me if I got this wrong.
>
> If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
>
Yep.

> blk_mq_unique_tag() would not work here, because it just puts the hwq# in the upper 16 bits, which need not fall in the max-tasks range.
>
> Perhaps the current MQ model is to cater to a queue pair (submit/completion) kind of hardware model; nevertheless I would like to know how other hardware variants can makes use of it.
>
He. Welcome to the club.

Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed to have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.

If we were to continue with the tag to hardware ID mapping, we would need to implement a dynamic tag space mapping onto hardware queues. My idea for that would be to map not the entire tag space, but rather the individual bit words, onto the hardware queues. Then we could make the mapping dynamic, where the individual words are mapped onto the queues only as needed.

However, the _one_ big problem we're facing here is timeouts. With the 1:1 mapping between tags and hardware IDs we can only re-use the tag once the timeout is _definitely_ resolved. But this means the command will be active, and we cannot call blk_mq_complete() until the timeout itself has been resolved. With FC this essentially means until the corresponding XIDs are safe to re-use, ie after all ABTS/RRQ etc processing has been completed. Hence we totally lose the ability to return the command itself with -ETIMEDOUT and continue with I/O processing even though the original XID is still being held by firmware.

In the light of this I wonder if it wouldn't be better to completely decouple block-layer tags and hardware IDs, and have an efficient algorithm mapping the block-layer tags onto hardware IDs. That should avoid the arbitrary tag-starvation problem, and would allow us to handle timeouts efficiently.

Of course, we don't _have_ such an efficient algorithm; maybe it's time to have a generic one within the kernel, as quite some drivers would _love_ to just use the generic implementation here.
(qla2xxx, lpfc, fcoe, mpt3sas etc. all suffer from the same problem.)

> 2. mq vs non-mq performance gain.
>
> This is more like a poll, I guess. I was wondering what performance gains folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that has one slide that shows a 200k IOPS gain.
>
> From my testing, though, I was not lucky to observe that big of a change. In fact, the difference was not even noticeable(*). For e.g., for 512 bytes random read test, in both cases, gave me in the vicinity of 2M IOPs. When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 and another with 1 (LLD is reloaded when it is done). I only used one NUMA node for this run. The test was run on a x86_64 setup.
>
You _really_ should have listened to my talk at VAULT.

For 'legacy' HBAs there indeed is not much of a performance gain to be had; the max gain is indeed for heavy parallel I/O. And there even is a scheduler issue when running with a single submission thread; there I've measured a performance _drop_ by up to 50%. Which, as Jens claims, really looks like a block-layer issue rather than a generic problem.

> * See item 3 for a special handling.
>
> 3. add_random slowness.
>
> One thing I observed with MQ on and off was this block layer tunable, add_random, which as I understand is to tune disk entropy contribution. With non-MQ, it is turned on, and with MQ, it is turned off by default.
>
> This got noticed because, when I was running multi-port testing, there was a big drop in IOPs with and without MQ (~200K IOPS to 1M+ IOPs when the test ran on same NUMA node / across NUMA nodes).
>
> Just wondering why we have it ON on one setting and OFF on another.
>
> Sorry for the rather long e-mail, but your comments/thoughts are much appreciated.
>
You definitely want to use the automatic IRQ-affinity patches from Christoph; that proved to be a major gain in high-performance setups (eg when running off an all-flash array).
Overall, I'm very much interested in these topics; let's continue with the discussion to figure out what the best approach here might be.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   Teamlead Storage & Networking
hare@suse.de                                      +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03  7:29 ` Hannes Reinecke
@ 2017-04-03 15:20   ` Bart Van Assche
  2017-04-03 16:41     ` Arun Easi
  2017-04-03 16:31   ` Arun Easi
  1 sibling, 1 reply; 7+ messages in thread

From: Bart Van Assche @ 2017-04-03 15:20 UTC (permalink / raw)
To: arun.easi, linux-scsi, hare, axboe

On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
>
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.

Hello Arun and Hannes,

Apparently the current blk_mq_alloc_tag_set() implementation is well suited for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers. How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to allocate a single set of tags for all hardware queues, and also adding a flag to struct scsi_host_template such that SCSI LLDs can enable this behavior?

Bart.
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03 15:20   ` Bart Van Assche
@ 2017-04-03 16:41     ` Arun Easi
  2017-04-03 16:47       ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread

From: Arun Easi @ 2017-04-03 16:41 UTC (permalink / raw)
To: Bart Van Assche; +Cc: linux-scsi, hare, axboe

On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> > On 04/03/2017 08:37 AM, Arun Easi wrote:
> > > If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
> >
> > Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.
>
> Hello Arun and Hannes,
>
> Apparently the current blk_mq_alloc_tag_set() implementation is well suited for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers. How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to allocate a single set of tags for all hardware queues and also to add a flag to struct scsi_host_template such that SCSI LLDs can enable this behavior?
>
Hi Bart,

This would certainly be beneficial in my case. Moreover, it certainly makes sense to move the logic up to where multiple drivers can leverage it.

Perhaps the percpu_ida* interfaces could be used to do that, but I think I read somewhere that they are not efficient (enough?), and that is the reason block tags went the current way.

Regards,
-Arun
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03 16:41     ` Arun Easi
@ 2017-04-03 16:47       ` Jens Axboe
  2017-04-03 16:51         ` Arun Easi
  0 siblings, 1 reply; 7+ messages in thread

From: Jens Axboe @ 2017-04-03 16:47 UTC (permalink / raw)
To: Arun Easi, Bart Van Assche; +Cc: linux-scsi, hare

On 04/03/2017 10:41 AM, Arun Easi wrote:
> On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
>
>> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>>>> If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
>>>
>>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.
>>
>> Hello Arun and Hannes,
>>
>> Apparently the current blk_mq_alloc_tag_set() implementation is well suited for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers. How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to allocate a single set of tags for all hardware queues and also to add a flag to struct scsi_host_template such that SCSI LLDs can enable this behavior?
>>
> Hi Bart,
>
> This would certainly be beneficial in my case. Moreover, it certainly makes sense to move the logic up where multiple drivers can leverage.
>
> Perhaps, use percpu_ida* interfaces to do that, but I think I read somewhere that, it is not efficient (enough?) and is the reason to go the current way for block tags.
You don't have to change the underlying tag generation to solve this problem; Bart already pretty much outlined a fix that would work. percpu_ida works fine if you never use more than roughly half the available space, but it's a poor fit for request tags, where we want to retain good behavior and scaling at or near tag exhaustion. That's why blk-mq ended up rolling its own, which is now generically available as lib/sbitmap.c.

--
Jens Axboe
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03 16:47       ` Jens Axboe
@ 2017-04-03 16:51         ` Arun Easi
  0 siblings, 0 replies; 7+ messages in thread

From: Arun Easi @ 2017-04-03 16:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bart Van Assche, linux-scsi, hare

On Mon, 3 Apr 2017, 9:47am, Jens Axboe wrote:
> On 04/03/2017 10:41 AM, Arun Easi wrote:
> > On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> >
> >> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> >>> On 04/03/2017 08:37 AM, Arun Easi wrote:
> >>>> If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
> >>>
> >>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.
> >>
> >> Hello Arun and Hannes,
> >>
> >> Apparently the current blk_mq_alloc_tag_set() implementation is well suited for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers. How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to allocate a single set of tags for all hardware queues and also to add a flag to struct scsi_host_template such that SCSI LLDs can enable this behavior?
> >>
> > Hi Bart,
> >
> > This would certainly be beneficial in my case. Moreover, it certainly makes sense to move the logic up where multiple drivers can leverage.
> >
> > Perhaps, use percpu_ida* interfaces to do that, but I think I read somewhere that, it is not efficient (enough?) and is the reason to go the current way for block tags.
>
> You don't have to change the underlying tag generation to solve this problem, Bart already pretty much outlined a fix that would work. percpu_ida works fine if you never use more than roughly half the available space, it's a poor fit for request tags where we want to retain good behavior and scaling at or near tag exhaustion. That's why blk-mq ended up rolling its own, which is now generically available as lib/sbitmap.c.
>
Sounds good. Thanks for the education, Jens.

Regards,
-Arun
* Re: scsi-mq - tag# and can_queue, performance.
  2017-04-03  7:29 ` Hannes Reinecke
  2017-04-03 15:20   ` Bart Van Assche
@ 2017-04-03 16:31   ` Arun Easi
  1 sibling, 0 replies; 7+ messages in thread

From: Arun Easi @ 2017-04-03 16:31 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: linux-scsi, Jens Axboe

On Mon, 3 Apr 2017, 12:29am, Hannes Reinecke wrote:
> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > Hi Folks,
> >
> > I would like to seek your input on a few topics on SCSI / block multi-queue.
> >
> > 1. Tag# generation.
> >
> > The context is with SCSI MQ on. My question is, what should a LLD do to get request tag values in the range 0 through can_queue - 1 across *all* of the queues. In our QLogic 41XXX series of adapters, we have a per session submit queue, a shared task memory (shared across all queues) and N completion queues (separate MSI-X vectors). We report N as the nr_hw_queues. I would like to, if possible, use the block layer tags to index into the above shared task memory area.
> >
> > From looking at the scsi/block source, it appears that when a LLD reports a value say #C, in can_queue (via scsi_host_template), that value is used as the max depth when corresponding block layer queues are created. So, while SCSI restricts the number of commands to LLD at #C, the request tag generated across any of the queues can range from 0..#C-1. Please correct me if I got this wrong.
> >
> > If the above is true, then for a LLD to get tag# within it's max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- queue reaching it's capacity while LLD can still take it.
> >
> Yep.
>
> > blk_mq_unique_tag() would not work here, because it just puts the hwq# in the upper 16 bits, which need not fall in the max-tasks range.
> >
> > Perhaps the current MQ model is to cater to a queue pair (submit/completion) kind of hardware model; nevertheless I would like to know how other hardware variants can makes use of it.
> >
> He. Welcome to the club.
>
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. With the mentioned tag-starvation problem.
>
> If we were to continue with the tag to hardware ID mapping, we would need to implement a dynamic tag space mapping onto hardware queues. My idea for that would be to map not the entire tag space, but rather the individual bit words, onto the hardware queues. Then we could make the mapping dynamic, where the individual words are mapped onto the queues only as needed.
>
> However, the _one_ big problem we're facing here is timeouts. With the 1:1 mapping between tags and hardware IDs we can only re-use the tag once the timeout is _definitely_ resolved. But this means the command will be active, and we cannot call blk_mq_complete() until the timeout itself has been resolved. With FC this essentially means until the corresponding XIDs are safe to re-use, ie after all ABTS/RRQ etc processing has been completed. Hence we totally lose the ability to return the command itself with -ETIMEDOUT and continue with I/O processing even though the original XID is still being held by firmware.
>
> In the light of this I wonder if it wouldn't be better to completely decouple block-layer tags and hardware IDs, and have an efficient algorithm mapping the block-layer tags onto hardware IDs. That should avoid the arbitrary tag starvation problem, and would allow us to handle timeouts efficiently.
> Of course, we don't _have_ such an efficient algorithm; maybe it's time to have a generic one within the kernel as quite some drivers would _love_ to just use the generic implementation here.
> (qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem)
>
> > 2. mq vs non-mq performance gain.
> >
> > This is more like a poll, I guess. I was wondering what performance gains folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that has one slide that shows a 200k IOPS gain.
> >
> > From my testing, though, I was not lucky to observe that big of a change. In fact, the difference was not even noticeable(*). For e.g., for 512 bytes random read test, in both cases, gave me in the vicinity of 2M IOPs. When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 and another with 1 (LLD is reloaded when it is done). I only used one NUMA node for this run. The test was run on a x86_64 setup.
> >
> You _really_ should have listened to my talk at VAULT.

Would you have a slide deck / minutes that could be shared?

> For 'legacy' HBAs there indeed is not much of a performance gain to be had; the max gain is indeed for heavy parallel I/O.

I have multiple devices (I-T nexuses) in my setup, so definitely there are parallel I/Os.

> And there even is a scheduler issue when running with a single submission thread; there I've measured a performance _drop_ by up to 50%. Which, as Jens claims, really looks like a block-layer issue rather than a generic problem.
>
> > * See item 3 for a special handling.
> >
> > 3. add_random slowness.
> >
> > One thing I observed with MQ on and off was this block layer tunable, add_random, which as I understand is to tune disk entropy contribution. With non-MQ, it is turned on, and with MQ, it is turned off by default.
> >
> > This got noticed because, when I was running multi-port testing, there was a big drop in IOPs with and without MQ (~200K IOPS to 1M+ IOPs when the test ran on same NUMA node / across NUMA nodes).
> >
> > Just wondering why we have it ON on one setting and OFF on another.
> >
> > Sorry for the rather long e-mail, but your comments/thoughts are much appreciated.
> >
> You definitely want to use the automatic IRQ-affinity patches from Christoph; that proved to be a major gain in high-performance setups (eg when running off an all-flash array).

That change is not yet present in the driver. I was using irqbalance (oneshot) / custom-script to try out various MSI-X vector to CPU mappings in the meantime.

Regards,
-Arun

> Overall, I'm very much interested in these topics; let's continue with the discussion to figure out what the best approach here might be.
>
> Cheers,
>
> Hannes
end of thread, other threads: [~2017-04-03 16:51 UTC | newest]

Thread overview: 7+ messages -- links below jump to the message on this page:
2017-04-03  6:37 scsi-mq - tag# and can_queue, performance Arun Easi
2017-04-03  7:29 ` Hannes Reinecke
2017-04-03 15:20   ` Bart Van Assche
2017-04-03 16:41     ` Arun Easi
2017-04-03 16:47       ` Jens Axboe
2017-04-03 16:51         ` Arun Easi
2017-04-03 16:31   ` Arun Easi