From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Date: Mon, 19 Sep 2016 12:38:05 +0200
From: Alexander Gordeev <agordeev@redhat.com>
To: Keith Busch
Cc: linux-kernel@vger.kernel.org, Jens Axboe, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org
Subject: Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
Message-ID: <20160919103805.GA22169@agordeev.lab.eng.brq.redhat.com>
References: <20160916210448.GA1178@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20160916210448.GA1178@localhost.localdomain>
List-ID:

On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:

CC-ing linux-block@vger.kernel.org

> I'm not sure I see how this helps. That probably means I'm not considering
> the right scenario. Could you elaborate on when having multiple hardware
> queues to choose from a given CPU will provide a benefit?

No, I do not have any particular scenario in mind beyond common sense.
It is just an assumption that deeper queues are better (in this RFC, a
virtual combined queue consisting of multiple h/w queues).

Apparently, there could be a positive effect only on systems where
# of queues / # of CPUs > 1 or # of queues / # of cores > 1, and I do
not happen to have such a system. If I had numbers, this would not be
an RFC and I probably would not have posted it in the first place ;)

Would it be possible to give it a try on your hardware?

> If we're out of available h/w tags, having more queues shouldn't
> improve performance. The tag depth on each nvme hw context is already
> deep enough that it should mean even one full queue has saturated the
> device capabilities.

Am I getting you right - does a single full NVMe hardware queue stall
the other queues?

> Having a 1:1 already seemed like the ideal solution since you can't
> simultaneously utilize more than that from the host, so there's no more
> h/w parallelism we can exploit. On the controller side, fetching
> commands is serialized memory reads, so I don't think spreading IO
> among more h/w queues helps the target over posting more commands to a
> single queue.

I take note of the un-ordered command completion you describe below.
But I fail to see why a CPU would not simultaneously utilize more than
one queue by posting to multiple queues. Is that due to NVMe specifics,
or do you assume the host would not issue that many commands?

Besides, blk-mq-tag re-uses the most recently freed tag, so IO should
not actually get spread. Instead, only if the currently used hardware
queue is full is the next available queue chosen. But this is
speculation without real benchmarks, of course.
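
To make the selection policy concrete, here is a rough userspace sketch
(purely illustrative, not the actual RFC code; all names such as
hw_queue and pick_queue are made up). It only models the idea above:
stay on the CPU's home hardware queue and spill over to the next queue
of the combined set once the home queue has no free tags.

/*
 * Illustrative userspace model only -- not the RFC code and not the
 * blk-mq API.  Commands stay on the home queue until it is full, then
 * the next queue of the combined set is used.
 */
#include <stdio.h>

#define NR_HW_QUEUES	4
#define QUEUE_DEPTH	2	/* tiny depth so the spill-over is visible */

struct hw_queue {
	int in_flight;		/* commands currently holding a tag */
};

static struct hw_queue queues[NR_HW_QUEUES];

/* Pick the home queue if it has a free tag, otherwise the next one. */
static int pick_queue(int home)
{
	int i, q;

	for (i = 0; i < NR_HW_QUEUES; i++) {
		q = (home + i) % NR_HW_QUEUES;
		if (queues[q].in_flight < QUEUE_DEPTH)
			return q;
	}
	return -1;	/* all queues full: the caller would have to wait */
}

int main(void)
{
	int home = 0;	/* the queue this CPU maps to */
	int i, q;

	/* Issue a few commands without completing any of them. */
	for (i = 0; i < 5; i++) {
		q = pick_queue(home);
		if (q < 0) {
			printf("cmd %d: all queues full\n", i);
			continue;
		}
		queues[q].in_flight++;
		printf("cmd %d -> hw queue %d\n", i, q);
	}
	return 0;
}

With a depth of two, the first two commands land on queue 0 and only
the following ones spill over to queues 1 and 2, which is the behaviour
I would expect from the combined queue.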

> If a CPU has more than one to choose from, a command sent to a less
> used queue would be serviced ahead of previously issued commands on a
> more heavily used one from the same CPU thread due to how NVMe command
> arbitration works, so it sounds like this would create odd latency
> outliers.

Yep, that sounds scary indeed. Still, any hints on benchmarking are
welcome.

Many thanks!

> Thanks,
> Keith