From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@fb.com>
Cc: Ming Lei <tom.leiming@gmail.com>,
	Omar Sandoval <osandov@osandov.com>,
	linux-block <linux-block@vger.kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Omar Sandoval <osandov@fb.com>
Subject: Re: [PATCH 1/4] blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG
Date: Thu, 4 May 2017 10:51:56 +0800
Message-ID: <20170504025150.GA16218@ming.t460p>
In-Reply-To: <b7f4f8cf-30d1-8995-4256-c96184522992@fb.com>

On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
> On 05/03/2017 08:01 PM, Ming Lei wrote:
> > On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> >>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> >>>>> When blk-mq I/O scheduler is used, we need two tags for
> >>>>> submitting one request. One is called scheduler tag for
> >>>>> allocating request and scheduling I/O, another one is called
> >>>>> driver tag, which is used for dispatching IO to hardware/driver.
> >>>>> This way introduces one extra per-queue allocation for both tags
> >>>>> and request pool, and may not be as efficient as case of none
> >>>>> scheduler.
> >>>>>
> >>>>> Also currently we put a default per-hctx limit on schedulable
> >>>>> requests, and this limit may be a bottleneck for some devices,
> >>>>> especially when these devices have a quite big tag space.
> >>>>>
> >>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> >>>>> allow to use hardware/driver tags directly for IO scheduling if
> >>>>> device's hardware tag space is big enough. Then we can avoid
> >>>>> the extra resource allocation and make IO submission more
> >>>>> efficient.
> >>>>>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>> ---
> >>>>>  block/blk-mq-sched.c   | 10 +++++++++-
> >>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> >>>>>  include/linux/blk-mq.h |  1 +
> >>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
> >>>>
> >>>> One more note on this: if we're using the hardware tags directly, then
> >>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
> >>>> we're limited to the hw queue depth. We probably want to maintain the
> >>>> original behavior,
> >>>
> >>> That need further investigation, and generally scheduler should be happy with
> >>> more requests which can be scheduled.
> >>>
> >>> We can make it as one follow-up.
> >>
> >> If we say nr_requests is 256, then we should honor that. So either
> >> update nr_requests to reflect the actual depth we're using or resize the
> >> hardware tags.
> > 
> > Firstly nr_requests is set as 256 from blk-mq inside instead of user
> > space, it won't be a big deal to violate that.
> 
> The legacy scheduling layer used 2*128 by default, that's why I used the
> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
> 256, we must honor that. Users will tweak this value down to trade peak
> performance for latency, it's important that it does what it advertises.

When scheduling with hw tags, the scheduler and dispatch share the same
tags, so if we resize the tags (in practice, only decrease them), the
dispatch space (hw tags) shrinks too. That means the usable device tag
space would have to be reduced considerably.

> 
> > Secondly, when there is enough tags available, it might hurt
> > performance if we don't use them all.
> 
> That's mostly bogus. Crazy large tag depths have only one use case -
> synthetic peak performance benchmarks from manufacturers. We don't want
> to allow really deep queues. Nothing good comes from that, just a lot of
> pain and latency issues.

Given that the device provides such a high queue depth, it might be
reasonable to just allow all of it to be used. Take NVMe for example:
once an mq scheduler is enabled, the usable device tag space is just 256
by default, even though the hardware provides a very big tag space
(>= 10K).

The problem is that the lifetime of a sched tag is the same as the
request's lifetime (from submission to completion), so it covers the
lifetime of the device tag. In theory the sched tag should be freed
right after the rq is dispatched to the driver. Unfortunately we can't do
that, because the request is allocated from the sched tag set.
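
To make the lifetime overlap concrete, here is a toy userspace model. It
is only a sketch and not kernel code: the struct, the toy_* function
names and the plain counters are all made up for illustration, and real
blk-mq tag allocation works differently. It assumes a request takes a
sched tag at allocation, takes a driver tag at dispatch, and gives both
back only at completion, so the number of requests in the device can
never exceed the smaller of the two depths.

```c
#include <assert.h>

/* Toy model (hypothetical, not kernel code): one counter per tag pool. */
struct toy_tag_pools {
	unsigned sched_depth, hw_depth;	/* pool sizes */
	unsigned sched_used, hw_used;	/* tags currently held */
};

/* Allocating a request takes a sched tag; returns 0 if none is free. */
int toy_alloc_request(struct toy_tag_pools *p)
{
	if (p->sched_used == p->sched_depth)
		return 0;
	p->sched_used++;
	return 1;
}

/* Dispatch takes a driver tag; the sched tag is still held. */
int toy_dispatch(struct toy_tag_pools *p)
{
	/* There must be an allocated request that is not dispatched yet. */
	assert(p->sched_used > p->hw_used);
	if (p->hw_used == p->hw_depth)
		return 0;
	p->hw_used++;
	return 1;
}

/* Completion is the only point where both tags are released. */
void toy_complete(struct toy_tag_pools *p)
{
	assert(p->hw_used > 0);
	p->hw_used--;
	p->sched_used--;
}

/*
 * Submit up to nr requests with no completions in between and
 * return how many end up holding a driver tag at the same time.
 */
unsigned toy_peak_in_flight(unsigned sched_depth, unsigned hw_depth,
			    unsigned nr)
{
	struct toy_tag_pools p = { sched_depth, hw_depth, 0, 0 };
	unsigned i;

	for (i = 0; i < nr; i++) {
		if (!toy_alloc_request(&p))
			break;	/* out of sched tags */
		if (!toy_dispatch(&p))
			break;	/* out of driver tags */
	}
	return p.hw_used;
}
```

With a sched depth of 256 and an assumed hw depth of 10240 (one possible
value for ">= 10K"), toy_peak_in_flight(256, 10240, 100000) stops at
256: the sched pool runs dry long before the driver pool does.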

> 
> The most important part is actually that the scheduler has a higher
> depth than the device, as mentioned in an email from a few days ago. We

I agree with this point, but:

Unfortunately, in the case of NVMe or other high-depth devices, the
default scheduler queue depth (256) is much less than the device depth.
Do we need to adjust the default value for these devices? In theory, the
default scheduler depth of 256 may hurt performance on these devices,
since the device tag space is heavily under-utilized.
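
As a rough illustration of how small that fraction is, the helper below
computes the share of the device tag space that a given scheduler depth
can reach, in parts per thousand. The function name is hypothetical and
10240 is only an assumed stand-in for ">= 10K"; actual device depths
vary.

```c
#include <assert.h>

/*
 * Illustrative arithmetic only, not a kernel helper: the share of
 * the device tag space reachable with a given scheduler depth, in
 * parts per thousand.
 */
unsigned toy_tag_utilization_permille(unsigned sched_depth,
				      unsigned hw_depth)
{
	unsigned usable;

	assert(hw_depth > 0);
	/* In-flight requests are capped by the smaller of the two depths. */
	usable = sched_depth < hw_depth ? sched_depth : hw_depth;
	/* Widen before multiplying so large depths cannot overflow. */
	return (unsigned)((unsigned long long)usable * 1000 / hw_depth);
}
```

For a scheduler depth of 256 and an assumed device depth of 10240 this
gives 25, i.e. 2.5% of the device tag space.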


Thanks,
Ming
