From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Tim Walker <tim.t.walker@seagate.com>,
	linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] NVMe HDD
Date: Thu, 13 Feb 2020 08:24:36 +0000
Message-ID: <BYAPR04MB58160C04182D5FE3A15842BBE71A0@BYAPR04MB5816.namprd04.prod.outlook.com>
In-Reply-To: 20200213075348.GA9144@ming.t460p

On 2020/02/13 16:54, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
>> Ming,
>>
>> On 2020/02/13 7:03, Ming Lei wrote:
>>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>>>> On 2020/02/12 4:01, Tim Walker wrote:
>>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>>>> Background:
>>>>>>>
>>>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>>>> are well integrated into our customers’ systems. As we look forward,
>>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>>>> stack, consolidating on a single access method for rotational and
>>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>>>> optimal interoperability for storage automation, tiering, and
>>>>>>> management. We will share some early conceptual results and proposed
>>>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>>>
>>>>>> HDD performance is very sensitive to IO order. Could you provide some
>>>>>> background info about NVMe HDD? Such as:
>>>>>>
>>>>>> - number of hw queues
>>>>>> - hw queue depth
>>>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Discussion Proposal:
>>>>>>>
>>>>>>> We’d like to share our views and solicit input on:
>>>>>>>
>>>>>>> -What Linux storage stack assumptions do we need to be aware of as we
>>>>>>> develop these devices with drastically different performance
>>>>>>> characteristics than traditional NAND? For example, what scheduler or
>>>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>>>
>>>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>>>>>> triggered for NVMe SSD.
>>>>>>
>>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>>>> writeback performance regression[1][2].
>>>>>>
>>>>>> What I am wondering is whether we need to switch to independent IO
>>>>>> paths for handling SSD and HDD IO, given that the two media are so
>>>>>> different from a performance viewpoint.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
>>>>>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>>
>>>>>
>>>>> I would expect the drive would support a reasonable number of queues
>>>>> and a relatively deep queue depth, more in line with NVMe practices
>>>>> than SAS HDD's typical 128. But it probably doesn't make sense to
>>>>> queue up thousands of commands on something as slow as an HDD, and
>>>>> many customers keep queues < 32 for latency management.
>>>>
>>>> Exposing an HDD through multiple queues, each with a high queue depth, is
>>>> simply asking for trouble. Commands will end up spending so much time
>>>> sitting in the queues that they will time out. This can already be observed
>>>> with the smartpqi SAS HBA, which exposes single drives as multiqueue block
>>>> devices with high queue depth. Exercising these drives heavily leads to
>>>> thousands of commands being queued and to timeouts. It is fairly easy to
>>>> trigger this without a manual change to the QD. This has been on my to-do
>>>> list of fixes for some time now (lacking time to do it).
>>>
>>> Just wondering why smartpqi SAS doesn't set a proper .cmd_per_lun to
>>> avoid the issue. It looks like the driver simply assigns .can_queue to it,
>>> so it isn't strange to see the timeout issue. If .can_queue is a bit
>>> big, the HDD is easily saturated for too long.
>>>
>>>>
>>>> NVMe HDDs need to have an interface setup that matches their speed, that is,
>>>> something like a SAS interface: *single* queue pair with a max QD of 256 or
>>>> less depending on what the drive can take. There is no TASK_SET_FULL
>>>> notification on NVMe, so throttling has to come from the max QD of the SQ,
>>>> which the drive will advertise to the host.
>>>>
>>>>> Merge and elevator are important to HDD performance. I don't believe
>>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>>>> within a SQ without driving large differences between SSD & HDD data
>>>>> paths?
>>>>
>>>> As far as I know, there is no merging going on once requests are passed to
>>>> the driver and added to an SQ. So this is beside the point.
>>>> The current default block scheduler for NVMe SSDs is "none". This is
>>>> decided based on the number of queues of the device. NVMe drives that
>>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
>>>> request queue can fall back to the default spinning rust mq-deadline
>>>> elevator. That will achieve the command merging and LBA ordering needed for
>>>> good performance on HDDs.
>>>
>>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
>>> .queue_rq(), or if blk_mq_get_dispatch_budget always returns true. NVMe's
>>> .queue_rq() basically always returns STS_OK.
>>
>> I am confused: when an elevator is set, ->queue_rq() is called for requests
>> obtained from the elevator (with e->type->ops.dispatch_request()), after
>> the requests went through it. And merging will happen at that stage when
>> new requests are inserted in the elevator.
> 
> When a request is queued to the LLD via .queue_rq(), the request has
> already been removed from the scheduler queue. And IO merge is only done
> inside or against the scheduler queue.

Yes, for incoming new BIOs, not for requests passed to the LLD.

>> If ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
>> request is indeed requeued, which offers more chances of further merging, but
>> that is not the same as no merging happening.
>> Am I missing your point here?
> 
> BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
> thought of as device saturation feedback: more requests can then be
> gathered in the scheduler queue, since we don't dequeue requests from
> the scheduler queue when that happens, and IO merge becomes possible.
> 
> Without any device saturation feedback from the driver, the block layer
> just dequeues requests from the scheduler queue at the same speed as
> submission to hardware, so no IO can be merged.

Got it. And since queue full will mean no more tags, submission will block
on get_request() and there will be no chance in the elevator to merge
anything (aside from opportunistic merging in plugs), right?
So I guess NVMe HDDs will need some tuning in this area.
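
To make the saturation-feedback argument concrete, here is a toy model. It
is only a sketch and not kernel code: simulate(), device_depth,
service_ticks and max_merge are hypothetical names made up for this
illustration, and the model assumes one sequential single-block BIO arrives
per tick and that only back-merging against the tail of the scheduler queue
is attempted.

```python
from collections import deque


def simulate(num_bios, device_depth, service_ticks, max_merge=8):
    """Toy model: return (merges, dispatched_requests).

    One sequential single-block BIO arrives per tick. A BIO is back-merged
    into the last request still sitting in the scheduler queue when it is
    contiguous with it; otherwise it becomes a new request. Requests are
    dequeued only while the device has a free slot (fewer than
    device_depth commands in flight), which stands in for the
    "device saturation feedback" discussed above.
    """
    sched = deque()   # scheduler queue of [start_block, length] requests
    inflight = []     # completion tick of each command held by the device
    merges = 0
    dispatched = 0
    tick = 0
    next_bio = 0
    while next_bio < num_bios or sched or inflight:
        # Retire commands whose service time has elapsed.
        inflight = [done for done in inflight if done > tick]

        # A new BIO arrives: merge it if a contiguous request is queued.
        if next_bio < num_bios:
            tail = sched[-1] if sched else None
            if tail and tail[0] + tail[1] == next_bio and tail[1] < max_merge:
                tail[1] += 1
                merges += 1
            else:
                sched.append([next_bio, 1])
            next_bio += 1

        # Dispatch while the device reports room; stop when it is saturated.
        while sched and len(inflight) < device_depth:
            sched.popleft()
            inflight.append(tick + service_ticks)
            dispatched += 1
        tick += 1
    return merges, dispatched


if __name__ == "__main__":
    # Device that never pushes back: requests leave the queue immediately.
    print("deep queue   :", simulate(64, device_depth=1024, service_ticks=4))
    # Device that saturates quickly: requests wait in the queue and merge.
    print("shallow queue:", simulate(64, device_depth=1, service_ticks=4))
```

In this model, a device that never saturates leaves the scheduler queue
empty whenever a new BIO arrives, so nothing is merged, while a device that
pushes back lets contiguous BIOs accumulate and merge before dispatch. A
real elevator also does front merges, sorting and plug merging, none of
which is modeled here.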

> 
> If you observe sequential IO on NVMe PCI, you will see no IO merge
> basically.
> 
>  
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research
