All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-10 19:20 ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-10 19:20 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-nvme

Background:

NVMe specification has hardened over the decade and now NVMe devices
are well integrated into our customers’ systems. As we look forward,
moving HDDs to the NVMe command set eliminates the SAS IOC and driver
stack, consolidating on a single access method for rotational and
static storage technologies. PCIe-NVMe offers near-SATA interface
costs, features and performance suitable for high-cap HDDs, and
optimal interoperability for storage automation, tiering, and
management. We will share some early conceptual results and proposed
salient design goals and challenges surrounding an NVMe HDD.


Discussion Proposal:

We’d like to share our views and solicit input on:

-What Linux storage stack assumptions do we need to be aware of as we
develop these devices with drastically different performance
characteristics than traditional NAND? For example, what schedular or
device driver level changes will be needed to integrate NVMe HDDs?

-Are there NVMe feature trade-offs that make sense for HDDs that won’t
break the HDD-SSD interoperability goals?

-How would upcoming multi-actuator HDDs impact NVMe?


Regards,
Tim Walker

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-10 19:20 ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-10 19:20 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-nvme

Background:

NVMe specification has hardened over the decade and now NVMe devices
are well integrated into our customers’ systems. As we look forward,
moving HDDs to the NVMe command set eliminates the SAS IOC and driver
stack, consolidating on a single access method for rotational and
static storage technologies. PCIe-NVMe offers near-SATA interface
costs, features and performance suitable for high-cap HDDs, and
optimal interoperability for storage automation, tiering, and
management. We will share some early conceptual results and proposed
salient design goals and challenges surrounding an NVMe HDD.


Discussion Proposal:

We’d like to share our views and solicit input on:

-What Linux storage stack assumptions do we need to be aware of as we
develop these devices with drastically different performance
characteristics than traditional NAND? For example, what schedular or
device driver level changes will be needed to integrate NVMe HDDs?

-Are there NVMe feature trade-offs that make sense for HDDs that won’t
break the HDD-SSD interoperability goals?

-How would upcoming multi-actuator HDDs impact NVMe?


Regards,
Tim Walker

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 19:20 ` Tim Walker
@ 2020-02-10 20:43   ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-10 20:43 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT
(non-rational, i.e. ssd), on all nvme namespace's request_queue flags. We
need the specification to define a capability bit or field associated
with the namespace to tell the driver otherwise, then we can propogate
that information up to the block layer.

Even without that, an otherwise spec compliant HDD should function as an
nvme device with existing software, but I would be interested to hear
additional ideas or feature gaps with other protocols that should be
considered in order to make an nvme hdd work well.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-10 20:43   ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-10 20:43 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-nvme, linux-scsi

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT
(non-rational, i.e. ssd), on all nvme namespace's request_queue flags. We
need the specification to define a capability bit or field associated
with the namespace to tell the driver otherwise, then we can propogate
that information up to the block layer.

Even without that, an otherwise spec compliant HDD should function as an
nvme device with existing software, but I would be interested to hear
additional ideas or feature gaps with other protocols that should be
considered in order to make an nvme hdd work well.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 20:43   ` Keith Busch
@ 2020-02-10 22:25     ` Finn Thain
  -1 siblings, 0 replies; 64+ messages in thread
From: Finn Thain @ 2020-02-10 22:25 UTC (permalink / raw)
  To: Keith Busch; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On Mon, 10 Feb 2020, Keith Busch wrote:

> Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT 
> (non-rational, i.e. ssd), on all nvme namespace's request_queue flags. 

I agree -- the standard nomenclature is not rational ;-) Air-cooled is not 
"solid state". Any round-robin algorithm is "rotational". No expensive 
array is a "R.A.I.D.". There's no "S.C.S.I." on a large system...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-10 22:25     ` Finn Thain
  0 siblings, 0 replies; 64+ messages in thread
From: Finn Thain @ 2020-02-10 22:25 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

On Mon, 10 Feb 2020, Keith Busch wrote:

> Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT 
> (non-rational, i.e. ssd), on all nvme namespace's request_queue flags. 

I agree -- the standard nomenclature is not rational ;-) Air-cooled is not 
"solid state". Any round-robin algorithm is "rotational". No expensive 
array is a "R.A.I.D.". There's no "S.C.S.I." on a large system...

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 19:20 ` Tim Walker
@ 2020-02-11 12:28   ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-11 12:28 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> Background:
> 
> NVMe specification has hardened over the decade and now NVMe devices
> are well integrated into our customers’ systems. As we look forward,
> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> stack, consolidating on a single access method for rotational and
> static storage technologies. PCIe-NVMe offers near-SATA interface
> costs, features and performance suitable for high-cap HDDs, and
> optimal interoperability for storage automation, tiering, and
> management. We will share some early conceptual results and proposed
> salient design goals and challenges surrounding an NVMe HDD.

HDD. performance is very sensitive to IO order. Could you provide some
background info about NVMe HDD? Such as:

- number of hw queues
- hw queue depth
- will NVMe sort/merge IO among all SQs or not?

> 
> 
> Discussion Proposal:
> 
> We’d like to share our views and solicit input on:
> 
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

IO merge is often important for HDD. IO merge is usually triggered when
.queue_rq() returns STS_RESOURCE, so far this condition won't be
triggered for NVMe SSD.

Also blk-mq kills BDI queue congestion and ioc batching, and causes
writeback performance regression[1][2].

What I am thinking is that if we need to switch to use independent IO
path for handling SSD and HDD. IO, given the two mediums are so
different from performance viewpoint.

[1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
[2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-11 12:28   ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-11 12:28 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-nvme, linux-scsi

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> Background:
> 
> NVMe specification has hardened over the decade and now NVMe devices
> are well integrated into our customers’ systems. As we look forward,
> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> stack, consolidating on a single access method for rotational and
> static storage technologies. PCIe-NVMe offers near-SATA interface
> costs, features and performance suitable for high-cap HDDs, and
> optimal interoperability for storage automation, tiering, and
> management. We will share some early conceptual results and proposed
> salient design goals and challenges surrounding an NVMe HDD.

HDD. performance is very sensitive to IO order. Could you provide some
background info about NVMe HDD? Such as:

- number of hw queues
- hw queue depth
- will NVMe sort/merge IO among all SQs or not?

> 
> 
> Discussion Proposal:
> 
> We’d like to share our views and solicit input on:
> 
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

IO merge is often important for HDD. IO merge is usually triggered when
.queue_rq() returns STS_RESOURCE, so far this condition won't be
triggered for NVMe SSD.

Also blk-mq kills BDI queue congestion and ioc batching, and causes
writeback performance regression[1][2].

What I am thinking is that if we need to switch to use independent IO
path for handling SSD and HDD. IO, given the two mediums are so
different from performance viewpoint.

[1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
[2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/


Thanks, 
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-11 12:28   ` Ming Lei
@ 2020-02-11 19:01     ` Tim Walker
  -1 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-11 19:01 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-scsi, linux-nvme

On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> > Background:
> >
> > NVMe specification has hardened over the decade and now NVMe devices
> > are well integrated into our customers’ systems. As we look forward,
> > moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> > stack, consolidating on a single access method for rotational and
> > static storage technologies. PCIe-NVMe offers near-SATA interface
> > costs, features and performance suitable for high-cap HDDs, and
> > optimal interoperability for storage automation, tiering, and
> > management. We will share some early conceptual results and proposed
> > salient design goals and challenges surrounding an NVMe HDD.
>
> HDD. performance is very sensitive to IO order. Could you provide some
> background info about NVMe HDD? Such as:
>
> - number of hw queues
> - hw queue depth
> - will NVMe sort/merge IO among all SQs or not?
>
> >
> >
> > Discussion Proposal:
> >
> > We’d like to share our views and solicit input on:
> >
> > -What Linux storage stack assumptions do we need to be aware of as we
> > develop these devices with drastically different performance
> > characteristics than traditional NAND? For example, what schedular or
> > device driver level changes will be needed to integrate NVMe HDDs?
>
> IO merge is often important for HDD. IO merge is usually triggered when
> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> triggered for NVMe SSD.
>
> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> writeback performance regression[1][2].
>
> What I am thinking is that if we need to switch to use independent IO
> path for handling SSD and HDD. IO, given the two mediums are so
> different from performance viewpoint.
>
> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>
>
> Thanks,
> Ming
>

I would expect the drive would support a reasonable number of queues
and a relatively deep queue depth, more in line with NVMe practices
than SAS HDD's typical 128. But it probably doesn't make sense to
queue up thousands of commands on something as slow as an HDD, and
many customers keep queues < 32 for latency management.

Merge and elevator are important to HDD performance. I don't believe
NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
within a SQ without driving large differences between SSD & HDD data
paths?

Thanks,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-11 19:01     ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-11 19:01 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-nvme, linux-scsi

On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> > Background:
> >
> > NVMe specification has hardened over the decade and now NVMe devices
> > are well integrated into our customers’ systems. As we look forward,
> > moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> > stack, consolidating on a single access method for rotational and
> > static storage technologies. PCIe-NVMe offers near-SATA interface
> > costs, features and performance suitable for high-cap HDDs, and
> > optimal interoperability for storage automation, tiering, and
> > management. We will share some early conceptual results and proposed
> > salient design goals and challenges surrounding an NVMe HDD.
>
> HDD. performance is very sensitive to IO order. Could you provide some
> background info about NVMe HDD? Such as:
>
> - number of hw queues
> - hw queue depth
> - will NVMe sort/merge IO among all SQs or not?
>
> >
> >
> > Discussion Proposal:
> >
> > We’d like to share our views and solicit input on:
> >
> > -What Linux storage stack assumptions do we need to be aware of as we
> > develop these devices with drastically different performance
> > characteristics than traditional NAND? For example, what schedular or
> > device driver level changes will be needed to integrate NVMe HDDs?
>
> IO merge is often important for HDD. IO merge is usually triggered when
> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> triggered for NVMe SSD.
>
> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> writeback performance regression[1][2].
>
> What I am thinking is that if we need to switch to use independent IO
> path for handling SSD and HDD. IO, given the two mediums are so
> different from performance viewpoint.
>
> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>
>
> Thanks,
> Ming
>

I would expect the drive would support a reasonable number of queues
and a relatively deep queue depth, more in line with NVMe practices
than SAS HDD's typical 128. But it probably doesn't make sense to
queue up thousands of commands on something as slow as an HDD, and
many customers keep queues < 32 for latency management.

Merge and elevator are important to HDD performance. I don't believe
NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
within a SQ without driving large differences between SSD & HDD data
paths?

Thanks,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-11 19:01     ` Tim Walker
@ 2020-02-12  1:47       ` Damien Le Moal
  -1 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-12  1:47 UTC (permalink / raw)
  To: Tim Walker, Ming Lei; +Cc: linux-block, linux-scsi, linux-nvme

On 2020/02/12 4:01, Tim Walker wrote:
> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>> Background:
>>>
>>> NVMe specification has hardened over the decade and now NVMe devices
>>> are well integrated into our customers’ systems. As we look forward,
>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>> stack, consolidating on a single access method for rotational and
>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>> costs, features and performance suitable for high-cap HDDs, and
>>> optimal interoperability for storage automation, tiering, and
>>> management. We will share some early conceptual results and proposed
>>> salient design goals and challenges surrounding an NVMe HDD.
>>
>> HDD. performance is very sensitive to IO order. Could you provide some
>> background info about NVMe HDD? Such as:
>>
>> - number of hw queues
>> - hw queue depth
>> - will NVMe sort/merge IO among all SQs or not?
>>
>>>
>>>
>>> Discussion Proposal:
>>>
>>> We’d like to share our views and solicit input on:
>>>
>>> -What Linux storage stack assumptions do we need to be aware of as we
>>> develop these devices with drastically different performance
>>> characteristics than traditional NAND? For example, what schedular or
>>> device driver level changes will be needed to integrate NVMe HDDs?
>>
>> IO merge is often important for HDD. IO merge is usually triggered when
>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>> triggered for NVMe SSD.
>>
>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>> writeback performance regression[1][2].
>>
>> What I am thinking is that if we need to switch to use independent IO
>> path for handling SSD and HDD. IO, given the two mediums are so
>> different from performance viewpoint.
>>
>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>
>>
>> Thanks,
>> Ming
>>
> 
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

Exposing an HDD through multiple-queues each with a high queue depth is
simply asking for troubles. Commands will end up spending so much time
sitting in the queues that they will timeout. This can already be observed
with the smartpqi SAS HBA which exposes single drives as multiqueue block
devices with high queue depth. Exercising these drives heavily leads to
thousands of commands being queued and to timeouts. It is fairly easy to
trigger this without a manual change to the QD. This is on my to-do list of
fixes for some time now (lacking time to do it).

NVMe HDDs need to have an interface setup that match their speed, that is,
something like a SAS interface: *single* queue pair with a max QD of 256 or
less depending on what the drive can take. Their is no TASK_SET_FULL
notification on NVMe, so throttling has to come from the max QD of the SQ,
which the drive will advertise to the host.

> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

As far as I know, there is no merging going on once requests are passed to
the driver and added to an SQ. So this is beside the point.
The current default block scheduler for NVMe SSDs is "none". This is
decided based on the number of queues of the device. For NVMe drives that
have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
request queue will can fallback to the default spinning rust mq-deadline
elevator. That will achieve command merging and LBA ordering needed for
good performance on HDDs.

NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
the identify data for all this to fit well in the current stack.

> 
> Thanks,
> -Tim
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-12  1:47       ` Damien Le Moal
  0 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-12  1:47 UTC (permalink / raw)
  To: Tim Walker, Ming Lei; +Cc: linux-block, linux-nvme, linux-scsi

On 2020/02/12 4:01, Tim Walker wrote:
> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>> Background:
>>>
>>> NVMe specification has hardened over the decade and now NVMe devices
>>> are well integrated into our customers’ systems. As we look forward,
>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>> stack, consolidating on a single access method for rotational and
>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>> costs, features and performance suitable for high-cap HDDs, and
>>> optimal interoperability for storage automation, tiering, and
>>> management. We will share some early conceptual results and proposed
>>> salient design goals and challenges surrounding an NVMe HDD.
>>
>> HDD. performance is very sensitive to IO order. Could you provide some
>> background info about NVMe HDD? Such as:
>>
>> - number of hw queues
>> - hw queue depth
>> - will NVMe sort/merge IO among all SQs or not?
>>
>>>
>>>
>>> Discussion Proposal:
>>>
>>> We’d like to share our views and solicit input on:
>>>
>>> -What Linux storage stack assumptions do we need to be aware of as we
>>> develop these devices with drastically different performance
>>> characteristics than traditional NAND? For example, what schedular or
>>> device driver level changes will be needed to integrate NVMe HDDs?
>>
>> IO merge is often important for HDD. IO merge is usually triggered when
>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>> triggered for NVMe SSD.
>>
>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>> writeback performance regression[1][2].
>>
>> What I am thinking is that if we need to switch to use independent IO
>> path for handling SSD and HDD. IO, given the two mediums are so
>> different from performance viewpoint.
>>
>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>
>>
>> Thanks,
>> Ming
>>
> 
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

Exposing an HDD through multiple-queues each with a high queue depth is
simply asking for troubles. Commands will end up spending so much time
sitting in the queues that they will timeout. This can already be observed
with the smartpqi SAS HBA which exposes single drives as multiqueue block
devices with high queue depth. Exercising these drives heavily leads to
thousands of commands being queued and to timeouts. It is fairly easy to
trigger this without a manual change to the QD. This is on my to-do list of
fixes for some time now (lacking time to do it).

NVMe HDDs need to have an interface setup that match their speed, that is,
something like a SAS interface: *single* queue pair with a max QD of 256 or
less depending on what the drive can take. Their is no TASK_SET_FULL
notification on NVMe, so throttling has to come from the max QD of the SQ,
which the drive will advertise to the host.

> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

As far as I know, there is no merging going on once requests are passed to
the driver and added to an SQ. So this is beside the point.
The current default block scheduler for NVMe SSDs is "none". This is
decided based on the number of queues of the device. For NVMe drives that
have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
request queue will can fallback to the default spinning rust mq-deadline
elevator. That will achieve command merging and LBA ordering needed for
good performance on HDDs.

NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
the identify data for all this to fit well in the current stack.

> 
> Thanks,
> -Tim
> 


-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-11 19:01     ` Tim Walker
@ 2020-02-12 21:52       ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-12 21:52 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme

On Tue, Feb 11, 2020 at 02:01:18PM -0500, Tim Walker wrote:
> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> > > Background:
> > >
> > > NVMe specification has hardened over the decade and now NVMe devices
> > > are well integrated into our customers’ systems. As we look forward,
> > > moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> > > stack, consolidating on a single access method for rotational and
> > > static storage technologies. PCIe-NVMe offers near-SATA interface
> > > costs, features and performance suitable for high-cap HDDs, and
> > > optimal interoperability for storage automation, tiering, and
> > > management. We will share some early conceptual results and proposed
> > > salient design goals and challenges surrounding an NVMe HDD.
> >
> > HDD. performance is very sensitive to IO order. Could you provide some
> > background info about NVMe HDD? Such as:
> >
> > - number of hw queues
> > - hw queue depth
> > - will NVMe sort/merge IO among all SQs or not?
> >
> > >
> > >
> > > Discussion Proposal:
> > >
> > > We’d like to share our views and solicit input on:
> > >
> > > -What Linux storage stack assumptions do we need to be aware of as we
> > > develop these devices with drastically different performance
> > > characteristics than traditional NAND? For example, what schedular or
> > > device driver level changes will be needed to integrate NVMe HDDs?
> >
> > IO merge is often important for HDD. IO merge is usually triggered when
> > .queue_rq() returns STS_RESOURCE, so far this condition won't be
> > triggered for NVMe SSD.
> >
> > Also blk-mq kills BDI queue congestion and ioc batching, and causes
> > writeback performance regression[1][2].
> >
> > What I am thinking is that if we need to switch to use independent IO
> > path for handling SSD and HDD. IO, given the two mediums are so
> > different from performance viewpoint.
> >
> > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> > [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >
> >
> > Thanks,
> > Ming
> >
> 
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

MQ & deep queue depth will cause trouble for HDD., as Damien mentioned, 
IO timeout may be caused. Then looks you need to add per-ns queue depth,
just like what sdev->device_busy does for avoiding IO timeout. On the
other hand, with per-ns queue depth, you may prevent IO submitted to NVMe
when this ns is saturated, then block layer's IO merge can be triggered.

> 
> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

If NVMe doesn't sort/merge across SQs, it should be better to just use
single queue for HDD. Otherwise, it is easy to break IO order & merge.

Even someone complains that sequential IO becomes dis-continuous on
NVMe(SSD) when arbitration burst is less than IO queue depth. It is said
fio performance is hurt, but I don't understand how that can happen on
SSD.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-12 21:52       ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-12 21:52 UTC (permalink / raw)
  To: Tim Walker; +Cc: linux-block, linux-nvme, linux-scsi

On Tue, Feb 11, 2020 at 02:01:18PM -0500, Tim Walker wrote:
> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> > > Background:
> > >
> > > NVMe specification has hardened over the decade and now NVMe devices
> > > are well integrated into our customers’ systems. As we look forward,
> > > moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> > > stack, consolidating on a single access method for rotational and
> > > static storage technologies. PCIe-NVMe offers near-SATA interface
> > > costs, features and performance suitable for high-cap HDDs, and
> > > optimal interoperability for storage automation, tiering, and
> > > management. We will share some early conceptual results and proposed
> > > salient design goals and challenges surrounding an NVMe HDD.
> >
> > HDD. performance is very sensitive to IO order. Could you provide some
> > background info about NVMe HDD? Such as:
> >
> > - number of hw queues
> > - hw queue depth
> > - will NVMe sort/merge IO among all SQs or not?
> >
> > >
> > >
> > > Discussion Proposal:
> > >
> > > We’d like to share our views and solicit input on:
> > >
> > > -What Linux storage stack assumptions do we need to be aware of as we
> > > develop these devices with drastically different performance
> > > characteristics than traditional NAND? For example, what schedular or
> > > device driver level changes will be needed to integrate NVMe HDDs?
> >
> > IO merge is often important for HDD. IO merge is usually triggered when
> > .queue_rq() returns STS_RESOURCE, so far this condition won't be
> > triggered for NVMe SSD.
> >
> > Also blk-mq kills BDI queue congestion and ioc batching, and causes
> > writeback performance regression[1][2].
> >
> > What I am thinking is that if we need to switch to use independent IO
> > path for handling SSD and HDD. IO, given the two mediums are so
> > different from performance viewpoint.
> >
> > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> > [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >
> >
> > Thanks,
> > Ming
> >
> 
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

MQ & deep queue depth will cause trouble for HDD., as Damien mentioned, 
IO timeout may be caused. Then looks you need to add per-ns queue depth,
just like what sdev->device_busy does for avoiding IO timeout. On the
other hand, with per-ns queue depth, you may prevent IO submitted to NVMe
when this ns is saturated, then block layer's IO merge can be triggered.

> 
> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

If NVMe doesn't sort/merge across SQs, it should be better to just use
single queue for HDD. Otherwise, it is easy to break IO order & merge.

Even someone complains that sequential IO becomes dis-continuous on
NVMe(SSD) when arbitration burst is less than IO queue depth. It is said
fio performance is hurt, but I don't understand how that can happen on
SSD.


Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-12  1:47       ` Damien Le Moal
@ 2020-02-12 22:03         ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-12 22:03 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> On 2020/02/12 4:01, Tim Walker wrote:
> > On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>> Background:
> >>>
> >>> NVMe specification has hardened over the decade and now NVMe devices
> >>> are well integrated into our customers’ systems. As we look forward,
> >>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>> stack, consolidating on a single access method for rotational and
> >>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>> costs, features and performance suitable for high-cap HDDs, and
> >>> optimal interoperability for storage automation, tiering, and
> >>> management. We will share some early conceptual results and proposed
> >>> salient design goals and challenges surrounding an NVMe HDD.
> >>
> >> HDD. performance is very sensitive to IO order. Could you provide some
> >> background info about NVMe HDD? Such as:
> >>
> >> - number of hw queues
> >> - hw queue depth
> >> - will NVMe sort/merge IO among all SQs or not?
> >>
> >>>
> >>>
> >>> Discussion Proposal:
> >>>
> >>> We’d like to share our views and solicit input on:
> >>>
> >>> -What Linux storage stack assumptions do we need to be aware of as we
> >>> develop these devices with drastically different performance
> >>> characteristics than traditional NAND? For example, what schedular or
> >>> device driver level changes will be needed to integrate NVMe HDDs?
> >>
> >> IO merge is often important for HDD. IO merge is usually triggered when
> >> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >> triggered for NVMe SSD.
> >>
> >> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >> writeback performance regression[1][2].
> >>
> >> What I am thinking is that if we need to switch to use independent IO
> >> path for handling SSD and HDD. IO, given the two mediums are so
> >> different from performance viewpoint.
> >>
> >> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>
> >>
> >> Thanks,
> >> Ming
> >>
> > 
> > I would expect the drive would support a reasonable number of queues
> > and a relatively deep queue depth, more in line with NVMe practices
> > than SAS HDD's typical 128. But it probably doesn't make sense to
> > queue up thousands of commands on something as slow as an HDD, and
> > many customers keep queues < 32 for latency management.
> 
> Exposing an HDD through multiple-queues each with a high queue depth is
> simply asking for troubles. Commands will end up spending so much time
> sitting in the queues that they will timeout. This can already be observed
> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> devices with high queue depth. Exercising these drives heavily leads to
> thousands of commands being queued and to timeouts. It is fairly easy to
> trigger this without a manual change to the QD. This is on my to-do list of
> fixes for some time now (lacking time to do it).

Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
avoiding the issue, looks the driver simply assigns .can_queue to it,
then it isn't strange to see the timeout issue. If .can_queue is a bit
big, HDD. is easily saturated too long.

> 
> NVMe HDDs need to have an interface setup that match their speed, that is,
> something like a SAS interface: *single* queue pair with a max QD of 256 or
> less depending on what the drive can take. Their is no TASK_SET_FULL
> notification on NVMe, so throttling has to come from the max QD of the SQ,
> which the drive will advertise to the host.
> 
> > Merge and elevator are important to HDD performance. I don't believe
> > NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> > within a SQ without driving large differences between SSD & HDD data
> > paths?
> 
> As far as I know, there is no merging going on once requests are passed to
> the driver and added to an SQ. So this is beside the point.
> The current default block scheduler for NVMe SSDs is "none". This is
> decided based on the number of queues of the device. For NVMe drives that
> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> request queue will can fallback to the default spinning rust mq-deadline
> elevator. That will achieve command merging and LBA ordering needed for
> good performance on HDDs.

mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
.queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
.queue_rq() basically always returns STS_OK.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-12 22:03         ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-12 22:03 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> On 2020/02/12 4:01, Tim Walker wrote:
> > On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>> Background:
> >>>
> >>> NVMe specification has hardened over the decade and now NVMe devices
> >>> are well integrated into our customers’ systems. As we look forward,
> >>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>> stack, consolidating on a single access method for rotational and
> >>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>> costs, features and performance suitable for high-cap HDDs, and
> >>> optimal interoperability for storage automation, tiering, and
> >>> management. We will share some early conceptual results and proposed
> >>> salient design goals and challenges surrounding an NVMe HDD.
> >>
> >> HDD. performance is very sensitive to IO order. Could you provide some
> >> background info about NVMe HDD? Such as:
> >>
> >> - number of hw queues
> >> - hw queue depth
> >> - will NVMe sort/merge IO among all SQs or not?
> >>
> >>>
> >>>
> >>> Discussion Proposal:
> >>>
> >>> We’d like to share our views and solicit input on:
> >>>
> >>> -What Linux storage stack assumptions do we need to be aware of as we
> >>> develop these devices with drastically different performance
> >>> characteristics than traditional NAND? For example, what schedular or
> >>> device driver level changes will be needed to integrate NVMe HDDs?
> >>
> >> IO merge is often important for HDD. IO merge is usually triggered when
> >> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >> triggered for NVMe SSD.
> >>
> >> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >> writeback performance regression[1][2].
> >>
> >> What I am thinking is that if we need to switch to use independent IO
> >> path for handling SSD and HDD. IO, given the two mediums are so
> >> different from performance viewpoint.
> >>
> >> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>
> >>
> >> Thanks,
> >> Ming
> >>
> > 
> > I would expect the drive would support a reasonable number of queues
> > and a relatively deep queue depth, more in line with NVMe practices
> > than SAS HDD's typical 128. But it probably doesn't make sense to
> > queue up thousands of commands on something as slow as an HDD, and
> > many customers keep queues < 32 for latency management.
> 
> Exposing an HDD through multiple-queues each with a high queue depth is
> simply asking for troubles. Commands will end up spending so much time
> sitting in the queues that they will timeout. This can already be observed
> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> devices with high queue depth. Exercising these drives heavily leads to
> thousands of commands being queued and to timeouts. It is fairly easy to
> trigger this without a manual change to the QD. This is on my to-do list of
> fixes for some time now (lacking time to do it).

Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
avoiding the issue, looks the driver simply assigns .can_queue to it,
then it isn't strange to see the timeout issue. If .can_queue is a bit
big, HDD. is easily saturated too long.

> 
> NVMe HDDs need to have an interface setup that match their speed, that is,
> something like a SAS interface: *single* queue pair with a max QD of 256 or
> less depending on what the drive can take. Their is no TASK_SET_FULL
> notification on NVMe, so throttling has to come from the max QD of the SQ,
> which the drive will advertise to the host.
> 
> > Merge and elevator are important to HDD performance. I don't believe
> > NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> > within a SQ without driving large differences between SSD & HDD data
> > paths?
> 
> As far as I know, there is no merging going on once requests are passed to
> the driver and added to an SQ. So this is beside the point.
> The current default block scheduler for NVMe SSDs is "none". This is
> decided based on the number of queues of the device. For NVMe drives that
> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> request queue will can fallback to the default spinning rust mq-deadline
> elevator. That will achieve command merging and LBA ordering needed for
> good performance on HDDs.

mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
.queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
.queue_rq() basically always returns STS_OK.


Thanks, 
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-12 22:03         ` Ming Lei
@ 2020-02-13  2:40           ` Damien Le Moal
  -1 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-13  2:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

Ming,

On 2020/02/13 7:03, Ming Lei wrote:
> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>> On 2020/02/12 4:01, Tim Walker wrote:
>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>
>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>> Background:
>>>>>
>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>> are well integrated into our customers’ systems. As we look forward,
>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>> stack, consolidating on a single access method for rotational and
>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>> optimal interoperability for storage automation, tiering, and
>>>>> management. We will share some early conceptual results and proposed
>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>
>>>> HDD. performance is very sensitive to IO order. Could you provide some
>>>> background info about NVMe HDD? Such as:
>>>>
>>>> - number of hw queues
>>>> - hw queue depth
>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>
>>>>>
>>>>>
>>>>> Discussion Proposal:
>>>>>
>>>>> We’d like to share our views and solicit input on:
>>>>>
>>>>> -What Linux storage stack assumptions do we need to be aware of as we
>>>>> develop these devices with drastically different performance
>>>>> characteristics than traditional NAND? For example, what schedular or
>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>
>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>>>> triggered for NVMe SSD.
>>>>
>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>> writeback performance regression[1][2].
>>>>
>>>> What I am thinking is that if we need to switch to use independent IO
>>>> path for handling SSD and HDD. IO, given the two mediums are so
>>>> different from performance viewpoint.
>>>>
>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>>>
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> I would expect the drive would support a reasonable number of queues
>>> and a relatively deep queue depth, more in line with NVMe practices
>>> than SAS HDD's typical 128. But it probably doesn't make sense to
>>> queue up thousands of commands on something as slow as an HDD, and
>>> many customers keep queues < 32 for latency management.
>>
>> Exposing an HDD through multiple-queues each with a high queue depth is
>> simply asking for troubles. Commands will end up spending so much time
>> sitting in the queues that they will timeout. This can already be observed
>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
>> devices with high queue depth. Exercising these drives heavily leads to
>> thousands of commands being queued and to timeouts. It is fairly easy to
>> trigger this without a manual change to the QD. This is on my to-do list of
>> fixes for some time now (lacking time to do it).
> 
> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> avoiding the issue, looks the driver simply assigns .can_queue to it,
> then it isn't strange to see the timeout issue. If .can_queue is a bit
> big, HDD. is easily saturated too long.
> 
>>
>> NVMe HDDs need to have an interface setup that match their speed, that is,
>> something like a SAS interface: *single* queue pair with a max QD of 256 or
>> less depending on what the drive can take. Their is no TASK_SET_FULL
>> notification on NVMe, so throttling has to come from the max QD of the SQ,
>> which the drive will advertise to the host.
>>
>>> Merge and elevator are important to HDD performance. I don't believe
>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>> within a SQ without driving large differences between SSD & HDD data
>>> paths?
>>
>> As far as I know, there is no merging going on once requests are passed to
>> the driver and added to an SQ. So this is beside the point.
>> The current default block scheduler for NVMe SSDs is "none". This is
>> decided based on the number of queues of the device. For NVMe drives that
>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
>> request queue will can fallback to the default spinning rust mq-deadline
>> elevator. That will achieve command merging and LBA ordering needed for
>> good performance on HDDs.
> 
> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> .queue_rq() basically always returns STS_OK.

I am confused: when an elevator is set, ->queue_rq() is called for requests
obtained from the elevator (with e->type->ops.dispatch_request()), after
the requests went through it. And merging will happen at that stage when
new requests are inserted in the elevator.

If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
request is indeed requeued which offer more chances of further merging, but
that is not the same as no merging happening.
Am I missing your point here ?

> 
> 
> Thanks, 
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  2:40           ` Damien Le Moal
  0 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-13  2:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

Ming,

On 2020/02/13 7:03, Ming Lei wrote:
> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>> On 2020/02/12 4:01, Tim Walker wrote:
>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>
>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>> Background:
>>>>>
>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>> are well integrated into our customers’ systems. As we look forward,
>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>> stack, consolidating on a single access method for rotational and
>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>> optimal interoperability for storage automation, tiering, and
>>>>> management. We will share some early conceptual results and proposed
>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>
>>>> HDD. performance is very sensitive to IO order. Could you provide some
>>>> background info about NVMe HDD? Such as:
>>>>
>>>> - number of hw queues
>>>> - hw queue depth
>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>
>>>>>
>>>>>
>>>>> Discussion Proposal:
>>>>>
>>>>> We’d like to share our views and solicit input on:
>>>>>
>>>>> -What Linux storage stack assumptions do we need to be aware of as we
>>>>> develop these devices with drastically different performance
>>>>> characteristics than traditional NAND? For example, what schedular or
>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>
>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>>>> triggered for NVMe SSD.
>>>>
>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>> writeback performance regression[1][2].
>>>>
>>>> What I am thinking is that if we need to switch to use independent IO
>>>> path for handling SSD and HDD. IO, given the two mediums are so
>>>> different from performance viewpoint.
>>>>
>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>>>
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> I would expect the drive would support a reasonable number of queues
>>> and a relatively deep queue depth, more in line with NVMe practices
>>> than SAS HDD's typical 128. But it probably doesn't make sense to
>>> queue up thousands of commands on something as slow as an HDD, and
>>> many customers keep queues < 32 for latency management.
>>
>> Exposing an HDD through multiple-queues each with a high queue depth is
>> simply asking for troubles. Commands will end up spending so much time
>> sitting in the queues that they will timeout. This can already be observed
>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
>> devices with high queue depth. Exercising these drives heavily leads to
>> thousands of commands being queued and to timeouts. It is fairly easy to
>> trigger this without a manual change to the QD. This is on my to-do list of
>> fixes for some time now (lacking time to do it).
> 
> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> avoiding the issue, looks the driver simply assigns .can_queue to it,
> then it isn't strange to see the timeout issue. If .can_queue is a bit
> big, HDD. is easily saturated too long.
> 
>>
>> NVMe HDDs need to have an interface setup that match their speed, that is,
>> something like a SAS interface: *single* queue pair with a max QD of 256 or
>> less depending on what the drive can take. Their is no TASK_SET_FULL
>> notification on NVMe, so throttling has to come from the max QD of the SQ,
>> which the drive will advertise to the host.
>>
>>> Merge and elevator are important to HDD performance. I don't believe
>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>> within a SQ without driving large differences between SSD & HDD data
>>> paths?
>>
>> As far as I know, there is no merging going on once requests are passed to
>> the driver and added to an SQ. So this is beside the point.
>> The current default block scheduler for NVMe SSDs is "none". This is
>> decided based on the number of queues of the device. For NVMe drives that
>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
>> request queue will can fallback to the default spinning rust mq-deadline
>> elevator. That will achieve command merging and LBA ordering needed for
>> good performance on HDDs.
> 
> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> .queue_rq() basically always returns STS_OK.

I am confused: when an elevator is set, ->queue_rq() is called for requests
obtained from the elevator (with e->type->ops.dispatch_request()), after
the requests went through it. And merging will happen at that stage when
new requests are inserted in the elevator.

If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
request is indeed requeued which offer more chances of further merging, but
that is not the same as no merging happening.
Am I missing your point here ?

> 
> 
> Thanks, 
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-12  1:47       ` Damien Le Moal
@ 2020-02-13  3:02         ` Martin K. Petersen
  -1 siblings, 0 replies; 64+ messages in thread
From: Martin K. Petersen @ 2020-02-13  3:02 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Tim Walker, Ming Lei, linux-block, linux-scsi, linux-nvme


Damien,

> Exposing an HDD through multiple-queues each with a high queue depth
> is simply asking for troubles. Commands will end up spending so much
> time sitting in the queues that they will timeout.

Yep!

> This can already be observed with the smartpqi SAS HBA which exposes
> single drives as multiqueue block devices with high queue depth.
> Exercising these drives heavily leads to thousands of commands being
> queued and to timeouts. It is fairly easy to trigger this without a
> manual change to the QD. This is on my to-do list of fixes for some
> time now (lacking time to do it).

Controllers that queue internally are very susceptible to application or
filesystem timeouts when drives are struggling to keep up.

> NVMe HDDs need to have an interface setup that match their speed, that
> is, something like a SAS interface: *single* queue pair with a max QD
> of 256 or less depending on what the drive can take. Their is no
> TASK_SET_FULL notification on NVMe, so throttling has to come from the
> max QD of the SQ, which the drive will advertise to the host.

At the very minimum we'll need low queue depths. But I have my doubts
whether we can make this work well enough without some kind of TASK SET
FULL style AER to throttle the I/O.

> NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
> the identify data for all this to fit well in the current stack.

Absolutely.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  3:02         ` Martin K. Petersen
  0 siblings, 0 replies; 64+ messages in thread
From: Martin K. Petersen @ 2020-02-13  3:02 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi, Ming Lei


Damien,

> Exposing an HDD through multiple-queues each with a high queue depth
> is simply asking for troubles. Commands will end up spending so much
> time sitting in the queues that they will timeout.

Yep!

> This can already be observed with the smartpqi SAS HBA which exposes
> single drives as multiqueue block devices with high queue depth.
> Exercising these drives heavily leads to thousands of commands being
> queued and to timeouts. It is fairly easy to trigger this without a
> manual change to the QD. This is on my to-do list of fixes for some
> time now (lacking time to do it).

Controllers that queue internally are very susceptible to application or
filesystem timeouts when drives are struggling to keep up.

> NVMe HDDs need to have an interface setup that match their speed, that
> is, something like a SAS interface: *single* queue pair with a max QD
> of 256 or less depending on what the drive can take. Their is no
> TASK_SET_FULL notification on NVMe, so throttling has to come from the
> max QD of the SQ, which the drive will advertise to the host.

At the very minimum we'll need low queue depths. But I have my doubts
whether we can make this work well enough without some kind of TASK SET
FULL style AER to throttle the I/O.

> NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
> the identify data for all this to fit well in the current stack.

Absolutely.

-- 
Martin K. Petersen	Oracle Linux Engineering

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  3:02         ` Martin K. Petersen
@ 2020-02-13  3:12           ` Tim Walker
  -1 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-13  3:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme

On Wed, Feb 12, 2020 at 10:02 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Damien,
>
> > Exposing an HDD through multiple-queues each with a high queue depth
> > is simply asking for troubles. Commands will end up spending so much
> > time sitting in the queues that they will timeout.
>
> Yep!
>
> > This can already be observed with the smartpqi SAS HBA which exposes
> > single drives as multiqueue block devices with high queue depth.
> > Exercising these drives heavily leads to thousands of commands being
> > queued and to timeouts. It is fairly easy to trigger this without a
> > manual change to the QD. This is on my to-do list of fixes for some
> > time now (lacking time to do it).
>
> Controllers that queue internally are very susceptible to application or
> filesystem timeouts when drives are struggling to keep up.
>
> > NVMe HDDs need to have an interface setup that match their speed, that
> > is, something like a SAS interface: *single* queue pair with a max QD
> > of 256 or less depending on what the drive can take. Their is no
> > TASK_SET_FULL notification on NVMe, so throttling has to come from the
> > max QD of the SQ, which the drive will advertise to the host.
>
> At the very minimum we'll need low queue depths. But I have my doubts
> whether we can make this work well enough without some kind of TASK SET
> FULL style AER to throttle the I/O.
>
> > NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
> > the identify data for all this to fit well in the current stack.
>
> Absolutely.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
Hi all-

We already anticipated the need for the "spinning rust" bit, so it is
already in place (on paper, at least).

SAS currently supports QD256, but the general consensus is that most
customers don't run anywhere near that deep. Does it help the system
for the HD to report a limited (256) max queue depth, or is it really
up to the system to decide many commands to queue?

Regarding number of SQ pairs, I think HDD would function well with
only one. Some thoughts on why we would want >1:
-A priority-based SQ servicing algorithm that would permit
low-priority commands to be queued in a dedicated SQ.
-The host may want an SQ per actuator for multi-actuator devices.
There may be others that I haven't thought of, but you get the idea.
At any rate, the drive can support as many queue-pairs as it wants to
- we can use as few as makes sense.

Since NVMe doesn't guarantee command execution order, it seems the
zoned block version of an NVME HDD would need to support zone append.
Do you agree?

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  3:12           ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-13  3:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: linux-block, Damien Le Moal, linux-nvme, linux-scsi, Ming Lei

On Wed, Feb 12, 2020 at 10:02 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Damien,
>
> > Exposing an HDD through multiple-queues each with a high queue depth
> > is simply asking for troubles. Commands will end up spending so much
> > time sitting in the queues that they will timeout.
>
> Yep!
>
> > This can already be observed with the smartpqi SAS HBA which exposes
> > single drives as multiqueue block devices with high queue depth.
> > Exercising these drives heavily leads to thousands of commands being
> > queued and to timeouts. It is fairly easy to trigger this without a
> > manual change to the QD. This is on my to-do list of fixes for some
> > time now (lacking time to do it).
>
> Controllers that queue internally are very susceptible to application or
> filesystem timeouts when drives are struggling to keep up.
>
> > NVMe HDDs need to have an interface setup that match their speed, that
> > is, something like a SAS interface: *single* queue pair with a max QD
> > of 256 or less depending on what the drive can take. Their is no
> > TASK_SET_FULL notification on NVMe, so throttling has to come from the
> > max QD of the SQ, which the drive will advertise to the host.
>
> At the very minimum we'll need low queue depths. But I have my doubts
> whether we can make this work well enough without some kind of TASK SET
> FULL style AER to throttle the I/O.
>
> > NVMe specs will need an update to have a "NONROT" (non-rotational) bit in
> > the identify data for all this to fit well in the current stack.
>
> Absolutely.
>
> --
> Martin K. Petersen      Oracle Linux Engineering
Hi all-

We already anticipated the need for the "spinning rust" bit, so it is
already in place (on paper, at least).

SAS currently supports QD256, but the general consensus is that most
customers don't run anywhere near that deep. Does it help the system
for the HD to report a limited (256) max queue depth, or is it really
up to the system to decide many commands to queue?

Regarding number of SQ pairs, I think HDD would function well with
only one. Some thoughts on why we would want >1:
-A priority-based SQ servicing algorithm that would permit
low-priority commands to be queued in a dedicated SQ.
-The host may want an SQ per actuator for multi-actuator devices.
There may be others that I haven't thought of, but you get the idea.
At any rate, the drive can support as many queue-pairs as it wants to
- we can use as few as makes sense.

Since NVMe doesn't guarantee command execution order, it seems the
zoned block version of an NVME HDD would need to support zone append.
Do you agree?

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  3:12           ` Tim Walker
@ 2020-02-13  4:17             ` Martin K. Petersen
  -1 siblings, 0 replies; 64+ messages in thread
From: Martin K. Petersen @ 2020-02-13  4:17 UTC (permalink / raw)
  To: Tim Walker
  Cc: Martin K. Petersen, Damien Le Moal, Ming Lei, linux-block,
	linux-scsi, linux-nvme


Tim,

> SAS currently supports QD256, but the general consensus is that most
> customers don't run anywhere near that deep. Does it help the system
> for the HD to report a limited (256) max queue depth, or is it really
> up to the system to decide many commands to queue?

People often artificially lower the queue depth to avoid timeouts. The
default timeout is 30 seconds from an I/O is queued. However, many
enterprise applications set the timeout to 3-5 seconds. Which means that
with deep queues you'll quickly start seeing timeouts if a drive
temporarily is having issues keeping up (media errors, excessive spare
track seeks, etc.).

Well-behaved devices will return QF/TSF if they have transient resource
starvation or exceed internal QoS limits. QF will cause the SCSI stack
to reduce the number of I/Os in flight. This allows the drive to recover
from its congested state and reduces the potential of application and
filesystem timeouts.

> Regarding number of SQ pairs, I think HDD would function well with
> only one. Some thoughts on why we would want >1:

> -A priority-based SQ servicing algorithm that would permit
> low-priority commands to be queued in a dedicated SQ.
> -The host may want an SQ per actuator for multi-actuator devices.

That's fine. I think we're just saying that the common practice of
allocating very deep queues for each CPU core in the system will lead to
problems since the host will inevitably be able to queue much more I/O
than the drive can realistically complete.

> Since NVMe doesn't guarantee command execution order, it seems the
> zoned block version of an NVME HDD would need to support zone append.
> Do you agree?

Absolutely!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  4:17             ` Martin K. Petersen
  0 siblings, 0 replies; 64+ messages in thread
From: Martin K. Petersen @ 2020-02-13  4:17 UTC (permalink / raw)
  To: Tim Walker
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, linux-nvme,
	Ming Lei, linux-block


Tim,

> SAS currently supports QD256, but the general consensus is that most
> customers don't run anywhere near that deep. Does it help the system
> for the HD to report a limited (256) max queue depth, or is it really
> up to the system to decide many commands to queue?

People often artificially lower the queue depth to avoid timeouts. The
default timeout is 30 seconds from an I/O is queued. However, many
enterprise applications set the timeout to 3-5 seconds. Which means that
with deep queues you'll quickly start seeing timeouts if a drive
temporarily is having issues keeping up (media errors, excessive spare
track seeks, etc.).

Well-behaved devices will return QF/TSF if they have transient resource
starvation or exceed internal QoS limits. QF will cause the SCSI stack
to reduce the number of I/Os in flight. This allows the drive to recover
from its congested state and reduces the potential of application and
filesystem timeouts.

> Regarding number of SQ pairs, I think HDD would function well with
> only one. Some thoughts on why we would want >1:

> -A priority-based SQ servicing algorithm that would permit
> low-priority commands to be queued in a dedicated SQ.
> -The host may want an SQ per actuator for multi-actuator devices.

That's fine. I think we're just saying that the common practice of
allocating very deep queues for each CPU core in the system will lead to
problems since the host will inevitably be able to queue much more I/O
than the drive can realistically complete.

> Since NVMe doesn't guarantee command execution order, it seems the
> zoned block version of an NVME HDD would need to support zone append.
> Do you agree?

Absolutely!

-- 
Martin K. Petersen	Oracle Linux Engineering

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  2:40           ` Damien Le Moal
@ 2020-02-13  7:53             ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-13  7:53 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
> Ming,
> 
> On 2020/02/13 7:03, Ming Lei wrote:
> > On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> >> On 2020/02/12 4:01, Tim Walker wrote:
> >>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>
> >>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>>>> Background:
> >>>>>
> >>>>> NVMe specification has hardened over the decade and now NVMe devices
> >>>>> are well integrated into our customers’ systems. As we look forward,
> >>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>>>> stack, consolidating on a single access method for rotational and
> >>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>>>> costs, features and performance suitable for high-cap HDDs, and
> >>>>> optimal interoperability for storage automation, tiering, and
> >>>>> management. We will share some early conceptual results and proposed
> >>>>> salient design goals and challenges surrounding an NVMe HDD.
> >>>>
> >>>> HDD. performance is very sensitive to IO order. Could you provide some
> >>>> background info about NVMe HDD? Such as:
> >>>>
> >>>> - number of hw queues
> >>>> - hw queue depth
> >>>> - will NVMe sort/merge IO among all SQs or not?
> >>>>
> >>>>>
> >>>>>
> >>>>> Discussion Proposal:
> >>>>>
> >>>>> We’d like to share our views and solicit input on:
> >>>>>
> >>>>> -What Linux storage stack assumptions do we need to be aware of as we
> >>>>> develop these devices with drastically different performance
> >>>>> characteristics than traditional NAND? For example, what schedular or
> >>>>> device driver level changes will be needed to integrate NVMe HDDs?
> >>>>
> >>>> IO merge is often important for HDD. IO merge is usually triggered when
> >>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >>>> triggered for NVMe SSD.
> >>>>
> >>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >>>> writeback performance regression[1][2].
> >>>>
> >>>> What I am thinking is that if we need to switch to use independent IO
> >>>> path for handling SSD and HDD. IO, given the two mediums are so
> >>>> different from performance viewpoint.
> >>>>
> >>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Ming
> >>>>
> >>>
> >>> I would expect the drive would support a reasonable number of queues
> >>> and a relatively deep queue depth, more in line with NVMe practices
> >>> than SAS HDD's typical 128. But it probably doesn't make sense to
> >>> queue up thousands of commands on something as slow as an HDD, and
> >>> many customers keep queues < 32 for latency management.
> >>
> >> Exposing an HDD through multiple-queues each with a high queue depth is
> >> simply asking for troubles. Commands will end up spending so much time
> >> sitting in the queues that they will timeout. This can already be observed
> >> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> >> devices with high queue depth. Exercising these drives heavily leads to
> >> thousands of commands being queued and to timeouts. It is fairly easy to
> >> trigger this without a manual change to the QD. This is on my to-do list of
> >> fixes for some time now (lacking time to do it).
> > 
> > Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> > avoiding the issue, looks the driver simply assigns .can_queue to it,
> > then it isn't strange to see the timeout issue. If .can_queue is a bit
> > big, HDD. is easily saturated too long.
> > 
> >>
> >> NVMe HDDs need to have an interface setup that match their speed, that is,
> >> something like a SAS interface: *single* queue pair with a max QD of 256 or
> >> less depending on what the drive can take. Their is no TASK_SET_FULL
> >> notification on NVMe, so throttling has to come from the max QD of the SQ,
> >> which the drive will advertise to the host.
> >>
> >>> Merge and elevator are important to HDD performance. I don't believe
> >>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> >>> within a SQ without driving large differences between SSD & HDD data
> >>> paths?
> >>
> >> As far as I know, there is no merging going on once requests are passed to
> >> the driver and added to an SQ. So this is beside the point.
> >> The current default block scheduler for NVMe SSDs is "none". This is
> >> decided based on the number of queues of the device. For NVMe drives that
> >> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> >> request queue will can fallback to the default spinning rust mq-deadline
> >> elevator. That will achieve command merging and LBA ordering needed for
> >> good performance on HDDs.
> > 
> > mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> > .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> > .queue_rq() basically always returns STS_OK.
> 
> I am confused: when an elevator is set, ->queue_rq() is called for requests
> obtained from the elevator (with e->type->ops.dispatch_request()), after
> the requests went through it. And merging will happen at that stage when
> new requests are inserted in the elevator.

When request is queued to lld via .queue_rq(), the request has been
removed from scheduler queue. And IO merge is just done inside or
against scheduler queue.

> 
> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> request is indeed requeued which offer more chances of further merging, but
> that is not the same as no merging happening.
> Am I missing your point here ?

BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
thought as device saturation feedback, then more requests can be
gathered in scheduler queue since we don't dequeue request from
scheduler queue when that happens, then IO merge is possible.

Without any device saturation feedback from driver, block layer just
dequeues request from scheduler queue with same speed of submission to
hardware, then no IO can be merged.

If you observe sequential IO on NVMe PCI, you will see no IO merge
basically.

 
Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  7:53             ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-13  7:53 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
> Ming,
> 
> On 2020/02/13 7:03, Ming Lei wrote:
> > On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> >> On 2020/02/12 4:01, Tim Walker wrote:
> >>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>
> >>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>>>> Background:
> >>>>>
> >>>>> NVMe specification has hardened over the decade and now NVMe devices
> >>>>> are well integrated into our customers’ systems. As we look forward,
> >>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>>>> stack, consolidating on a single access method for rotational and
> >>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>>>> costs, features and performance suitable for high-cap HDDs, and
> >>>>> optimal interoperability for storage automation, tiering, and
> >>>>> management. We will share some early conceptual results and proposed
> >>>>> salient design goals and challenges surrounding an NVMe HDD.
> >>>>
> >>>> HDD. performance is very sensitive to IO order. Could you provide some
> >>>> background info about NVMe HDD? Such as:
> >>>>
> >>>> - number of hw queues
> >>>> - hw queue depth
> >>>> - will NVMe sort/merge IO among all SQs or not?
> >>>>
> >>>>>
> >>>>>
> >>>>> Discussion Proposal:
> >>>>>
> >>>>> We’d like to share our views and solicit input on:
> >>>>>
> >>>>> -What Linux storage stack assumptions do we need to be aware of as we
> >>>>> develop these devices with drastically different performance
> >>>>> characteristics than traditional NAND? For example, what schedular or
> >>>>> device driver level changes will be needed to integrate NVMe HDDs?
> >>>>
> >>>> IO merge is often important for HDD. IO merge is usually triggered when
> >>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >>>> triggered for NVMe SSD.
> >>>>
> >>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >>>> writeback performance regression[1][2].
> >>>>
> >>>> What I am thinking is that if we need to switch to use independent IO
> >>>> path for handling SSD and HDD. IO, given the two mediums are so
> >>>> different from performance viewpoint.
> >>>>
> >>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Ming
> >>>>
> >>>
> >>> I would expect the drive would support a reasonable number of queues
> >>> and a relatively deep queue depth, more in line with NVMe practices
> >>> than SAS HDD's typical 128. But it probably doesn't make sense to
> >>> queue up thousands of commands on something as slow as an HDD, and
> >>> many customers keep queues < 32 for latency management.
> >>
> >> Exposing an HDD through multiple-queues each with a high queue depth is
> >> simply asking for troubles. Commands will end up spending so much time
> >> sitting in the queues that they will timeout. This can already be observed
> >> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> >> devices with high queue depth. Exercising these drives heavily leads to
> >> thousands of commands being queued and to timeouts. It is fairly easy to
> >> trigger this without a manual change to the QD. This is on my to-do list of
> >> fixes for some time now (lacking time to do it).
> > 
> > Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> > avoiding the issue, looks the driver simply assigns .can_queue to it,
> > then it isn't strange to see the timeout issue. If .can_queue is a bit
> > big, HDD. is easily saturated too long.
> > 
> >>
> >> NVMe HDDs need to have an interface setup that match their speed, that is,
> >> something like a SAS interface: *single* queue pair with a max QD of 256 or
> >> less depending on what the drive can take. Their is no TASK_SET_FULL
> >> notification on NVMe, so throttling has to come from the max QD of the SQ,
> >> which the drive will advertise to the host.
> >>
> >>> Merge and elevator are important to HDD performance. I don't believe
> >>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> >>> within a SQ without driving large differences between SSD & HDD data
> >>> paths?
> >>
> >> As far as I know, there is no merging going on once requests are passed to
> >> the driver and added to an SQ. So this is beside the point.
> >> The current default block scheduler for NVMe SSDs is "none". This is
> >> decided based on the number of queues of the device. For NVMe drives that
> >> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> >> request queue will can fallback to the default spinning rust mq-deadline
> >> elevator. That will achieve command merging and LBA ordering needed for
> >> good performance on HDDs.
> > 
> > mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> > .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> > .queue_rq() basically always returns STS_OK.
> 
> I am confused: when an elevator is set, ->queue_rq() is called for requests
> obtained from the elevator (with e->type->ops.dispatch_request()), after
> the requests went through it. And merging will happen at that stage when
> new requests are inserted in the elevator.

When request is queued to lld via .queue_rq(), the request has been
removed from scheduler queue. And IO merge is just done inside or
against scheduler queue.

> 
> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> request is indeed requeued which offer more chances of further merging, but
> that is not the same as no merging happening.
> Am I missing your point here ?

BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
thought as device saturation feedback, then more requests can be
gathered in scheduler queue since we don't dequeue request from
scheduler queue when that happens, then IO merge is possible.

Without any device saturation feedback from driver, block layer just
dequeues request from scheduler queue with same speed of submission to
hardware, then no IO can be merged.

If you observe sequential IO on NVMe PCI, you will see no IO merge
basically.

 
Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  7:53             ` Ming Lei
@ 2020-02-13  8:24               ` Damien Le Moal
  -1 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-13  8:24 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On 2020/02/13 16:54, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
>> Ming,
>>
>> On 2020/02/13 7:03, Ming Lei wrote:
>>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>>>> On 2020/02/12 4:01, Tim Walker wrote:
>>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>>>> Background:
>>>>>>>
>>>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>>>> are well integrated into our customers’ systems. As we look forward,
>>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>>>> stack, consolidating on a single access method for rotational and
>>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>>>> optimal interoperability for storage automation, tiering, and
>>>>>>> management. We will share some early conceptual results and proposed
>>>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>>>
>>>>>> HDD. performance is very sensitive to IO order. Could you provide some
>>>>>> background info about NVMe HDD? Such as:
>>>>>>
>>>>>> - number of hw queues
>>>>>> - hw queue depth
>>>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Discussion Proposal:
>>>>>>>
>>>>>>> We’d like to share our views and solicit input on:
>>>>>>>
>>>>>>> -What Linux storage stack assumptions do we need to be aware of as we
>>>>>>> develop these devices with drastically different performance
>>>>>>> characteristics than traditional NAND? For example, what schedular or
>>>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>>>
>>>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>>>>>> triggered for NVMe SSD.
>>>>>>
>>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>>>> writeback performance regression[1][2].
>>>>>>
>>>>>> What I am thinking is that if we need to switch to use independent IO
>>>>>> path for handling SSD and HDD. IO, given the two mediums are so
>>>>>> different from performance viewpoint.
>>>>>>
>>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>>>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>>
>>>>>
>>>>> I would expect the drive would support a reasonable number of queues
>>>>> and a relatively deep queue depth, more in line with NVMe practices
>>>>> than SAS HDD's typical 128. But it probably doesn't make sense to
>>>>> queue up thousands of commands on something as slow as an HDD, and
>>>>> many customers keep queues < 32 for latency management.
>>>>
>>>> Exposing an HDD through multiple-queues each with a high queue depth is
>>>> simply asking for troubles. Commands will end up spending so much time
>>>> sitting in the queues that they will timeout. This can already be observed
>>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
>>>> devices with high queue depth. Exercising these drives heavily leads to
>>>> thousands of commands being queued and to timeouts. It is fairly easy to
>>>> trigger this without a manual change to the QD. This is on my to-do list of
>>>> fixes for some time now (lacking time to do it).
>>>
>>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
>>> avoiding the issue, looks the driver simply assigns .can_queue to it,
>>> then it isn't strange to see the timeout issue. If .can_queue is a bit
>>> big, HDD. is easily saturated too long.
>>>
>>>>
>>>> NVMe HDDs need to have an interface setup that match their speed, that is,
>>>> something like a SAS interface: *single* queue pair with a max QD of 256 or
>>>> less depending on what the drive can take. Their is no TASK_SET_FULL
>>>> notification on NVMe, so throttling has to come from the max QD of the SQ,
>>>> which the drive will advertise to the host.
>>>>
>>>>> Merge and elevator are important to HDD performance. I don't believe
>>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>>>> within a SQ without driving large differences between SSD & HDD data
>>>>> paths?
>>>>
>>>> As far as I know, there is no merging going on once requests are passed to
>>>> the driver and added to an SQ. So this is beside the point.
>>>> The current default block scheduler for NVMe SSDs is "none". This is
>>>> decided based on the number of queues of the device. For NVMe drives that
>>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
>>>> request queue will can fallback to the default spinning rust mq-deadline
>>>> elevator. That will achieve command merging and LBA ordering needed for
>>>> good performance on HDDs.
>>>
>>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
>>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
>>> .queue_rq() basically always returns STS_OK.
>>
>> I am confused: when an elevator is set, ->queue_rq() is called for requests
>> obtained from the elevator (with e->type->ops.dispatch_request()), after
>> the requests went through it. And merging will happen at that stage when
>> new requests are inserted in the elevator.
> 
> When request is queued to lld via .queue_rq(), the request has been
> removed from scheduler queue. And IO merge is just done inside or
> against scheduler queue.

Yes, for incoming new BIOs, not for requests passed to the LLD.

>> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
>> request is indeed requeued which offer more chances of further merging, but
>> that is not the same as no merging happening.
>> Am I missing your point here ?
> 
> BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
> thought as device saturation feedback, then more requests can be
> gathered in scheduler queue since we don't dequeue request from
> scheduler queue when that happens, then IO merge is possible.
> 
> Without any device saturation feedback from driver, block layer just
> dequeues request from scheduler queue with same speed of submission to
> hardware, then no IO can be merged.

Got it. And since queue full will mean no more tags, submission will block
on get_request() and there will be no chance in the elevator to merge
anything (aside from opportunistic merging in plugs), isn't it ?
So I guess NVMe HDDs will need some tuning in this area.

> 
> If you observe sequential IO on NVMe PCI, you will see no IO merge
> basically.
> 
>  
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  8:24               ` Damien Le Moal
  0 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-13  8:24 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

On 2020/02/13 16:54, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
>> Ming,
>>
>> On 2020/02/13 7:03, Ming Lei wrote:
>>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>>>> On 2020/02/12 4:01, Tim Walker wrote:
>>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>>>> Background:
>>>>>>>
>>>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>>>> are well integrated into our customers’ systems. As we look forward,
>>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>>>> stack, consolidating on a single access method for rotational and
>>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>>>> optimal interoperability for storage automation, tiering, and
>>>>>>> management. We will share some early conceptual results and proposed
>>>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>>>
>>>>>> HDD. performance is very sensitive to IO order. Could you provide some
>>>>>> background info about NVMe HDD? Such as:
>>>>>>
>>>>>> - number of hw queues
>>>>>> - hw queue depth
>>>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Discussion Proposal:
>>>>>>>
>>>>>>> We’d like to share our views and solicit input on:
>>>>>>>
>>>>>>> -What Linux storage stack assumptions do we need to be aware of as we
>>>>>>> develop these devices with drastically different performance
>>>>>>> characteristics than traditional NAND? For example, what schedular or
>>>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>>>
>>>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
>>>>>> triggered for NVMe SSD.
>>>>>>
>>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>>>> writeback performance regression[1][2].
>>>>>>
>>>>>> What I am thinking is that if we need to switch to use independent IO
>>>>>> path for handling SSD and HDD. IO, given the two mediums are so
>>>>>> different from performance viewpoint.
>>>>>>
>>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
>>>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>>
>>>>>
>>>>> I would expect the drive would support a reasonable number of queues
>>>>> and a relatively deep queue depth, more in line with NVMe practices
>>>>> than SAS HDD's typical 128. But it probably doesn't make sense to
>>>>> queue up thousands of commands on something as slow as an HDD, and
>>>>> many customers keep queues < 32 for latency management.
>>>>
>>>> Exposing an HDD through multiple-queues each with a high queue depth is
>>>> simply asking for troubles. Commands will end up spending so much time
>>>> sitting in the queues that they will timeout. This can already be observed
>>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
>>>> devices with high queue depth. Exercising these drives heavily leads to
>>>> thousands of commands being queued and to timeouts. It is fairly easy to
>>>> trigger this without a manual change to the QD. This is on my to-do list of
>>>> fixes for some time now (lacking time to do it).
>>>
>>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
>>> avoiding the issue, looks the driver simply assigns .can_queue to it,
>>> then it isn't strange to see the timeout issue. If .can_queue is a bit
>>> big, HDD. is easily saturated too long.
>>>
>>>>
>>>> NVMe HDDs need to have an interface setup that match their speed, that is,
>>>> something like a SAS interface: *single* queue pair with a max QD of 256 or
>>>> less depending on what the drive can take. Their is no TASK_SET_FULL
>>>> notification on NVMe, so throttling has to come from the max QD of the SQ,
>>>> which the drive will advertise to the host.
>>>>
>>>>> Merge and elevator are important to HDD performance. I don't believe
>>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>>>> within a SQ without driving large differences between SSD & HDD data
>>>>> paths?
>>>>
>>>> As far as I know, there is no merging going on once requests are passed to
>>>> the driver and added to an SQ. So this is beside the point.
>>>> The current default block scheduler for NVMe SSDs is "none". This is
>>>> decided based on the number of queues of the device. For NVMe drives that
>>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
>>>> request queue will can fallback to the default spinning rust mq-deadline
>>>> elevator. That will achieve command merging and LBA ordering needed for
>>>> good performance on HDDs.
>>>
>>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
>>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
>>> .queue_rq() basically always returns STS_OK.
>>
>> I am confused: when an elevator is set, ->queue_rq() is called for requests
>> obtained from the elevator (with e->type->ops.dispatch_request()), after
>> the requests went through it. And merging will happen at that stage when
>> new requests are inserted in the elevator.
> 
> When request is queued to lld via .queue_rq(), the request has been
> removed from scheduler queue. And IO merge is just done inside or
> against scheduler queue.

Yes, for incoming new BIOs, not for requests passed to the LLD.

>> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
>> request is indeed requeued which offer more chances of further merging, but
>> that is not the same as no merging happening.
>> Am I missing your point here ?
> 
> BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
> thought as device saturation feedback, then more requests can be
> gathered in scheduler queue since we don't dequeue request from
> scheduler queue when that happens, then IO merge is possible.
> 
> Without any device saturation feedback from driver, block layer just
> dequeues request from scheduler queue with same speed of submission to
> hardware, then no IO can be merged.

Got it. And since queue full will mean no more tags, submission will block
on get_request() and there will be no chance in the elevator to merge
anything (aside from opportunistic merging in plugs), isn't it ?
So I guess NVMe HDDs will need some tuning in this area.

> 
> If you observe sequential IO on NVMe PCI, you will see no IO merge
> basically.
> 
>  
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  8:24               ` Damien Le Moal
@ 2020-02-13  8:34                 ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-13  8:34 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> On 2020/02/13 16:54, Ming Lei wrote:
> > On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
> >> Ming,
> >>
> >> On 2020/02/13 7:03, Ming Lei wrote:
> >>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> >>>> On 2020/02/12 4:01, Tim Walker wrote:
> >>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>>>
> >>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>>>>>> Background:
> >>>>>>>
> >>>>>>> NVMe specification has hardened over the decade and now NVMe devices
> >>>>>>> are well integrated into our customers’ systems. As we look forward,
> >>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>>>>>> stack, consolidating on a single access method for rotational and
> >>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>>>>>> costs, features and performance suitable for high-cap HDDs, and
> >>>>>>> optimal interoperability for storage automation, tiering, and
> >>>>>>> management. We will share some early conceptual results and proposed
> >>>>>>> salient design goals and challenges surrounding an NVMe HDD.
> >>>>>>
> >>>>>> HDD. performance is very sensitive to IO order. Could you provide some
> >>>>>> background info about NVMe HDD? Such as:
> >>>>>>
> >>>>>> - number of hw queues
> >>>>>> - hw queue depth
> >>>>>> - will NVMe sort/merge IO among all SQs or not?
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Discussion Proposal:
> >>>>>>>
> >>>>>>> We’d like to share our views and solicit input on:
> >>>>>>>
> >>>>>>> -What Linux storage stack assumptions do we need to be aware of as we
> >>>>>>> develop these devices with drastically different performance
> >>>>>>> characteristics than traditional NAND? For example, what schedular or
> >>>>>>> device driver level changes will be needed to integrate NVMe HDDs?
> >>>>>>
> >>>>>> IO merge is often important for HDD. IO merge is usually triggered when
> >>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >>>>>> triggered for NVMe SSD.
> >>>>>>
> >>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >>>>>> writeback performance regression[1][2].
> >>>>>>
> >>>>>> What I am thinking is that if we need to switch to use independent IO
> >>>>>> path for handling SSD and HDD. IO, given the two mediums are so
> >>>>>> different from performance viewpoint.
> >>>>>>
> >>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >>>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ming
> >>>>>>
> >>>>>
> >>>>> I would expect the drive would support a reasonable number of queues
> >>>>> and a relatively deep queue depth, more in line with NVMe practices
> >>>>> than SAS HDD's typical 128. But it probably doesn't make sense to
> >>>>> queue up thousands of commands on something as slow as an HDD, and
> >>>>> many customers keep queues < 32 for latency management.
> >>>>
> >>>> Exposing an HDD through multiple-queues each with a high queue depth is
> >>>> simply asking for troubles. Commands will end up spending so much time
> >>>> sitting in the queues that they will timeout. This can already be observed
> >>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> >>>> devices with high queue depth. Exercising these drives heavily leads to
> >>>> thousands of commands being queued and to timeouts. It is fairly easy to
> >>>> trigger this without a manual change to the QD. This is on my to-do list of
> >>>> fixes for some time now (lacking time to do it).
> >>>
> >>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> >>> avoiding the issue, looks the driver simply assigns .can_queue to it,
> >>> then it isn't strange to see the timeout issue. If .can_queue is a bit
> >>> big, HDD. is easily saturated too long.
> >>>
> >>>>
> >>>> NVMe HDDs need to have an interface setup that match their speed, that is,
> >>>> something like a SAS interface: *single* queue pair with a max QD of 256 or
> >>>> less depending on what the drive can take. Their is no TASK_SET_FULL
> >>>> notification on NVMe, so throttling has to come from the max QD of the SQ,
> >>>> which the drive will advertise to the host.
> >>>>
> >>>>> Merge and elevator are important to HDD performance. I don't believe
> >>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> >>>>> within a SQ without driving large differences between SSD & HDD data
> >>>>> paths?
> >>>>
> >>>> As far as I know, there is no merging going on once requests are passed to
> >>>> the driver and added to an SQ. So this is beside the point.
> >>>> The current default block scheduler for NVMe SSDs is "none". This is
> >>>> decided based on the number of queues of the device. For NVMe drives that
> >>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> >>>> request queue will can fallback to the default spinning rust mq-deadline
> >>>> elevator. That will achieve command merging and LBA ordering needed for
> >>>> good performance on HDDs.
> >>>
> >>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> >>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> >>> .queue_rq() basically always returns STS_OK.
> >>
> >> I am confused: when an elevator is set, ->queue_rq() is called for requests
> >> obtained from the elevator (with e->type->ops.dispatch_request()), after
> >> the requests went through it. And merging will happen at that stage when
> >> new requests are inserted in the elevator.
> > 
> > When request is queued to lld via .queue_rq(), the request has been
> > removed from scheduler queue. And IO merge is just done inside or
> > against scheduler queue.
> 
> Yes, for incoming new BIOs, not for requests passed to the LLD.
> 
> >> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> >> request is indeed requeued which offer more chances of further merging, but
> >> that is not the same as no merging happening.
> >> Am I missing your point here ?
> > 
> > BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
> > thought as device saturation feedback, then more requests can be
> > gathered in scheduler queue since we don't dequeue request from
> > scheduler queue when that happens, then IO merge is possible.
> > 
> > Without any device saturation feedback from driver, block layer just
> > dequeues request from scheduler queue with same speed of submission to
> > hardware, then no IO can be merged.
> 
> Got it. And since queue full will mean no more tags, submission will block
> on get_request() and there will be no chance in the elevator to merge
> anything (aside from opportunistic merging in plugs), isn't it ?
> So I guess NVMe HDDs will need some tuning in this area.

scheduler queue depth is usually 2 times of hw queue depth, so requests
ar usually enough for merging.

For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
meantime the hw queue depth is big enough, so no chance to trigger merge.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13  8:34                 ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-13  8:34 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, linux-nvme, Tim Walker, linux-scsi

On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> On 2020/02/13 16:54, Ming Lei wrote:
> > On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
> >> Ming,
> >>
> >> On 2020/02/13 7:03, Ming Lei wrote:
> >>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
> >>>> On 2020/02/12 4:01, Tim Walker wrote:
> >>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>>>
> >>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> >>>>>>> Background:
> >>>>>>>
> >>>>>>> NVMe specification has hardened over the decade and now NVMe devices
> >>>>>>> are well integrated into our customers’ systems. As we look forward,
> >>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> >>>>>>> stack, consolidating on a single access method for rotational and
> >>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
> >>>>>>> costs, features and performance suitable for high-cap HDDs, and
> >>>>>>> optimal interoperability for storage automation, tiering, and
> >>>>>>> management. We will share some early conceptual results and proposed
> >>>>>>> salient design goals and challenges surrounding an NVMe HDD.
> >>>>>>
> >>>>>> HDD. performance is very sensitive to IO order. Could you provide some
> >>>>>> background info about NVMe HDD? Such as:
> >>>>>>
> >>>>>> - number of hw queues
> >>>>>> - hw queue depth
> >>>>>> - will NVMe sort/merge IO among all SQs or not?
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Discussion Proposal:
> >>>>>>>
> >>>>>>> We’d like to share our views and solicit input on:
> >>>>>>>
> >>>>>>> -What Linux storage stack assumptions do we need to be aware of as we
> >>>>>>> develop these devices with drastically different performance
> >>>>>>> characteristics than traditional NAND? For example, what schedular or
> >>>>>>> device driver level changes will be needed to integrate NVMe HDDs?
> >>>>>>
> >>>>>> IO merge is often important for HDD. IO merge is usually triggered when
> >>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be
> >>>>>> triggered for NVMe SSD.
> >>>>>>
> >>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
> >>>>>> writeback performance regression[1][2].
> >>>>>>
> >>>>>> What I am thinking is that if we need to switch to use independent IO
> >>>>>> path for handling SSD and HDD. IO, given the two mediums are so
> >>>>>> different from performance viewpoint.
> >>>>>>
> >>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e=
> >>>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e=
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ming
> >>>>>>
> >>>>>
> >>>>> I would expect the drive would support a reasonable number of queues
> >>>>> and a relatively deep queue depth, more in line with NVMe practices
> >>>>> than SAS HDD's typical 128. But it probably doesn't make sense to
> >>>>> queue up thousands of commands on something as slow as an HDD, and
> >>>>> many customers keep queues < 32 for latency management.
> >>>>
> >>>> Exposing an HDD through multiple-queues each with a high queue depth is
> >>>> simply asking for troubles. Commands will end up spending so much time
> >>>> sitting in the queues that they will timeout. This can already be observed
> >>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block
> >>>> devices with high queue depth. Exercising these drives heavily leads to
> >>>> thousands of commands being queued and to timeouts. It is fairly easy to
> >>>> trigger this without a manual change to the QD. This is on my to-do list of
> >>>> fixes for some time now (lacking time to do it).
> >>>
> >>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for
> >>> avoiding the issue, looks the driver simply assigns .can_queue to it,
> >>> then it isn't strange to see the timeout issue. If .can_queue is a bit
> >>> big, HDD. is easily saturated too long.
> >>>
> >>>>
> >>>> NVMe HDDs need to have an interface setup that match their speed, that is,
> >>>> something like a SAS interface: *single* queue pair with a max QD of 256 or
> >>>> less depending on what the drive can take. Their is no TASK_SET_FULL
> >>>> notification on NVMe, so throttling has to come from the max QD of the SQ,
> >>>> which the drive will advertise to the host.
> >>>>
> >>>>> Merge and elevator are important to HDD performance. I don't believe
> >>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> >>>>> within a SQ without driving large differences between SSD & HDD data
> >>>>> paths?
> >>>>
> >>>> As far as I know, there is no merging going on once requests are passed to
> >>>> the driver and added to an SQ. So this is beside the point.
> >>>> The current default block scheduler for NVMe SSDs is "none". This is
> >>>> decided based on the number of queues of the device. For NVMe drives that
> >>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> >>>> request queue will can fallback to the default spinning rust mq-deadline
> >>>> elevator. That will achieve command merging and LBA ordering needed for
> >>>> good performance on HDDs.
> >>>
> >>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> >>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's
> >>> .queue_rq() basically always returns STS_OK.
> >>
> >> I am confused: when an elevator is set, ->queue_rq() is called for requests
> >> obtained from the elevator (with e->type->ops.dispatch_request()), after
> >> the requests went through it. And merging will happen at that stage when
> >> new requests are inserted in the elevator.
> > 
> > When request is queued to lld via .queue_rq(), the request has been
> > removed from scheduler queue. And IO merge is just done inside or
> > against scheduler queue.
> 
> Yes, for incoming new BIOs, not for requests passed to the LLD.
> 
> >> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> >> request is indeed requeued which offer more chances of further merging, but
> >> that is not the same as no merging happening.
> >> Am I missing your point here ?
> > 
> > BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be
> > thought as device saturation feedback, then more requests can be
> > gathered in scheduler queue since we don't dequeue request from
> > scheduler queue when that happens, then IO merge is possible.
> > 
> > Without any device saturation feedback from driver, block layer just
> > dequeues request from scheduler queue with same speed of submission to
> > hardware, then no IO can be merged.
> 
> Got it. And since queue full will mean no more tags, submission will block
> on get_request() and there will be no chance in the elevator to merge
> anything (aside from opportunistic merging in plugs), isn't it ?
> So I guess NVMe HDDs will need some tuning in this area.

scheduler queue depth is usually 2 times of hw queue depth, so requests
ar usually enough for merging.

For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
meantime the hw queue depth is big enough, so no chance to trigger merge.

Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  8:34                 ` Ming Lei
@ 2020-02-13 16:30                   ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-13 16:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: Damien Le Moal, Tim Walker, linux-block, linux-scsi, linux-nvme

On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> > Got it. And since queue full will mean no more tags, submission will block
> > on get_request() and there will be no chance in the elevator to merge
> > anything (aside from opportunistic merging in plugs), isn't it ?
> > So I guess NVMe HDDs will need some tuning in this area.
> 
> scheduler queue depth is usually 2 times of hw queue depth, so requests
> ar usually enough for merging.
> 
> For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
> meantime the hw queue depth is big enough, so no chance to trigger merge.

Most NVMe devices contain a single namespace anyway, so the shared tag
queue depth is effectively the ns queue depth, and an NVMe HDD should
advertise queue count and depth capabilities orders of magnitude lower
than what we're used to with nvme SSDs. That should get merging and
BLK_STS_DEV_RESOURCE handling to occur as desired, right?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-13 16:30                   ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-13 16:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Damien Le Moal, linux-nvme, Tim Walker, linux-scsi

On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> > Got it. And since queue full will mean no more tags, submission will block
> > on get_request() and there will be no chance in the elevator to merge
> > anything (aside from opportunistic merging in plugs), isn't it ?
> > So I guess NVMe HDDs will need some tuning in this area.
> 
> scheduler queue depth is usually 2 times of hw queue depth, so requests
> ar usually enough for merging.
> 
> For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
> meantime the hw queue depth is big enough, so no chance to trigger merge.

Most NVMe devices contain a single namespace anyway, so the shared tag
queue depth is effectively the ns queue depth, and an NVMe HDD should
advertise queue count and depth capabilities orders of magnitude lower
than what we're used to with nvme SSDs. That should get merging and
BLK_STS_DEV_RESOURCE handling to occur as desired, right?

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  3:02         ` Martin K. Petersen
@ 2020-02-14  0:35           ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-14  0:35 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, linux-block, linux-nvme, Tim Walker, linux-scsi

On Wed, Feb 12, 2020 at 10:02:08PM -0500, Martin K. Petersen wrote:
> 
> Damien,
> 
> > Exposing an HDD through multiple-queues each with a high queue depth
> > is simply asking for troubles. Commands will end up spending so much
> > time sitting in the queues that they will timeout.
> 
> Yep!
> 
> > This can already be observed with the smartpqi SAS HBA which exposes
> > single drives as multiqueue block devices with high queue depth.
> > Exercising these drives heavily leads to thousands of commands being
> > queued and to timeouts. It is fairly easy to trigger this without a
> > manual change to the QD. This is on my to-do list of fixes for some
> > time now (lacking time to do it).
> 
> Controllers that queue internally are very susceptible to application or
> filesystem timeouts when drives are struggling to keep up.
> 
> > NVMe HDDs need to have an interface setup that match their speed, that
> > is, something like a SAS interface: *single* queue pair with a max QD
> > of 256 or less depending on what the drive can take. Their is no
> > TASK_SET_FULL notification on NVMe, so throttling has to come from the
> > max QD of the SQ, which the drive will advertise to the host.
> 
> At the very minimum we'll need low queue depths. But I have my doubts
> whether we can make this work well enough without some kind of TASK SET
> FULL style AER to throttle the I/O.

Looks 32 or sort of works fine for HDD, and 128 is good enough for
SSD.

And this number should drive enough parallelism, meantime timeout can be
avoided most of times if not too small timeout value is set. But SCSI
still allows to adjust the queue depth via sysfs.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14  0:35           ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-14  0:35 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: linux-block, Damien Le Moal, Tim Walker, linux-nvme, linux-scsi

On Wed, Feb 12, 2020 at 10:02:08PM -0500, Martin K. Petersen wrote:
> 
> Damien,
> 
> > Exposing an HDD through multiple-queues each with a high queue depth
> > is simply asking for troubles. Commands will end up spending so much
> > time sitting in the queues that they will timeout.
> 
> Yep!
> 
> > This can already be observed with the smartpqi SAS HBA which exposes
> > single drives as multiqueue block devices with high queue depth.
> > Exercising these drives heavily leads to thousands of commands being
> > queued and to timeouts. It is fairly easy to trigger this without a
> > manual change to the QD. This is on my to-do list of fixes for some
> > time now (lacking time to do it).
> 
> Controllers that queue internally are very susceptible to application or
> filesystem timeouts when drives are struggling to keep up.
> 
> > NVMe HDDs need to have an interface setup that match their speed, that
> > is, something like a SAS interface: *single* queue pair with a max QD
> > of 256 or less depending on what the drive can take. Their is no
> > TASK_SET_FULL notification on NVMe, so throttling has to come from the
> > max QD of the SQ, which the drive will advertise to the host.
> 
> At the very minimum we'll need low queue depths. But I have my doubts
> whether we can make this work well enough without some kind of TASK SET
> FULL style AER to throttle the I/O.

Looks 32 or sort of works fine for HDD, and 128 is good enough for
SSD.

And this number should drive enough parallelism, meantime timeout can be
avoided most of times if not too small timeout value is set. But SCSI
still allows to adjust the queue depth via sysfs.

Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13 16:30                   ` Keith Busch
@ 2020-02-14  0:40                     ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-14  0:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, Damien Le Moal, linux-nvme, Tim Walker, linux-scsi

On Fri, Feb 14, 2020 at 01:30:38AM +0900, Keith Busch wrote:
> On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote:
> > On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> > > Got it. And since queue full will mean no more tags, submission will block
> > > on get_request() and there will be no chance in the elevator to merge
> > > anything (aside from opportunistic merging in plugs), isn't it ?
> > > So I guess NVMe HDDs will need some tuning in this area.
> > 
> > scheduler queue depth is usually 2 times of hw queue depth, so requests
> > ar usually enough for merging.
> > 
> > For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
> > meantime the hw queue depth is big enough, so no chance to trigger merge.
> 
> Most NVMe devices contain a single namespace anyway, so the shared tag
> queue depth is effectively the ns queue depth, and an NVMe HDD should
> advertise queue count and depth capabilities orders of magnitude lower
> than what we're used to with nvme SSDs. That should get merging and
> BLK_STS_DEV_RESOURCE handling to occur as desired, right?

Right.

The advertised queue depth might serve two purposes:

1) reflect the namespace's actual queueing capability, so block layer's merging
is possible

2) avoid timeout caused by too many in-flight IO


Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14  0:40                     ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-14  0:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, Damien Le Moal, Tim Walker, linux-nvme, linux-scsi

On Fri, Feb 14, 2020 at 01:30:38AM +0900, Keith Busch wrote:
> On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote:
> > On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> > > Got it. And since queue full will mean no more tags, submission will block
> > > on get_request() and there will be no chance in the elevator to merge
> > > anything (aside from opportunistic merging in plugs), isn't it ?
> > > So I guess NVMe HDDs will need some tuning in this area.
> > 
> > scheduler queue depth is usually 2 times of hw queue depth, so requests
> > ar usually enough for merging.
> > 
> > For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
> > meantime the hw queue depth is big enough, so no chance to trigger merge.
> 
> Most NVMe devices contain a single namespace anyway, so the shared tag
> queue depth is effectively the ns queue depth, and an NVMe HDD should
> advertise queue count and depth capabilities orders of magnitude lower
> than what we're used to with nvme SSDs. That should get merging and
> BLK_STS_DEV_RESOURCE handling to occur as desired, right?

Right.

The advertised queue depth might serve two purposes:

1) reflect the namespace's actual queueing capability, so block layer's merging
is possible

2) avoid timeout caused by too many in-flight IO


Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-13  4:17             ` Martin K. Petersen
@ 2020-02-14  7:32               ` Hannes Reinecke
  -1 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2020-02-14  7:32 UTC (permalink / raw)
  To: Martin K. Petersen, Tim Walker
  Cc: Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme

On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> 
> Tim,
> 
>> SAS currently supports QD256, but the general consensus is that most
>> customers don't run anywhere near that deep. Does it help the system
>> for the HD to report a limited (256) max queue depth, or is it really
>> up to the system to decide many commands to queue?
> 
> People often artificially lower the queue depth to avoid timeouts. The
> default timeout is 30 seconds from an I/O is queued. However, many
> enterprise applications set the timeout to 3-5 seconds. Which means that
> with deep queues you'll quickly start seeing timeouts if a drive
> temporarily is having issues keeping up (media errors, excessive spare
> track seeks, etc.).
> 
> Well-behaved devices will return QF/TSF if they have transient resource
> starvation or exceed internal QoS limits. QF will cause the SCSI stack
> to reduce the number of I/Os in flight. This allows the drive to recover
> from its congested state and reduces the potential of application and
> filesystem timeouts.
> 
This may even be a chance to revisit QoS / queue busy handling.
NVMe has this SQ head pointer mechanism which was supposed to handle
this kind of situations, but to my knowledge no-one has been
implementing it.
Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14  7:32               ` Hannes Reinecke
  0 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2020-02-14  7:32 UTC (permalink / raw)
  To: Martin K. Petersen, Tim Walker
  Cc: linux-block, Damien Le Moal, linux-nvme, linux-scsi, Ming Lei

On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> 
> Tim,
> 
>> SAS currently supports QD256, but the general consensus is that most
>> customers don't run anywhere near that deep. Does it help the system
>> for the HD to report a limited (256) max queue depth, or is it really
>> up to the system to decide many commands to queue?
> 
> People often artificially lower the queue depth to avoid timeouts. The
> default timeout is 30 seconds from an I/O is queued. However, many
> enterprise applications set the timeout to 3-5 seconds. Which means that
> with deep queues you'll quickly start seeing timeouts if a drive
> temporarily is having issues keeping up (media errors, excessive spare
> track seeks, etc.).
> 
> Well-behaved devices will return QF/TSF if they have transient resource
> starvation or exceed internal QoS limits. QF will cause the SCSI stack
> to reduce the number of I/Os in flight. This allows the drive to recover
> from its congested state and reduces the potential of application and
> filesystem timeouts.
> 
This may even be a chance to revisit QoS / queue busy handling.
NVMe has this SQ head pointer mechanism which was supposed to handle
this kind of situations, but to my knowledge no-one has been
implementing it.
Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-14  7:32               ` Hannes Reinecke
@ 2020-02-14 14:40                 ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-14 14:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme

On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > People often artificially lower the queue depth to avoid timeouts. The
> > default timeout is 30 seconds from an I/O is queued. However, many
> > enterprise applications set the timeout to 3-5 seconds. Which means that
> > with deep queues you'll quickly start seeing timeouts if a drive
> > temporarily is having issues keeping up (media errors, excessive spare
> > track seeks, etc.).
> > 
> > Well-behaved devices will return QF/TSF if they have transient resource
> > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > to reduce the number of I/Os in flight. This allows the drive to recover
> > from its congested state and reduces the potential of application and
> > filesystem timeouts.
> > 
> This may even be a chance to revisit QoS / queue busy handling.
> NVMe has this SQ head pointer mechanism which was supposed to handle
> this kind of situations, but to my knowledge no-one has been
> implementing it.
> Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.

We don't need that because we don't allocate enough tags to potentially
wrap the tail past the head. If you can allocate a tag, the queue is not
full. And convesely, no tag == queue full.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14 14:40                 ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-14 14:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, Tim Walker,
	linux-nvme, Ming Lei, linux-block

On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > People often artificially lower the queue depth to avoid timeouts. The
> > default timeout is 30 seconds from an I/O is queued. However, many
> > enterprise applications set the timeout to 3-5 seconds. Which means that
> > with deep queues you'll quickly start seeing timeouts if a drive
> > temporarily is having issues keeping up (media errors, excessive spare
> > track seeks, etc.).
> > 
> > Well-behaved devices will return QF/TSF if they have transient resource
> > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > to reduce the number of I/Os in flight. This allows the drive to recover
> > from its congested state and reduces the potential of application and
> > filesystem timeouts.
> > 
> This may even be a chance to revisit QoS / queue busy handling.
> NVMe has this SQ head pointer mechanism which was supposed to handle
> this kind of situations, but to my knowledge no-one has been
> implementing it.
> Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.

We don't need that because we don't allocate enough tags to potentially
wrap the tail past the head. If you can allocate a tag, the queue is not
full. And convesely, no tag == queue full.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-14 14:40                 ` Keith Busch
@ 2020-02-14 16:04                   ` Hannes Reinecke
  -1 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2020-02-14 16:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme

On 2/14/20 3:40 PM, Keith Busch wrote:
> On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
>> On 2/13/20 5:17 AM, Martin K. Petersen wrote:
>>> People often artificially lower the queue depth to avoid timeouts. The
>>> default timeout is 30 seconds from an I/O is queued. However, many
>>> enterprise applications set the timeout to 3-5 seconds. Which means that
>>> with deep queues you'll quickly start seeing timeouts if a drive
>>> temporarily is having issues keeping up (media errors, excessive spare
>>> track seeks, etc.).
>>>
>>> Well-behaved devices will return QF/TSF if they have transient resource
>>> starvation or exceed internal QoS limits. QF will cause the SCSI stack
>>> to reduce the number of I/Os in flight. This allows the drive to recover
>>> from its congested state and reduces the potential of application and
>>> filesystem timeouts.
>>>
>> This may even be a chance to revisit QoS / queue busy handling.
>> NVMe has this SQ head pointer mechanism which was supposed to handle
>> this kind of situations, but to my knowledge no-one has been
>> implementing it.
>> Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> 
> We don't need that because we don't allocate enough tags to potentially
> wrap the tail past the head. If you can allocate a tag, the queue is not
> full. And convesely, no tag == queue full.
> 
It's not a problem on our side.
It's a problem on the target/controller side.
The target/controller might have a need to throttle I/O (due to QoS 
settings or competing resources from other hosts), but currently no 
means of signalling that to the host.
Which, incidentally, is the underlying reason for the DNR handling 
discussion we had; NetApp tried to model QoS by sending "Namespace not 
ready" without the DNR bit set, which of course is a totally different 
use-case as the typical 'Namespace not ready' response we get (with the 
DNR bit set) when a namespace was unmapped.

And that is where SQ head pointer updates comes in; it would allow the 
controller to signal back to the host that it should hold off sending 
I/O for a bit.
So this could / might be used for NVMe HDDs, too, which also might have 
a need to signal back to the host that I/Os should be throttled...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14 16:04                   ` Hannes Reinecke
  0 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2020-02-14 16:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, Tim Walker,
	linux-nvme, Ming Lei, linux-block

On 2/14/20 3:40 PM, Keith Busch wrote:
> On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
>> On 2/13/20 5:17 AM, Martin K. Petersen wrote:
>>> People often artificially lower the queue depth to avoid timeouts. The
>>> default timeout is 30 seconds from an I/O is queued. However, many
>>> enterprise applications set the timeout to 3-5 seconds. Which means that
>>> with deep queues you'll quickly start seeing timeouts if a drive
>>> temporarily is having issues keeping up (media errors, excessive spare
>>> track seeks, etc.).
>>>
>>> Well-behaved devices will return QF/TSF if they have transient resource
>>> starvation or exceed internal QoS limits. QF will cause the SCSI stack
>>> to reduce the number of I/Os in flight. This allows the drive to recover
>>> from its congested state and reduces the potential of application and
>>> filesystem timeouts.
>>>
>> This may even be a chance to revisit QoS / queue busy handling.
>> NVMe has this SQ head pointer mechanism which was supposed to handle
>> this kind of situations, but to my knowledge no-one has been
>> implementing it.
>> Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> 
> We don't need that because we don't allocate enough tags to potentially
> wrap the tail past the head. If you can allocate a tag, the queue is not
> full. And convesely, no tag == queue full.
> 
It's not a problem on our side.
It's a problem on the target/controller side.
The target/controller might have a need to throttle I/O (due to QoS 
settings or competing resources from other hosts), but currently no 
means of signalling that to the host.
Which, incidentally, is the underlying reason for the DNR handling 
discussion we had; NetApp tried to model QoS by sending "Namespace not 
ready" without the DNR bit set, which of course is a totally different 
use-case as the typical 'Namespace not ready' response we get (with the 
DNR bit set) when a namespace was unmapped.

And that is where SQ head pointer updates comes in; it would allow the 
controller to signal back to the host that it should hold off sending 
I/O for a bit.
So this could / might be used for NVMe HDDs, too, which also might have 
a need to signal back to the host that I/Os should be throttled...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-14 16:04                   ` Hannes Reinecke
@ 2020-02-14 17:05                     ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-14 17:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme

On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote:
> On 2/14/20 3:40 PM, Keith Busch wrote:
> > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> > > On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > > > People often artificially lower the queue depth to avoid timeouts. The
> > > > default timeout is 30 seconds from an I/O is queued. However, many
> > > > enterprise applications set the timeout to 3-5 seconds. Which means that
> > > > with deep queues you'll quickly start seeing timeouts if a drive
> > > > temporarily is having issues keeping up (media errors, excessive spare
> > > > track seeks, etc.).
> > > > 
> > > > Well-behaved devices will return QF/TSF if they have transient resource
> > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > > > to reduce the number of I/Os in flight. This allows the drive to recover
> > > > from its congested state and reduces the potential of application and
> > > > filesystem timeouts.
> > > > 
> > > This may even be a chance to revisit QoS / queue busy handling.
> > > NVMe has this SQ head pointer mechanism which was supposed to handle
> > > this kind of situations, but to my knowledge no-one has been
> > > implementing it.
> > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> > 
> > We don't need that because we don't allocate enough tags to potentially
> > wrap the tail past the head. If you can allocate a tag, the queue is not
> > full. And convesely, no tag == queue full.
> > 
> It's not a problem on our side.
> It's a problem on the target/controller side.
> The target/controller might have a need to throttle I/O (due to QoS settings
> or competing resources from other hosts), but currently no means of
> signalling that to the host.
> Which, incidentally, is the underlying reason for the DNR handling
> discussion we had; NetApp tried to model QoS by sending "Namespace not
> ready" without the DNR bit set, which of course is a totally different
> use-case as the typical 'Namespace not ready' response we get (with the DNR
> bit set) when a namespace was unmapped.
> 
> And that is where SQ head pointer updates comes in; it would allow the
> controller to signal back to the host that it should hold off sending I/O
> for a bit.
> So this could / might be used for NVMe HDDs, too, which also might have a
> need to signal back to the host that I/Os should be throttled...

Okay, I see. I think this needs a new nvme AER notice as Martin
suggested. The desired host behavior is simiilar to what we do with a
"firmware activation notice" where we temporarily quiesce new requests
and reset IO timeouts for previously dispatched requests. Perhaps tie
this to the CSTS.PP register as well.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-14 17:05                     ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-14 17:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, Tim Walker,
	linux-nvme, Ming Lei, linux-block

On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote:
> On 2/14/20 3:40 PM, Keith Busch wrote:
> > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> > > On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > > > People often artificially lower the queue depth to avoid timeouts. The
> > > > default timeout is 30 seconds from an I/O is queued. However, many
> > > > enterprise applications set the timeout to 3-5 seconds. Which means that
> > > > with deep queues you'll quickly start seeing timeouts if a drive
> > > > temporarily is having issues keeping up (media errors, excessive spare
> > > > track seeks, etc.).
> > > > 
> > > > Well-behaved devices will return QF/TSF if they have transient resource
> > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > > > to reduce the number of I/Os in flight. This allows the drive to recover
> > > > from its congested state and reduces the potential of application and
> > > > filesystem timeouts.
> > > > 
> > > This may even be a chance to revisit QoS / queue busy handling.
> > > NVMe has this SQ head pointer mechanism which was supposed to handle
> > > this kind of situations, but to my knowledge no-one has been
> > > implementing it.
> > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> > 
> > We don't need that because we don't allocate enough tags to potentially
> > wrap the tail past the head. If you can allocate a tag, the queue is not
> > full. And convesely, no tag == queue full.
> > 
> It's not a problem on our side.
> It's a problem on the target/controller side.
> The target/controller might have a need to throttle I/O (due to QoS settings
> or competing resources from other hosts), but currently no means of
> signalling that to the host.
> Which, incidentally, is the underlying reason for the DNR handling
> discussion we had; NetApp tried to model QoS by sending "Namespace not
> ready" without the DNR bit set, which of course is a totally different
> use-case as the typical 'Namespace not ready' response we get (with the DNR
> bit set) when a namespace was unmapped.
> 
> And that is where SQ head pointer updates comes in; it would allow the
> controller to signal back to the host that it should hold off sending I/O
> for a bit.
> So this could / might be used for NVMe HDDs, too, which also might have a
> need to signal back to the host that I/Os should be throttled...

Okay, I see. I think this needs a new nvme AER notice as Martin
suggested. The desired host behavior is simiilar to what we do with a
"firmware activation notice" where we temporarily quiesce new requests
and reset IO timeouts for previously dispatched requests. Perhaps tie
this to the CSTS.PP register as well.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-14 17:05                     ` Keith Busch
@ 2020-02-18 15:54                       ` Tim Walker
  -1 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-18 15:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme

On Fri, Feb 14, 2020 at 12:05 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote:
> > On 2/14/20 3:40 PM, Keith Busch wrote:
> > > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> > > > On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > > > > People often artificially lower the queue depth to avoid timeouts. The
> > > > > default timeout is 30 seconds from an I/O is queued. However, many
> > > > > enterprise applications set the timeout to 3-5 seconds. Which means that
> > > > > with deep queues you'll quickly start seeing timeouts if a drive
> > > > > temporarily is having issues keeping up (media errors, excessive spare
> > > > > track seeks, etc.).
> > > > >
> > > > > Well-behaved devices will return QF/TSF if they have transient resource
> > > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > > > > to reduce the number of I/Os in flight. This allows the drive to recover
> > > > > from its congested state and reduces the potential of application and
> > > > > filesystem timeouts.
> > > > >
> > > > This may even be a chance to revisit QoS / queue busy handling.
> > > > NVMe has this SQ head pointer mechanism which was supposed to handle
> > > > this kind of situations, but to my knowledge no-one has been
> > > > implementing it.
> > > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> > >
> > > We don't need that because we don't allocate enough tags to potentially
> > > wrap the tail past the head. If you can allocate a tag, the queue is not
> > > full. And convesely, no tag == queue full.
> > >
> > It's not a problem on our side.
> > It's a problem on the target/controller side.
> > The target/controller might have a need to throttle I/O (due to QoS settings
> > or competing resources from other hosts), but currently no means of
> > signalling that to the host.
> > Which, incidentally, is the underlying reason for the DNR handling
> > discussion we had; NetApp tried to model QoS by sending "Namespace not
> > ready" without the DNR bit set, which of course is a totally different
> > use-case as the typical 'Namespace not ready' response we get (with the DNR
> > bit set) when a namespace was unmapped.
> >
> > And that is where SQ head pointer updates comes in; it would allow the
> > controller to signal back to the host that it should hold off sending I/O
> > for a bit.
> > So this could / might be used for NVMe HDDs, too, which also might have a
> > need to signal back to the host that I/Os should be throttled...
>
> Okay, I see. I think this needs a new nvme AER notice as Martin
> suggested. The desired host behavior is simiilar to what we do with a
> "firmware activation notice" where we temporarily quiesce new requests
> and reset IO timeouts for previously dispatched requests. Perhaps tie
> this to the CSTS.PP register as well.
Hi all-

With regards to our discussion on queue depths, it's common knowledge
that an HDD choses commands from its internal command queue to
optimize performance. The HDD looks at things like the current
actuator position, current media rotational position, power
constraints, command age, etc to choose the best next command to
service. A large number of commands in the queue gives the HDD a
better selection of commands from which to choose to maximize
throughput/IOPS/etc but at the expense of the added latency due to
commands sitting in the queue.

NVMe doesn't allow us to pull commands randomly from the SQ, so the
HDD should attempt to fill its internal queue from the various SQs,
according to the SQ servicing policy, so it can have a large number of
commands to choose from for its internal command processing
optimization.

It seems to me that the host would want to limit the total number of
outstanding commands to an NVMe HDD for the same latency reasons they
are frequently limited today. If we assume the HDD would have a
relatively deep (perhaps 256) internal queue (which is deeper than
most latency-sensitive customers would want to run) then the SQ would
be empty most of the time. To me it seems that only when the host's
number of outstanding commands fell below the threshold should the
host add commands to the SQ. Since the drive internal command queue
would not be full, the HDD would immediately pull the commands from
the SQ and put them into its internal command queue.

I can't think of any advantage to running a deep SQ in this scenario.

When the host requests to delete a SQ the HDD should abort the
commands it is holding in its internal queue that came from the SQ to
be deleted, then delete the SQ.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-18 15:54                       ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-18 15:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, linux-nvme,
	Ming Lei, linux-block, Hannes Reinecke

On Fri, Feb 14, 2020 at 12:05 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote:
> > On 2/14/20 3:40 PM, Keith Busch wrote:
> > > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote:
> > > > On 2/13/20 5:17 AM, Martin K. Petersen wrote:
> > > > > People often artificially lower the queue depth to avoid timeouts. The
> > > > > default timeout is 30 seconds from an I/O is queued. However, many
> > > > > enterprise applications set the timeout to 3-5 seconds. Which means that
> > > > > with deep queues you'll quickly start seeing timeouts if a drive
> > > > > temporarily is having issues keeping up (media errors, excessive spare
> > > > > track seeks, etc.).
> > > > >
> > > > > Well-behaved devices will return QF/TSF if they have transient resource
> > > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack
> > > > > to reduce the number of I/Os in flight. This allows the drive to recover
> > > > > from its congested state and reduces the potential of application and
> > > > > filesystem timeouts.
> > > > >
> > > > This may even be a chance to revisit QoS / queue busy handling.
> > > > NVMe has this SQ head pointer mechanism which was supposed to handle
> > > > this kind of situations, but to my knowledge no-one has been
> > > > implementing it.
> > > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that.
> > >
> > > We don't need that because we don't allocate enough tags to potentially
> > > wrap the tail past the head. If you can allocate a tag, the queue is not
> > > full. And convesely, no tag == queue full.
> > >
> > It's not a problem on our side.
> > It's a problem on the target/controller side.
> > The target/controller might have a need to throttle I/O (due to QoS settings
> > or competing resources from other hosts), but currently no means of
> > signalling that to the host.
> > Which, incidentally, is the underlying reason for the DNR handling
> > discussion we had; NetApp tried to model QoS by sending "Namespace not
> > ready" without the DNR bit set, which of course is a totally different
> > use-case as the typical 'Namespace not ready' response we get (with the DNR
> > bit set) when a namespace was unmapped.
> >
> > And that is where SQ head pointer updates comes in; it would allow the
> > controller to signal back to the host that it should hold off sending I/O
> > for a bit.
> > So this could / might be used for NVMe HDDs, too, which also might have a
> > need to signal back to the host that I/Os should be throttled...
>
> Okay, I see. I think this needs a new nvme AER notice as Martin
> suggested. The desired host behavior is simiilar to what we do with a
> "firmware activation notice" where we temporarily quiesce new requests
> and reset IO timeouts for previously dispatched requests. Perhaps tie
> this to the CSTS.PP register as well.
Hi all-

With regards to our discussion on queue depths, it's common knowledge
that an HDD choses commands from its internal command queue to
optimize performance. The HDD looks at things like the current
actuator position, current media rotational position, power
constraints, command age, etc to choose the best next command to
service. A large number of commands in the queue gives the HDD a
better selection of commands from which to choose to maximize
throughput/IOPS/etc but at the expense of the added latency due to
commands sitting in the queue.

NVMe doesn't allow us to pull commands randomly from the SQ, so the
HDD should attempt to fill its internal queue from the various SQs,
according to the SQ servicing policy, so it can have a large number of
commands to choose from for its internal command processing
optimization.

It seems to me that the host would want to limit the total number of
outstanding commands to an NVMe HDD for the same latency reasons they
are frequently limited today. If we assume the HDD would have a
relatively deep (perhaps 256) internal queue (which is deeper than
most latency-sensitive customers would want to run) then the SQ would
be empty most of the time. To me it seems that only when the host's
number of outstanding commands fell below the threshold should the
host add commands to the SQ. Since the drive internal command queue
would not be full, the HDD would immediately pull the commands from
the SQ and put them into its internal command queue.

I can't think of any advantage to running a deep SQ in this scenario.

When the host requests to delete a SQ the HDD should abort the
commands it is holding in its internal queue that came from the SQ to
be deleted, then delete the SQ.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-18 15:54                       ` Tim Walker
@ 2020-02-18 17:41                         ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-18 17:41 UTC (permalink / raw)
  To: Tim Walker
  Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme

On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> With regards to our discussion on queue depths, it's common knowledge
> that an HDD choses commands from its internal command queue to
> optimize performance. The HDD looks at things like the current
> actuator position, current media rotational position, power
> constraints, command age, etc to choose the best next command to
> service. A large number of commands in the queue gives the HDD a
> better selection of commands from which to choose to maximize
> throughput/IOPS/etc but at the expense of the added latency due to
> commands sitting in the queue.
> 
> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> HDD should attempt to fill its internal queue from the various SQs,
> according to the SQ servicing policy, so it can have a large number of
> commands to choose from for its internal command processing
> optimization.

You don't need multiple queues for that. While the device has to fifo
fetch commands from a host's submission queue, it may reorder their
executuion and completion however it wants, which you can do with a
single queue.
 
> It seems to me that the host would want to limit the total number of
> outstanding commands to an NVMe HDD

The host shouldn't have to decide on limits. NVMe lets the device report
it's queue count and depth. It should the device's responsibility to
report appropriate values that maximize iops within your latency limits,
and the host will react accordingly.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-18 17:41                         ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-18 17:41 UTC (permalink / raw)
  To: Tim Walker
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, linux-nvme,
	Ming Lei, linux-block, Hannes Reinecke

On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> With regards to our discussion on queue depths, it's common knowledge
> that an HDD choses commands from its internal command queue to
> optimize performance. The HDD looks at things like the current
> actuator position, current media rotational position, power
> constraints, command age, etc to choose the best next command to
> service. A large number of commands in the queue gives the HDD a
> better selection of commands from which to choose to maximize
> throughput/IOPS/etc but at the expense of the added latency due to
> commands sitting in the queue.
> 
> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> HDD should attempt to fill its internal queue from the various SQs,
> according to the SQ servicing policy, so it can have a large number of
> commands to choose from for its internal command processing
> optimization.

You don't need multiple queues for that. While the device has to fifo
fetch commands from a host's submission queue, it may reorder their
executuion and completion however it wants, which you can do with a
single queue.
 
> It seems to me that the host would want to limit the total number of
> outstanding commands to an NVMe HDD

The host shouldn't have to decide on limits. NVMe lets the device report
it's queue count and depth. It should the device's responsibility to
report appropriate values that maximize iops within your latency limits,
and the host will react accordingly.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-18 17:41                         ` Keith Busch
@ 2020-02-18 17:52                           ` James Smart
  -1 siblings, 0 replies; 64+ messages in thread
From: James Smart @ 2020-02-18 17:52 UTC (permalink / raw)
  To: Keith Busch, Tim Walker
  Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei,
	linux-block, linux-scsi, linux-nvme



On 2/18/2020 9:41 AM, Keith Busch wrote:
> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>> With regards to our discussion on queue depths, it's common knowledge
>> that an HDD choses commands from its internal command queue to
>> optimize performance. The HDD looks at things like the current
>> actuator position, current media rotational position, power
>> constraints, command age, etc to choose the best next command to
>> service. A large number of commands in the queue gives the HDD a
>> better selection of commands from which to choose to maximize
>> throughput/IOPS/etc but at the expense of the added latency due to
>> commands sitting in the queue.
>>
>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>> HDD should attempt to fill its internal queue from the various SQs,
>> according to the SQ servicing policy, so it can have a large number of
>> commands to choose from for its internal command processing
>> optimization.
> You don't need multiple queues for that. While the device has to fifo
> fetch commands from a host's submission queue, it may reorder their
> executuion and completion however it wants, which you can do with a
> single queue.
>   
>> It seems to me that the host would want to limit the total number of
>> outstanding commands to an NVMe HDD
> The host shouldn't have to decide on limits. NVMe lets the device report
> it's queue count and depth. It should the device's responsibility to
> report appropriate values that maximize iops within your latency limits,
> and the host will react accordingly.

+1 on Keith's comments. Also, if a ns depth limit needs to be 
introduced, it should be via the nvme committee and then reported back 
as device attributes. Many of SCSI's problems where the protocol didn't 
solve it, especially in multi-initiator environments, which made all 
kinds of requirements/mish-mashes on host stacks and target behaviors. 
none of that should be repeated.

-- james


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-18 17:52                           ` James Smart
  0 siblings, 0 replies; 64+ messages in thread
From: James Smart @ 2020-02-18 17:52 UTC (permalink / raw)
  To: Keith Busch, Tim Walker
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, linux-nvme,
	Ming Lei, linux-block, Hannes Reinecke



On 2/18/2020 9:41 AM, Keith Busch wrote:
> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>> With regards to our discussion on queue depths, it's common knowledge
>> that an HDD choses commands from its internal command queue to
>> optimize performance. The HDD looks at things like the current
>> actuator position, current media rotational position, power
>> constraints, command age, etc to choose the best next command to
>> service. A large number of commands in the queue gives the HDD a
>> better selection of commands from which to choose to maximize
>> throughput/IOPS/etc but at the expense of the added latency due to
>> commands sitting in the queue.
>>
>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>> HDD should attempt to fill its internal queue from the various SQs,
>> according to the SQ servicing policy, so it can have a large number of
>> commands to choose from for its internal command processing
>> optimization.
> You don't need multiple queues for that. While the device has to fifo
> fetch commands from a host's submission queue, it may reorder their
> executuion and completion however it wants, which you can do with a
> single queue.
>   
>> It seems to me that the host would want to limit the total number of
>> outstanding commands to an NVMe HDD
> The host shouldn't have to decide on limits. NVMe lets the device report
> it's queue count and depth. It should the device's responsibility to
> report appropriate values that maximize iops within your latency limits,
> and the host will react accordingly.

+1 on Keith's comments. Also, if a ns depth limit needs to be 
introduced, it should be via the nvme committee and then reported back 
as device attributes. Many of SCSI's problems where the protocol didn't 
solve it, especially in multi-initiator environments, which made all 
kinds of requirements/mish-mashes on host stacks and target behaviors. 
none of that should be repeated.

-- james


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-18 17:41                         ` Keith Busch
@ 2020-02-19  1:31                           ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-19  1:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: Tim Walker, Hannes Reinecke, Martin K. Petersen, Damien Le Moal,
	linux-block, linux-scsi, linux-nvme

On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> > With regards to our discussion on queue depths, it's common knowledge
> > that an HDD choses commands from its internal command queue to
> > optimize performance. The HDD looks at things like the current
> > actuator position, current media rotational position, power
> > constraints, command age, etc to choose the best next command to
> > service. A large number of commands in the queue gives the HDD a
> > better selection of commands from which to choose to maximize
> > throughput/IOPS/etc but at the expense of the added latency due to
> > commands sitting in the queue.
> > 
> > NVMe doesn't allow us to pull commands randomly from the SQ, so the
> > HDD should attempt to fill its internal queue from the various SQs,
> > according to the SQ servicing policy, so it can have a large number of
> > commands to choose from for its internal command processing
> > optimization.
> 
> You don't need multiple queues for that. While the device has to fifo
> fetch commands from a host's submission queue, it may reorder their
> executuion and completion however it wants, which you can do with a
> single queue.
>  
> > It seems to me that the host would want to limit the total number of
> > outstanding commands to an NVMe HDD
> 
> The host shouldn't have to decide on limits. NVMe lets the device report
> it's queue count and depth. It should the device's responsibility to

Will NVMe HDD support multiple NS? If yes, this queue depth isn't
enough, given all NSs share this single host queue depth.

> report appropriate values that maximize iops within your latency limits,
> and the host will react accordingly.

Suppose NVMe HDD just wants to support single NS and there is single queue,
if the device just reports one host queue depth, block layer IO sort/merge
can only be done when there is device saturation feedback provided.

So, looks either NS queue depth or per-NS device saturation feedback
mechanism is needed, otherwise NVMe HDD may have to do internal IO
sort/merge.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19  1:31                           ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-19  1:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, Tim Walker,
	linux-nvme, linux-block, Hannes Reinecke

On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> > With regards to our discussion on queue depths, it's common knowledge
> > that an HDD choses commands from its internal command queue to
> > optimize performance. The HDD looks at things like the current
> > actuator position, current media rotational position, power
> > constraints, command age, etc to choose the best next command to
> > service. A large number of commands in the queue gives the HDD a
> > better selection of commands from which to choose to maximize
> > throughput/IOPS/etc but at the expense of the added latency due to
> > commands sitting in the queue.
> > 
> > NVMe doesn't allow us to pull commands randomly from the SQ, so the
> > HDD should attempt to fill its internal queue from the various SQs,
> > according to the SQ servicing policy, so it can have a large number of
> > commands to choose from for its internal command processing
> > optimization.
> 
> You don't need multiple queues for that. While the device has to fifo
> fetch commands from a host's submission queue, it may reorder their
> executuion and completion however it wants, which you can do with a
> single queue.
>  
> > It seems to me that the host would want to limit the total number of
> > outstanding commands to an NVMe HDD
> 
> The host shouldn't have to decide on limits. NVMe lets the device report
> it's queue count and depth. It should the device's responsibility to

Will NVMe HDD support multiple NS? If yes, this queue depth isn't
enough, given all NSs share this single host queue depth.

> report appropriate values that maximize iops within your latency limits,
> and the host will react accordingly.

Suppose NVMe HDD just wants to support single NS and there is single queue,
if the device just reports one host queue depth, block layer IO sort/merge
can only be done when there is device saturation feedback provided.

So, looks either NS queue depth or per-NS device saturation feedback
mechanism is needed, otherwise NVMe HDD may have to do internal IO
sort/merge.


Thanks,
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19  1:31                           ` Ming Lei
@ 2020-02-19  1:53                             ` Damien Le Moal
  -1 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-19  1:53 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Tim Walker, Hannes Reinecke, Martin K. Petersen, linux-block,
	linux-scsi, linux-nvme

On 2020/02/19 10:32, Ming Lei wrote:
> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>>> With regards to our discussion on queue depths, it's common knowledge
>>> that an HDD choses commands from its internal command queue to
>>> optimize performance. The HDD looks at things like the current
>>> actuator position, current media rotational position, power
>>> constraints, command age, etc to choose the best next command to
>>> service. A large number of commands in the queue gives the HDD a
>>> better selection of commands from which to choose to maximize
>>> throughput/IOPS/etc but at the expense of the added latency due to
>>> commands sitting in the queue.
>>>
>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>>> HDD should attempt to fill its internal queue from the various SQs,
>>> according to the SQ servicing policy, so it can have a large number of
>>> commands to choose from for its internal command processing
>>> optimization.
>>
>> You don't need multiple queues for that. While the device has to fifo
>> fetch commands from a host's submission queue, it may reorder their
>> executuion and completion however it wants, which you can do with a
>> single queue.
>>  
>>> It seems to me that the host would want to limit the total number of
>>> outstanding commands to an NVMe HDD
>>
>> The host shouldn't have to decide on limits. NVMe lets the device report
>> it's queue count and depth. It should the device's responsibility to
> 
> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> enough, given all NSs share this single host queue depth.
> 
>> report appropriate values that maximize iops within your latency limits,
>> and the host will react accordingly.
> 
> Suppose NVMe HDD just wants to support single NS and there is single queue,
> if the device just reports one host queue depth, block layer IO sort/merge
> can only be done when there is device saturation feedback provided.
> 
> So, looks either NS queue depth or per-NS device saturation feedback
> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> sort/merge.

SAS and SATA HDDs today already do internal IO reordering and merging, a
lot. That is partly why even with "none" set as the scheduler, you can see
iops increasing with QD used.

But yes, I think you do have a point with the saturation feedback. This may
be necessary for better scheduling host-side.

> 
> 
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19  1:53                             ` Damien Le Moal
  0 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-19  1:53 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Martin K. Petersen, linux-scsi, Tim Walker, linux-nvme,
	linux-block, Hannes Reinecke

On 2020/02/19 10:32, Ming Lei wrote:
> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>>> With regards to our discussion on queue depths, it's common knowledge
>>> that an HDD choses commands from its internal command queue to
>>> optimize performance. The HDD looks at things like the current
>>> actuator position, current media rotational position, power
>>> constraints, command age, etc to choose the best next command to
>>> service. A large number of commands in the queue gives the HDD a
>>> better selection of commands from which to choose to maximize
>>> throughput/IOPS/etc but at the expense of the added latency due to
>>> commands sitting in the queue.
>>>
>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>>> HDD should attempt to fill its internal queue from the various SQs,
>>> according to the SQ servicing policy, so it can have a large number of
>>> commands to choose from for its internal command processing
>>> optimization.
>>
>> You don't need multiple queues for that. While the device has to fifo
>> fetch commands from a host's submission queue, it may reorder their
>> executuion and completion however it wants, which you can do with a
>> single queue.
>>  
>>> It seems to me that the host would want to limit the total number of
>>> outstanding commands to an NVMe HDD
>>
>> The host shouldn't have to decide on limits. NVMe lets the device report
>> it's queue count and depth. It should the device's responsibility to
> 
> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> enough, given all NSs share this single host queue depth.
> 
>> report appropriate values that maximize iops within your latency limits,
>> and the host will react accordingly.
> 
> Suppose NVMe HDD just wants to support single NS and there is single queue,
> if the device just reports one host queue depth, block layer IO sort/merge
> can only be done when there is device saturation feedback provided.
> 
> So, looks either NS queue depth or per-NS device saturation feedback
> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> sort/merge.

SAS and SATA HDDs today already do internal IO reordering and merging, a
lot. That is partly why even with "none" set as the scheduler, you can see
iops increasing with QD used.

But yes, I think you do have a point with the saturation feedback. This may
be necessary for better scheduling host-side.

> 
> 
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19  1:53                             ` Damien Le Moal
@ 2020-02-19  2:15                               ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-19  2:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Keith Busch, Tim Walker, Hannes Reinecke, Martin K. Petersen,
	linux-block, linux-scsi, linux-nvme

On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> On 2020/02/19 10:32, Ming Lei wrote:
> > On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> >> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> >>> With regards to our discussion on queue depths, it's common knowledge
> >>> that an HDD choses commands from its internal command queue to
> >>> optimize performance. The HDD looks at things like the current
> >>> actuator position, current media rotational position, power
> >>> constraints, command age, etc to choose the best next command to
> >>> service. A large number of commands in the queue gives the HDD a
> >>> better selection of commands from which to choose to maximize
> >>> throughput/IOPS/etc but at the expense of the added latency due to
> >>> commands sitting in the queue.
> >>>
> >>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> >>> HDD should attempt to fill its internal queue from the various SQs,
> >>> according to the SQ servicing policy, so it can have a large number of
> >>> commands to choose from for its internal command processing
> >>> optimization.
> >>
> >> You don't need multiple queues for that. While the device has to fifo
> >> fetch commands from a host's submission queue, it may reorder their
> >> executuion and completion however it wants, which you can do with a
> >> single queue.
> >>  
> >>> It seems to me that the host would want to limit the total number of
> >>> outstanding commands to an NVMe HDD
> >>
> >> The host shouldn't have to decide on limits. NVMe lets the device report
> >> it's queue count and depth. It should the device's responsibility to
> > 
> > Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> > enough, given all NSs share this single host queue depth.
> > 
> >> report appropriate values that maximize iops within your latency limits,
> >> and the host will react accordingly.
> > 
> > Suppose NVMe HDD just wants to support single NS and there is single queue,
> > if the device just reports one host queue depth, block layer IO sort/merge
> > can only be done when there is device saturation feedback provided.
> > 
> > So, looks either NS queue depth or per-NS device saturation feedback
> > mechanism is needed, otherwise NVMe HDD may have to do internal IO
> > sort/merge.
> 
> SAS and SATA HDDs today already do internal IO reordering and merging, a
> lot. That is partly why even with "none" set as the scheduler, you can see
> iops increasing with QD used.

That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
from the beginning, but Tim said no, see:

https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84

It could be cheap for NVMe HDD to do that, given all queues/requests
just stay in system's RAM.

Also I guess internal IO sort/merge may not be good enough compared with
SW's implementation:

1) device internal queue depth is often low, and the participated requests won't
be enough many, but SW's scheduler queue depth is often 2 times of
device queue depth.

2) HDD drive doesn't have context info, so when concurrent IOs are run from
multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
doesn't address this case too, however the legacy IO path does consider that
via IOC batch.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19  2:15                               ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2020-02-19  2:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K. Petersen, linux-scsi, Tim Walker, linux-nvme,
	linux-block, Hannes Reinecke, Keith Busch

On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> On 2020/02/19 10:32, Ming Lei wrote:
> > On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> >> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> >>> With regards to our discussion on queue depths, it's common knowledge
> >>> that an HDD choses commands from its internal command queue to
> >>> optimize performance. The HDD looks at things like the current
> >>> actuator position, current media rotational position, power
> >>> constraints, command age, etc to choose the best next command to
> >>> service. A large number of commands in the queue gives the HDD a
> >>> better selection of commands from which to choose to maximize
> >>> throughput/IOPS/etc but at the expense of the added latency due to
> >>> commands sitting in the queue.
> >>>
> >>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> >>> HDD should attempt to fill its internal queue from the various SQs,
> >>> according to the SQ servicing policy, so it can have a large number of
> >>> commands to choose from for its internal command processing
> >>> optimization.
> >>
> >> You don't need multiple queues for that. While the device has to fifo
> >> fetch commands from a host's submission queue, it may reorder their
> >> executuion and completion however it wants, which you can do with a
> >> single queue.
> >>  
> >>> It seems to me that the host would want to limit the total number of
> >>> outstanding commands to an NVMe HDD
> >>
> >> The host shouldn't have to decide on limits. NVMe lets the device report
> >> it's queue count and depth. It should the device's responsibility to
> > 
> > Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> > enough, given all NSs share this single host queue depth.
> > 
> >> report appropriate values that maximize iops within your latency limits,
> >> and the host will react accordingly.
> > 
> > Suppose NVMe HDD just wants to support single NS and there is single queue,
> > if the device just reports one host queue depth, block layer IO sort/merge
> > can only be done when there is device saturation feedback provided.
> > 
> > So, looks either NS queue depth or per-NS device saturation feedback
> > mechanism is needed, otherwise NVMe HDD may have to do internal IO
> > sort/merge.
> 
> SAS and SATA HDDs today already do internal IO reordering and merging, a
> lot. That is partly why even with "none" set as the scheduler, you can see
> iops increasing with QD used.

That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
from the beginning, but Tim said no, see:

https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84

It could be cheap for NVMe HDD to do that, given all queues/requests
just stay in system's RAM.

Also I guess internal IO sort/merge may not be good enough compared with
SW's implementation:

1) device internal queue depth is often low, and the participated requests won't
be enough many, but SW's scheduler queue depth is often 2 times of
device queue depth.

2) HDD drive doesn't have context info, so when concurrent IOs are run from
multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
doesn't address this case too, however the legacy IO path does consider that
via IOC batch.


Thanks, 
Ming


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19  2:15                               ` Ming Lei
@ 2020-02-19  2:32                                 ` Damien Le Moal
  -1 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-19  2:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Tim Walker, Hannes Reinecke, Martin K. Petersen,
	linux-block, linux-scsi, linux-nvme

On 2020/02/19 11:16, Ming Lei wrote:
> On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
>> On 2020/02/19 10:32, Ming Lei wrote:
>>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
>>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>>>>> With regards to our discussion on queue depths, it's common knowledge
>>>>> that an HDD choses commands from its internal command queue to
>>>>> optimize performance. The HDD looks at things like the current
>>>>> actuator position, current media rotational position, power
>>>>> constraints, command age, etc to choose the best next command to
>>>>> service. A large number of commands in the queue gives the HDD a
>>>>> better selection of commands from which to choose to maximize
>>>>> throughput/IOPS/etc but at the expense of the added latency due to
>>>>> commands sitting in the queue.
>>>>>
>>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>>>>> HDD should attempt to fill its internal queue from the various SQs,
>>>>> according to the SQ servicing policy, so it can have a large number of
>>>>> commands to choose from for its internal command processing
>>>>> optimization.
>>>>
>>>> You don't need multiple queues for that. While the device has to fifo
>>>> fetch commands from a host's submission queue, it may reorder their
>>>> executuion and completion however it wants, which you can do with a
>>>> single queue.
>>>>  
>>>>> It seems to me that the host would want to limit the total number of
>>>>> outstanding commands to an NVMe HDD
>>>>
>>>> The host shouldn't have to decide on limits. NVMe lets the device report
>>>> it's queue count and depth. It should the device's responsibility to
>>>
>>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
>>> enough, given all NSs share this single host queue depth.
>>>
>>>> report appropriate values that maximize iops within your latency limits,
>>>> and the host will react accordingly.
>>>
>>> Suppose NVMe HDD just wants to support single NS and there is single queue,
>>> if the device just reports one host queue depth, block layer IO sort/merge
>>> can only be done when there is device saturation feedback provided.
>>>
>>> So, looks either NS queue depth or per-NS device saturation feedback
>>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
>>> sort/merge.
>>
>> SAS and SATA HDDs today already do internal IO reordering and merging, a
>> lot. That is partly why even with "none" set as the scheduler, you can see
>> iops increasing with QD used.
> 
> That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> from the beginning, but Tim said no, see:
> 
> https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84
> 
> It could be cheap for NVMe HDD to do that, given all queues/requests
> just stay in system's RAM.

Yes. Keith also commented on that. SQEs have to be removed in order from
the SQ, but that does not mean that the disk has to execute them in order.
So I do not think this is an issue.

> Also I guess internal IO sort/merge may not be good enough compared with
> SW's implementation:
> 
> 1) device internal queue depth is often low, and the participated requests won't
> be enough many, but SW's scheduler queue depth is often 2 times of
> device queue depth.

Drive internal QD can actually be quite large to accommodate for internal
house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
while simultaneously executing incoming user commands. These internal task
are often one of the reason for SAS drives to return QF at different
host-seen QD, and why in the end NVMe may need a mechanism similar to task
set full notifications in SAS.

> 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> doesn't address this case too, however the legacy IO path does consider that
> via IOC batch.>
> 
> Thanks, 
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19  2:32                                 ` Damien Le Moal
  0 siblings, 0 replies; 64+ messages in thread
From: Damien Le Moal @ 2020-02-19  2:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin K. Petersen, linux-scsi, Tim Walker, linux-nvme,
	linux-block, Hannes Reinecke, Keith Busch

On 2020/02/19 11:16, Ming Lei wrote:
> On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
>> On 2020/02/19 10:32, Ming Lei wrote:
>>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
>>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>>>>> With regards to our discussion on queue depths, it's common knowledge
>>>>> that an HDD choses commands from its internal command queue to
>>>>> optimize performance. The HDD looks at things like the current
>>>>> actuator position, current media rotational position, power
>>>>> constraints, command age, etc to choose the best next command to
>>>>> service. A large number of commands in the queue gives the HDD a
>>>>> better selection of commands from which to choose to maximize
>>>>> throughput/IOPS/etc but at the expense of the added latency due to
>>>>> commands sitting in the queue.
>>>>>
>>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>>>>> HDD should attempt to fill its internal queue from the various SQs,
>>>>> according to the SQ servicing policy, so it can have a large number of
>>>>> commands to choose from for its internal command processing
>>>>> optimization.
>>>>
>>>> You don't need multiple queues for that. While the device has to fifo
>>>> fetch commands from a host's submission queue, it may reorder their
>>>> executuion and completion however it wants, which you can do with a
>>>> single queue.
>>>>  
>>>>> It seems to me that the host would want to limit the total number of
>>>>> outstanding commands to an NVMe HDD
>>>>
>>>> The host shouldn't have to decide on limits. NVMe lets the device report
>>>> it's queue count and depth. It should the device's responsibility to
>>>
>>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
>>> enough, given all NSs share this single host queue depth.
>>>
>>>> report appropriate values that maximize iops within your latency limits,
>>>> and the host will react accordingly.
>>>
>>> Suppose NVMe HDD just wants to support single NS and there is single queue,
>>> if the device just reports one host queue depth, block layer IO sort/merge
>>> can only be done when there is device saturation feedback provided.
>>>
>>> So, looks either NS queue depth or per-NS device saturation feedback
>>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
>>> sort/merge.
>>
>> SAS and SATA HDDs today already do internal IO reordering and merging, a
>> lot. That is partly why even with "none" set as the scheduler, you can see
>> iops increasing with QD used.
> 
> That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> from the beginning, but Tim said no, see:
> 
> https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84
> 
> It could be cheap for NVMe HDD to do that, given all queues/requests
> just stay in system's RAM.

Yes. Keith also commented on that. SQEs have to be removed in order from
the SQ, but that does not mean that the disk has to execute them in order.
So I do not think this is an issue.

> Also I guess internal IO sort/merge may not be good enough compared with
> SW's implementation:
> 
> 1) device internal queue depth is often low, and the participated requests won't
> be enough many, but SW's scheduler queue depth is often 2 times of
> device queue depth.

Drive internal QD can actually be quite large to accommodate for internal
house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
while simultaneously executing incoming user commands. These internal task
are often one of the reason for SAS drives to return QF at different
host-seen QD, and why in the end NVMe may need a mechanism similar to task
set full notifications in SAS.

> 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> doesn't address this case too, however the legacy IO path does consider that
> via IOC batch.>
> 
> Thanks, 
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19  2:32                                 ` Damien Le Moal
@ 2020-02-19  2:56                                   ` Tim Walker
  -1 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-19  2:56 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ming Lei, Keith Busch, Hannes Reinecke, Martin K. Petersen,
	linux-block, linux-scsi, linux-nvme

On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2020/02/19 11:16, Ming Lei wrote:
> > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> >> On 2020/02/19 10:32, Ming Lei wrote:
> >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> >>>>> With regards to our discussion on queue depths, it's common knowledge
> >>>>> that an HDD choses commands from its internal command queue to
> >>>>> optimize performance. The HDD looks at things like the current
> >>>>> actuator position, current media rotational position, power
> >>>>> constraints, command age, etc to choose the best next command to
> >>>>> service. A large number of commands in the queue gives the HDD a
> >>>>> better selection of commands from which to choose to maximize
> >>>>> throughput/IOPS/etc but at the expense of the added latency due to
> >>>>> commands sitting in the queue.
> >>>>>
> >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> >>>>> HDD should attempt to fill its internal queue from the various SQs,
> >>>>> according to the SQ servicing policy, so it can have a large number of
> >>>>> commands to choose from for its internal command processing
> >>>>> optimization.
> >>>>
> >>>> You don't need multiple queues for that. While the device has to fifo
> >>>> fetch commands from a host's submission queue, it may reorder their
> >>>> executuion and completion however it wants, which you can do with a
> >>>> single queue.
> >>>>
> >>>>> It seems to me that the host would want to limit the total number of
> >>>>> outstanding commands to an NVMe HDD
> >>>>
> >>>> The host shouldn't have to decide on limits. NVMe lets the device report
> >>>> it's queue count and depth. It should the device's responsibility to
> >>>
> >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> >>> enough, given all NSs share this single host queue depth.
> >>>
> >>>> report appropriate values that maximize iops within your latency limits,
> >>>> and the host will react accordingly.
> >>>
> >>> Suppose NVMe HDD just wants to support single NS and there is single queue,
> >>> if the device just reports one host queue depth, block layer IO sort/merge
> >>> can only be done when there is device saturation feedback provided.
> >>>
> >>> So, looks either NS queue depth or per-NS device saturation feedback
> >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> >>> sort/merge.
> >>
> >> SAS and SATA HDDs today already do internal IO reordering and merging, a
> >> lot. That is partly why even with "none" set as the scheduler, you can see
> >> iops increasing with QD used.
> >
> > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> > from the beginning, but Tim said no, see:
> >
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dblock_20200212215251.GA25314-40ming.t460p_T_-23m2d0eff5ef8fcaced0f304180e571bb8fefc72e84&d=DwIFAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=nUNT2_IvSlbeY_25S516HctZv4od6WK6h2q2_C4Q8SY&s=TTxCbaBVGOCBZROb7fqSBDCe9wIZrYdBDSCW2TqrLzM&e=
> >
> > It could be cheap for NVMe HDD to do that, given all queues/requests
> > just stay in system's RAM.
>
> Yes. Keith also commented on that. SQEs have to be removed in order from
> the SQ, but that does not mean that the disk has to execute them in order.
> So I do not think this is an issue.
>
> > Also I guess internal IO sort/merge may not be good enough compared with
> > SW's implementation:
> >
> > 1) device internal queue depth is often low, and the participated requests won't
> > be enough many, but SW's scheduler queue depth is often 2 times of
> > device queue depth.
>
> Drive internal QD can actually be quite large to accommodate for internal
> house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
> while simultaneously executing incoming user commands. These internal task
> are often one of the reason for SAS drives to return QF at different
> host-seen QD, and why in the end NVMe may need a mechanism similar to task
> set full notifications in SAS.
>
> > 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> > doesn't address this case too, however the legacy IO path does consider that
> > via IOC batch.>
> >
> > Thanks,
> > Ming
> >
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
[sorry for the duplicate mailing - forgot about plain text!]

Hi Damien-

You're right. The HDD needs those commands in its internal queue to
sort and merge them, because commands are pulled from the SQ strictly
FIFO which precludes any sorting or merging within the SQ. That being
said, HDDs still work better with a good kernel scheduler to group
commands into HDD-friendly sequences. So it would be helpful if we
could devise a method to help the kernel sort/merge before loading the
commands into the SQ, just as we do with SCSI today.

Ming:
Regarding sorting across SQs, I mean to say these two things:
1. The HDD would not try and reach up into the SQs and choose the next
best command. I understand the SQs are FIFO, so that is why NVMe HDD
has to pull them into our internal queue for sorting and merging. Our
internal queue has historically been more than adequate (SAS-256,
SATA-32) to provide pretty good optimization without excessive command
latencies.

2. Also, I know NVMe specifically does not imply any completion order
within the SQ, but an NVMe HDD will likely honor the submission order
within any single SQ, but not try and correlate across multiple SQs
(if the host sets up multiple SQs). I believe this is different from
SSD. I think of this as being left over from SAS/SATA where we manage
overlapped commands by command order-of-arrival.

Many HDD customers spend a lot of time balancing workload and queue
depth to reach the IOPS/throughput targets they desire. It's not
straightforward since HDD command completion time is extremely
workload-sensitive. Some more sophisticated customers dynamically
control queue depth to keep all the command latencies within QOS. But
that requires extensive workload characterization, plus knowledge of
the upcoming workload, both of which makes it difficult for the HDD to
auto-tune its own queue depth. I'm really interested to have this
queue approach discussion at the conference - there seems to be areas
where we can improve on legacy behavior.

In all these scenarios, a single SQ/CQ pair is certainly more than
adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only
assist driver or system architects to separate traffic classes into
separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will
support it if desired.

-Tim
-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19  2:56                                   ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-19  2:56 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K. Petersen, linux-scsi, linux-nvme, Ming Lei,
	linux-block, Hannes Reinecke, Keith Busch

On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2020/02/19 11:16, Ming Lei wrote:
> > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> >> On 2020/02/19 10:32, Ming Lei wrote:
> >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> >>>>> With regards to our discussion on queue depths, it's common knowledge
> >>>>> that an HDD choses commands from its internal command queue to
> >>>>> optimize performance. The HDD looks at things like the current
> >>>>> actuator position, current media rotational position, power
> >>>>> constraints, command age, etc to choose the best next command to
> >>>>> service. A large number of commands in the queue gives the HDD a
> >>>>> better selection of commands from which to choose to maximize
> >>>>> throughput/IOPS/etc but at the expense of the added latency due to
> >>>>> commands sitting in the queue.
> >>>>>
> >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> >>>>> HDD should attempt to fill its internal queue from the various SQs,
> >>>>> according to the SQ servicing policy, so it can have a large number of
> >>>>> commands to choose from for its internal command processing
> >>>>> optimization.
> >>>>
> >>>> You don't need multiple queues for that. While the device has to fifo
> >>>> fetch commands from a host's submission queue, it may reorder their
> >>>> executuion and completion however it wants, which you can do with a
> >>>> single queue.
> >>>>
> >>>>> It seems to me that the host would want to limit the total number of
> >>>>> outstanding commands to an NVMe HDD
> >>>>
> >>>> The host shouldn't have to decide on limits. NVMe lets the device report
> >>>> it's queue count and depth. It should the device's responsibility to
> >>>
> >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> >>> enough, given all NSs share this single host queue depth.
> >>>
> >>>> report appropriate values that maximize iops within your latency limits,
> >>>> and the host will react accordingly.
> >>>
> >>> Suppose NVMe HDD just wants to support single NS and there is single queue,
> >>> if the device just reports one host queue depth, block layer IO sort/merge
> >>> can only be done when there is device saturation feedback provided.
> >>>
> >>> So, looks either NS queue depth or per-NS device saturation feedback
> >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> >>> sort/merge.
> >>
> >> SAS and SATA HDDs today already do internal IO reordering and merging, a
> >> lot. That is partly why even with "none" set as the scheduler, you can see
> >> iops increasing with QD used.
> >
> > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> > from the beginning, but Tim said no, see:
> >
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dblock_20200212215251.GA25314-40ming.t460p_T_-23m2d0eff5ef8fcaced0f304180e571bb8fefc72e84&d=DwIFAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=nUNT2_IvSlbeY_25S516HctZv4od6WK6h2q2_C4Q8SY&s=TTxCbaBVGOCBZROb7fqSBDCe9wIZrYdBDSCW2TqrLzM&e=
> >
> > It could be cheap for NVMe HDD to do that, given all queues/requests
> > just stay in system's RAM.
>
> Yes. Keith also commented on that. SQEs have to be removed in order from
> the SQ, but that does not mean that the disk has to execute them in order.
> So I do not think this is an issue.
>
> > Also I guess internal IO sort/merge may not be good enough compared with
> > SW's implementation:
> >
> > 1) device internal queue depth is often low, and the participated requests won't
> > be enough many, but SW's scheduler queue depth is often 2 times of
> > device queue depth.
>
> Drive internal QD can actually be quite large to accommodate for internal
> house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
> while simultaneously executing incoming user commands. These internal task
> are often one of the reason for SAS drives to return QF at different
> host-seen QD, and why in the end NVMe may need a mechanism similar to task
> set full notifications in SAS.
>
> > 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> > doesn't address this case too, however the legacy IO path does consider that
> > via IOC batch.>
> >
> > Thanks,
> > Ming
> >
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
[sorry for the duplicate mailing - forgot about plain text!]

Hi Damien-

You're right. The HDD needs those commands in its internal queue to
sort and merge them, because commands are pulled from the SQ strictly
FIFO which precludes any sorting or merging within the SQ. That being
said, HDDs still work better with a good kernel scheduler to group
commands into HDD-friendly sequences. So it would be helpful if we
could devise a method to help the kernel sort/merge before loading the
commands into the SQ, just as we do with SCSI today.

Ming:
Regarding sorting across SQs, I mean to say these two things:
1. The HDD would not try and reach up into the SQs and choose the next
best command. I understand the SQs are FIFO, so that is why NVMe HDD
has to pull them into our internal queue for sorting and merging. Our
internal queue has historically been more than adequate (SAS-256,
SATA-32) to provide pretty good optimization without excessive command
latencies.

2. Also, I know NVMe specifically does not imply any completion order
within the SQ, but an NVMe HDD will likely honor the submission order
within any single SQ, but not try and correlate across multiple SQs
(if the host sets up multiple SQs). I believe this is different from
SSD. I think of this as being left over from SAS/SATA where we manage
overlapped commands by command order-of-arrival.

Many HDD customers spend a lot of time balancing workload and queue
depth to reach the IOPS/throughput targets they desire. It's not
straightforward since HDD command completion time is extremely
workload-sensitive. Some more sophisticated customers dynamically
control queue depth to keep all the command latencies within QOS. But
that requires extensive workload characterization, plus knowledge of
the upcoming workload, both of which makes it difficult for the HDD to
auto-tune its own queue depth. I'm really interested to have this
queue approach discussion at the conference - there seems to be areas
where we can improve on legacy behavior.

In all these scenarios, a single SQ/CQ pair is certainly more than
adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only
assist driver or system architects to separate traffic classes into
separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will
support it if desired.

-Tim
-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19  2:56                                   ` Tim Walker
@ 2020-02-19 16:28                                     ` Tim Walker
  -1 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-19 16:28 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ming Lei, Keith Busch, Hannes Reinecke, Martin K. Petersen,
	linux-block, linux-scsi, linux-nvme

On Tue, Feb 18, 2020 at 9:56 PM Tim Walker <tim.t.walker@seagate.com> wrote:
>
> On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> >
> > On 2020/02/19 11:16, Ming Lei wrote:
> > > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> > >> On 2020/02/19 10:32, Ming Lei wrote:
> > >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> > >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> > >>>>> With regards to our discussion on queue depths, it's common knowledge
> > >>>>> that an HDD choses commands from its internal command queue to
> > >>>>> optimize performance. The HDD looks at things like the current
> > >>>>> actuator position, current media rotational position, power
> > >>>>> constraints, command age, etc to choose the best next command to
> > >>>>> service. A large number of commands in the queue gives the HDD a
> > >>>>> better selection of commands from which to choose to maximize
> > >>>>> throughput/IOPS/etc but at the expense of the added latency due to
> > >>>>> commands sitting in the queue.
> > >>>>>
> > >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> > >>>>> HDD should attempt to fill its internal queue from the various SQs,
> > >>>>> according to the SQ servicing policy, so it can have a large number of
> > >>>>> commands to choose from for its internal command processing
> > >>>>> optimization.
> > >>>>
> > >>>> You don't need multiple queues for that. While the device has to fifo
> > >>>> fetch commands from a host's submission queue, it may reorder their
> > >>>> executuion and completion however it wants, which you can do with a
> > >>>> single queue.
> > >>>>
> > >>>>> It seems to me that the host would want to limit the total number of
> > >>>>> outstanding commands to an NVMe HDD
> > >>>>
> > >>>> The host shouldn't have to decide on limits. NVMe lets the device report
> > >>>> it's queue count and depth. It should the device's responsibility to
> > >>>
> > >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> > >>> enough, given all NSs share this single host queue depth.
> > >>>
> > >>>> report appropriate values that maximize iops within your latency limits,
> > >>>> and the host will react accordingly.
> > >>>
> > >>> Suppose NVMe HDD just wants to support single NS and there is single queue,
> > >>> if the device just reports one host queue depth, block layer IO sort/merge
> > >>> can only be done when there is device saturation feedback provided.
> > >>>
> > >>> So, looks either NS queue depth or per-NS device saturation feedback
> > >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> > >>> sort/merge.
> > >>
> > >> SAS and SATA HDDs today already do internal IO reordering and merging, a
> > >> lot. That is partly why even with "none" set as the scheduler, you can see
> > >> iops increasing with QD used.
> > >
> > > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> > > from the beginning, but Tim said no, see:
> > >
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dblock_20200212215251.GA25314-40ming.t460p_T_-23m2d0eff5ef8fcaced0f304180e571bb8fefc72e84&d=DwIFAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=nUNT2_IvSlbeY_25S516HctZv4od6WK6h2q2_C4Q8SY&s=TTxCbaBVGOCBZROb7fqSBDCe9wIZrYdBDSCW2TqrLzM&e=
> > >
> > > It could be cheap for NVMe HDD to do that, given all queues/requests
> > > just stay in system's RAM.
> >
> > Yes. Keith also commented on that. SQEs have to be removed in order from
> > the SQ, but that does not mean that the disk has to execute them in order.
> > So I do not think this is an issue.
> >
> > > Also I guess internal IO sort/merge may not be good enough compared with
> > > SW's implementation:
> > >
> > > 1) device internal queue depth is often low, and the participated requests won't
> > > be enough many, but SW's scheduler queue depth is often 2 times of
> > > device queue depth.
> >
> > Drive internal QD can actually be quite large to accommodate for internal
> > house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
> > while simultaneously executing incoming user commands. These internal task
> > are often one of the reason for SAS drives to return QF at different
> > host-seen QD, and why in the end NVMe may need a mechanism similar to task
> > set full notifications in SAS.
> >
> > > 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> > > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> > > doesn't address this case too, however the legacy IO path does consider that
> > > via IOC batch.>
> > >
> > > Thanks,
> > > Ming
> > >
> > >
> >
> >
> > --
> > Damien Le Moal
> > Western Digital Research
> [sorry for the duplicate mailing - forgot about plain text!]
>
> Hi Damien-
>
> You're right. The HDD needs those commands in its internal queue to
> sort and merge them, because commands are pulled from the SQ strictly
> FIFO which precludes any sorting or merging within the SQ. That being
> said, HDDs still work better with a good kernel scheduler to group
> commands into HDD-friendly sequences. So it would be helpful if we
> could devise a method to help the kernel sort/merge before loading the
> commands into the SQ, just as we do with SCSI today.
>
> Ming:
> Regarding sorting across SQs, I mean to say these two things:
> 1. The HDD would not try and reach up into the SQs and choose the next
> best command. I understand the SQs are FIFO, so that is why NVMe HDD
> has to pull them into our internal queue for sorting and merging. Our
> internal queue has historically been more than adequate (SAS-256,
> SATA-32) to provide pretty good optimization without excessive command
> latencies.
>
> 2. Also, I know NVMe specifically does not imply any completion order
> within the SQ, but an NVMe HDD will likely honor the submission order
> within any single SQ, but not try and correlate across multiple SQs
> (if the host sets up multiple SQs). I believe this is different from
> SSD. I think of this as being left over from SAS/SATA where we manage
> overlapped commands by command order-of-arrival.
>
> Many HDD customers spend a lot of time balancing workload and queue
> depth to reach the IOPS/throughput targets they desire. It's not
> straightforward since HDD command completion time is extremely
> workload-sensitive. Some more sophisticated customers dynamically
> control queue depth to keep all the command latencies within QOS. But
> that requires extensive workload characterization, plus knowledge of
> the upcoming workload, both of which makes it difficult for the HDD to
> auto-tune its own queue depth. I'm really interested to have this
> queue approach discussion at the conference - there seems to be areas
> where we can improve on legacy behavior.
>
> In all these scenarios, a single SQ/CQ pair is certainly more than
> adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only
> assist driver or system architects to separate traffic classes into
> separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will
> support it if desired.
>
> -Tim
> --
> Tim Walker
> Product Design Systems Engineering, Seagate Technology
> (303) 775-3770

Hi Ming-

>Will NVMe HDD support multiple NS?

At this point it doesn't seem like an NVMe HDD could benefit from
multiple namespaces. However, a multiple actuator HDD can present the
actuators as independent channels that are capable of independent
media access. It seems that we would want them on separate namespaces,
or sets. I'd like to discuss the pros and cons of each, and which
would be better for system integration.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19 16:28                                     ` Tim Walker
  0 siblings, 0 replies; 64+ messages in thread
From: Tim Walker @ 2020-02-19 16:28 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K. Petersen, linux-scsi, linux-nvme, Ming Lei,
	linux-block, Hannes Reinecke, Keith Busch

On Tue, Feb 18, 2020 at 9:56 PM Tim Walker <tim.t.walker@seagate.com> wrote:
>
> On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> >
> > On 2020/02/19 11:16, Ming Lei wrote:
> > > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote:
> > >> On 2020/02/19 10:32, Ming Lei wrote:
> > >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
> > >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
> > >>>>> With regards to our discussion on queue depths, it's common knowledge
> > >>>>> that an HDD choses commands from its internal command queue to
> > >>>>> optimize performance. The HDD looks at things like the current
> > >>>>> actuator position, current media rotational position, power
> > >>>>> constraints, command age, etc to choose the best next command to
> > >>>>> service. A large number of commands in the queue gives the HDD a
> > >>>>> better selection of commands from which to choose to maximize
> > >>>>> throughput/IOPS/etc but at the expense of the added latency due to
> > >>>>> commands sitting in the queue.
> > >>>>>
> > >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
> > >>>>> HDD should attempt to fill its internal queue from the various SQs,
> > >>>>> according to the SQ servicing policy, so it can have a large number of
> > >>>>> commands to choose from for its internal command processing
> > >>>>> optimization.
> > >>>>
> > >>>> You don't need multiple queues for that. While the device has to fifo
> > >>>> fetch commands from a host's submission queue, it may reorder their
> > >>>> executuion and completion however it wants, which you can do with a
> > >>>> single queue.
> > >>>>
> > >>>>> It seems to me that the host would want to limit the total number of
> > >>>>> outstanding commands to an NVMe HDD
> > >>>>
> > >>>> The host shouldn't have to decide on limits. NVMe lets the device report
> > >>>> it's queue count and depth. It should the device's responsibility to
> > >>>
> > >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> > >>> enough, given all NSs share this single host queue depth.
> > >>>
> > >>>> report appropriate values that maximize iops within your latency limits,
> > >>>> and the host will react accordingly.
> > >>>
> > >>> Suppose NVMe HDD just wants to support single NS and there is single queue,
> > >>> if the device just reports one host queue depth, block layer IO sort/merge
> > >>> can only be done when there is device saturation feedback provided.
> > >>>
> > >>> So, looks either NS queue depth or per-NS device saturation feedback
> > >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO
> > >>> sort/merge.
> > >>
> > >> SAS and SATA HDDs today already do internal IO reordering and merging, a
> > >> lot. That is partly why even with "none" set as the scheduler, you can see
> > >> iops increasing with QD used.
> > >
> > > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs
> > > from the beginning, but Tim said no, see:
> > >
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dblock_20200212215251.GA25314-40ming.t460p_T_-23m2d0eff5ef8fcaced0f304180e571bb8fefc72e84&d=DwIFAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=nUNT2_IvSlbeY_25S516HctZv4od6WK6h2q2_C4Q8SY&s=TTxCbaBVGOCBZROb7fqSBDCe9wIZrYdBDSCW2TqrLzM&e=
> > >
> > > It could be cheap for NVMe HDD to do that, given all queues/requests
> > > just stay in system's RAM.
> >
> > Yes. Keith also commented on that. SQEs have to be removed in order from
> > the SQ, but that does not mean that the disk has to execute them in order.
> > So I do not think this is an issue.
> >
> > > Also I guess internal IO sort/merge may not be good enough compared with
> > > SW's implementation:
> > >
> > > 1) device internal queue depth is often low, and the participated requests won't
> > > be enough many, but SW's scheduler queue depth is often 2 times of
> > > device queue depth.
> >
> > Drive internal QD can actually be quite large to accommodate for internal
> > house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc)
> > while simultaneously executing incoming user commands. These internal task
> > are often one of the reason for SAS drives to return QF at different
> > host-seen QD, and why in the end NVMe may need a mechanism similar to task
> > set full notifications in SAS.
> >
> > > 2) HDD drive doesn't have context info, so when concurrent IOs are run from
> > > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq
> > > doesn't address this case too, however the legacy IO path does consider that
> > > via IOC batch.>
> > >
> > > Thanks,
> > > Ming
> > >
> > >
> >
> >
> > --
> > Damien Le Moal
> > Western Digital Research
> [sorry for the duplicate mailing - forgot about plain text!]
>
> Hi Damien-
>
> You're right. The HDD needs those commands in its internal queue to
> sort and merge them, because commands are pulled from the SQ strictly
> FIFO which precludes any sorting or merging within the SQ. That being
> said, HDDs still work better with a good kernel scheduler to group
> commands into HDD-friendly sequences. So it would be helpful if we
> could devise a method to help the kernel sort/merge before loading the
> commands into the SQ, just as we do with SCSI today.
>
> Ming:
> Regarding sorting across SQs, I mean to say these two things:
> 1. The HDD would not try and reach up into the SQs and choose the next
> best command. I understand the SQs are FIFO, so that is why NVMe HDD
> has to pull them into our internal queue for sorting and merging. Our
> internal queue has historically been more than adequate (SAS-256,
> SATA-32) to provide pretty good optimization without excessive command
> latencies.
>
> 2. Also, I know NVMe specifically does not imply any completion order
> within the SQ, but an NVMe HDD will likely honor the submission order
> within any single SQ, but not try and correlate across multiple SQs
> (if the host sets up multiple SQs). I believe this is different from
> SSD. I think of this as being left over from SAS/SATA where we manage
> overlapped commands by command order-of-arrival.
>
> Many HDD customers spend a lot of time balancing workload and queue
> depth to reach the IOPS/throughput targets they desire. It's not
> straightforward since HDD command completion time is extremely
> workload-sensitive. Some more sophisticated customers dynamically
> control queue depth to keep all the command latencies within QOS. But
> that requires extensive workload characterization, plus knowledge of
> the upcoming workload, both of which makes it difficult for the HDD to
> auto-tune its own queue depth. I'm really interested to have this
> queue approach discussion at the conference - there seems to be areas
> where we can improve on legacy behavior.
>
> In all these scenarios, a single SQ/CQ pair is certainly more than
> adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only
> assist driver or system architects to separate traffic classes into
> separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will
> support it if desired.
>
> -Tim
> --
> Tim Walker
> Product Design Systems Engineering, Seagate Technology
> (303) 775-3770

Hi Ming-

>Will NVMe HDD support multiple NS?

At this point it doesn't seem like an NVMe HDD could benefit from
multiple namespaces. However, a multiple actuator HDD can present the
actuators as independent channels that are capable of independent
media access. It seems that we would want them on separate namespaces,
or sets. I'd like to discuss the pros and cons of each, and which
would be better for system integration.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-19 16:28                                     ` Tim Walker
@ 2020-02-19 20:50                                       ` Keith Busch
  -1 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-19 20:50 UTC (permalink / raw)
  To: Tim Walker
  Cc: Damien Le Moal, Ming Lei, Hannes Reinecke, Martin K. Petersen,
	linux-block, linux-scsi, linux-nvme

On Wed, Feb 19, 2020 at 11:28:46AM -0500, Tim Walker wrote:
> Hi Ming-
> 
> >Will NVMe HDD support multiple NS?
> 
> At this point it doesn't seem like an NVMe HDD could benefit from
> multiple namespaces. However, a multiple actuator HDD can present the
> actuators as independent channels that are capable of independent
> media access. It seems that we would want them on separate namespaces,
> or sets. I'd like to discuss the pros and cons of each, and which
> would be better for system integration.

If NVM Sets are not implemented, the host is not aware of resource
separatation for each namespace.

If you implement NVM Sets, two namespaces in different sets tells the host
that the device has a backend resource partition (logical or physical)
such that processing commands for one namespace will not affect the
processing capabilities of the other. Sets define "noisy neighbor"
domains.

Dual actuators sound like you have independent resources appropriate to
report as NVM Sets, but that may depend on other implementation details.

The NVMe specification does not go far enough, though, since IO queues
are always a shared resource. The host may implement a different IO
queue policy such that they're not shared (you'd need at least one IO
queue per set), but we don't currently do that.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-19 20:50                                       ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2020-02-19 20:50 UTC (permalink / raw)
  To: Tim Walker
  Cc: Damien Le Moal, Martin K. Petersen, linux-scsi, linux-nvme,
	Ming Lei, linux-block, Hannes Reinecke

On Wed, Feb 19, 2020 at 11:28:46AM -0500, Tim Walker wrote:
> Hi Ming-
> 
> >Will NVMe HDD support multiple NS?
> 
> At this point it doesn't seem like an NVMe HDD could benefit from
> multiple namespaces. However, a multiple actuator HDD can present the
> actuators as independent channels that are capable of independent
> media access. It seems that we would want them on separate namespaces,
> or sets. I'd like to discuss the pros and cons of each, and which
> would be better for system integration.

If NVM Sets are not implemented, the host is not aware of resource
separatation for each namespace.

If you implement NVM Sets, two namespaces in different sets tells the host
that the device has a backend resource partition (logical or physical)
such that processing commands for one namespace will not affect the
processing capabilities of the other. Sets define "noisy neighbor"
domains.

Dual actuators sound like you have independent resources appropriate to
report as NVM Sets, but that may depend on other implementation details.

The NVMe specification does not go far enough, though, since IO queues
are always a shared resource. The host may implement a different IO
queue policy such that they're not shared (you'd need at least one IO
queue per set), but we don't currently do that.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2020-02-19 20:51 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-10 19:20 [LSF/MM/BPF TOPIC] NVMe HDD Tim Walker
2020-02-10 19:20 ` Tim Walker
2020-02-10 20:43 ` Keith Busch
2020-02-10 20:43   ` Keith Busch
2020-02-10 22:25   ` Finn Thain
2020-02-10 22:25     ` Finn Thain
2020-02-11 12:28 ` Ming Lei
2020-02-11 12:28   ` Ming Lei
2020-02-11 19:01   ` Tim Walker
2020-02-11 19:01     ` Tim Walker
2020-02-12  1:47     ` Damien Le Moal
2020-02-12  1:47       ` Damien Le Moal
2020-02-12 22:03       ` Ming Lei
2020-02-12 22:03         ` Ming Lei
2020-02-13  2:40         ` Damien Le Moal
2020-02-13  2:40           ` Damien Le Moal
2020-02-13  7:53           ` Ming Lei
2020-02-13  7:53             ` Ming Lei
2020-02-13  8:24             ` Damien Le Moal
2020-02-13  8:24               ` Damien Le Moal
2020-02-13  8:34               ` Ming Lei
2020-02-13  8:34                 ` Ming Lei
2020-02-13 16:30                 ` Keith Busch
2020-02-13 16:30                   ` Keith Busch
2020-02-14  0:40                   ` Ming Lei
2020-02-14  0:40                     ` Ming Lei
2020-02-13  3:02       ` Martin K. Petersen
2020-02-13  3:02         ` Martin K. Petersen
2020-02-13  3:12         ` Tim Walker
2020-02-13  3:12           ` Tim Walker
2020-02-13  4:17           ` Martin K. Petersen
2020-02-13  4:17             ` Martin K. Petersen
2020-02-14  7:32             ` Hannes Reinecke
2020-02-14  7:32               ` Hannes Reinecke
2020-02-14 14:40               ` Keith Busch
2020-02-14 14:40                 ` Keith Busch
2020-02-14 16:04                 ` Hannes Reinecke
2020-02-14 16:04                   ` Hannes Reinecke
2020-02-14 17:05                   ` Keith Busch
2020-02-14 17:05                     ` Keith Busch
2020-02-18 15:54                     ` Tim Walker
2020-02-18 15:54                       ` Tim Walker
2020-02-18 17:41                       ` Keith Busch
2020-02-18 17:41                         ` Keith Busch
2020-02-18 17:52                         ` James Smart
2020-02-18 17:52                           ` James Smart
2020-02-19  1:31                         ` Ming Lei
2020-02-19  1:31                           ` Ming Lei
2020-02-19  1:53                           ` Damien Le Moal
2020-02-19  1:53                             ` Damien Le Moal
2020-02-19  2:15                             ` Ming Lei
2020-02-19  2:15                               ` Ming Lei
2020-02-19  2:32                               ` Damien Le Moal
2020-02-19  2:32                                 ` Damien Le Moal
2020-02-19  2:56                                 ` Tim Walker
2020-02-19  2:56                                   ` Tim Walker
2020-02-19 16:28                                   ` Tim Walker
2020-02-19 16:28                                     ` Tim Walker
2020-02-19 20:50                                     ` Keith Busch
2020-02-19 20:50                                       ` Keith Busch
2020-02-14  0:35         ` Ming Lei
2020-02-14  0:35           ` Ming Lei
2020-02-12 21:52     ` Ming Lei
2020-02-12 21:52       ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.