* [LSF/MM/BPF TOPIC] NVMe HDD
@ 2020-02-10 19:20 Tim Walker
  2020-02-10 20:43 ` Keith Busch
  2020-02-11 12:28 ` Ming Lei
  0 siblings, 2 replies; 32+ messages in thread
From: Tim Walker @ 2020-02-10 19:20 UTC (permalink / raw)
To: linux-block, linux-scsi, linux-nvme

Background:

The NVMe specification has hardened over the decade and NVMe devices
are now well integrated into our customers’ systems. As we look forward,
moving HDDs to the NVMe command set eliminates the SAS IOC and driver
stack, consolidating on a single access method for rotational and
static storage technologies. PCIe-NVMe offers near-SATA interface
costs, features and performance suitable for high-cap HDDs, and
optimal interoperability for storage automation, tiering, and
management. We will share some early conceptual results and propose
salient design goals and challenges surrounding an NVMe HDD.

Discussion Proposal:

We’d like to share our views and solicit input on:

-What Linux storage stack assumptions do we need to be aware of as we
develop these devices with drastically different performance
characteristics than traditional NAND? For example, what scheduler or
device driver level changes will be needed to integrate NVMe HDDs?

-Are there NVMe feature trade-offs that make sense for HDDs that won’t
break the HDD-SSD interoperability goals?

-How would upcoming multi-actuator HDDs impact NVMe?

Regards,
Tim Walker

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 19:20 [LSF/MM/BPF TOPIC] NVMe HDD Tim Walker
@ 2020-02-10 20:43 ` Keith Busch
  2020-02-10 22:25   ` Finn Thain
  2020-02-11 12:28 ` Ming Lei
  1 sibling, 1 reply; 32+ messages in thread
From: Keith Busch @ 2020-02-10 20:43 UTC (permalink / raw)
To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT
(non-rational, i.e. ssd), on all nvme namespace's request_queue flags.
We need the specification to define a capability bit or field
associated with the namespace to tell the driver otherwise; then we can
propagate that information up to the block layer.

Even without that, an otherwise spec-compliant HDD should function as
an nvme device with existing software, but I would be interested to
hear additional ideas or feature gaps with other protocols that should
be considered in order to make an nvme hdd work well.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 20:43 ` Keith Busch
@ 2020-02-10 22:25   ` Finn Thain
  0 siblings, 0 replies; 32+ messages in thread
From: Finn Thain @ 2020-02-10 22:25 UTC (permalink / raw)
To: Keith Busch; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme

On Mon, 10 Feb 2020, Keith Busch wrote:

> Right now the nvme driver unconditionally sets QUEUE_FLAG_NONROT
> (non-rational, i.e. ssd), on all nvme namespace's request_queue flags.

I agree -- the standard nomenclature is not rational ;-)

Air-cooled is not "solid state". Any round-robin algorithm is
"rotational". No expensive array is a "R.A.I.D.". There's no "S.C.S.I."
on a large system...

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD
  2020-02-10 19:20 [LSF/MM/BPF TOPIC] NVMe HDD Tim Walker
  2020-02-10 20:43 ` Keith Busch
@ 2020-02-11 12:28 ` Ming Lei
  2020-02-11 19:01   ` Tim Walker
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2020-02-11 12:28 UTC (permalink / raw)
To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme

On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> Background:
>
> NVMe specification has hardened over the decade and now NVMe devices
> are well integrated into our customers’ systems. As we look forward,
> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> stack, consolidating on a single access method for rotational and
> static storage technologies. PCIe-NVMe offers near-SATA interface
> costs, features and performance suitable for high-cap HDDs, and
> optimal interoperability for storage automation, tiering, and
> management. We will share some early conceptual results and proposed
> salient design goals and challenges surrounding an NVMe HDD.

HDD performance is very sensitive to IO order. Could you provide some
background info about NVMe HDD? Such as:

- number of hw queues
- hw queue depth
- will NVMe sort/merge IO among all SQs or not?

>
> Discussion Proposal:
>
> We’d like to share our views and solicit input on:
>
> -What Linux storage stack assumptions do we need to be aware of as we
> develop these devices with drastically different performance
> characteristics than traditional NAND? For example, what schedular or
> device driver level changes will be needed to integrate NVMe HDDs?

IO merge is often important for HDD. IO merge is usually triggered when
.queue_rq() returns STS_RESOURCE, and so far this condition won't be
triggered for NVMe SSD.

Also blk-mq kills BDI queue congestion and ioc batching, and causes
writeback performance regression[1][2].

What I am thinking is whether we need to switch to an independent IO
path for handling SSD and HDD IO, given the two media are so different
from a performance viewpoint.

[1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
[2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-11 12:28 ` Ming Lei @ 2020-02-11 19:01 ` Tim Walker 2020-02-12 1:47 ` Damien Le Moal 2020-02-12 21:52 ` Ming Lei 0 siblings, 2 replies; 32+ messages in thread From: Tim Walker @ 2020-02-11 19:01 UTC (permalink / raw) To: Ming Lei; +Cc: linux-block, linux-scsi, linux-nvme On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: > > On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: > > Background: > > > > NVMe specification has hardened over the decade and now NVMe devices > > are well integrated into our customers’ systems. As we look forward, > > moving HDDs to the NVMe command set eliminates the SAS IOC and driver > > stack, consolidating on a single access method for rotational and > > static storage technologies. PCIe-NVMe offers near-SATA interface > > costs, features and performance suitable for high-cap HDDs, and > > optimal interoperability for storage automation, tiering, and > > management. We will share some early conceptual results and proposed > > salient design goals and challenges surrounding an NVMe HDD. > > HDD. performance is very sensitive to IO order. Could you provide some > background info about NVMe HDD? Such as: > > - number of hw queues > - hw queue depth > - will NVMe sort/merge IO among all SQs or not? > > > > > > > Discussion Proposal: > > > > We’d like to share our views and solicit input on: > > > > -What Linux storage stack assumptions do we need to be aware of as we > > develop these devices with drastically different performance > > characteristics than traditional NAND? For example, what schedular or > > device driver level changes will be needed to integrate NVMe HDDs? > > IO merge is often important for HDD. IO merge is usually triggered when > .queue_rq() returns STS_RESOURCE, so far this condition won't be > triggered for NVMe SSD. > > Also blk-mq kills BDI queue congestion and ioc batching, and causes > writeback performance regression[1][2]. 
> 
> What I am thinking is that if we need to switch to use independent IO
> path for handling SSD and HDD. IO, given the two mediums are so
> different from performance viewpoint.
> 
> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
> 
> Thanks,
> Ming
> 

I would expect the drive would support a reasonable number of queues
and a relatively deep queue depth, more in line with NVMe practices
than SAS HDDs' typical 128. But it probably doesn't make sense to
queue up thousands of commands on something as slow as an HDD, and
many customers keep queues < 32 for latency management.

Merge and elevator are important to HDD performance. I don't believe
NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
within a SQ without driving large differences between SSD & HDD data
paths?

Thanks,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-11 19:01 ` Tim Walker @ 2020-02-12 1:47 ` Damien Le Moal 2020-02-12 22:03 ` Ming Lei 2020-02-13 3:02 ` Martin K. Petersen 2020-02-12 21:52 ` Ming Lei 1 sibling, 2 replies; 32+ messages in thread From: Damien Le Moal @ 2020-02-12 1:47 UTC (permalink / raw) To: Tim Walker, Ming Lei; +Cc: linux-block, linux-scsi, linux-nvme On 2020/02/12 4:01, Tim Walker wrote: > On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: >> >> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: >>> Background: >>> >>> NVMe specification has hardened over the decade and now NVMe devices >>> are well integrated into our customers’ systems. As we look forward, >>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver >>> stack, consolidating on a single access method for rotational and >>> static storage technologies. PCIe-NVMe offers near-SATA interface >>> costs, features and performance suitable for high-cap HDDs, and >>> optimal interoperability for storage automation, tiering, and >>> management. We will share some early conceptual results and proposed >>> salient design goals and challenges surrounding an NVMe HDD. >> >> HDD. performance is very sensitive to IO order. Could you provide some >> background info about NVMe HDD? Such as: >> >> - number of hw queues >> - hw queue depth >> - will NVMe sort/merge IO among all SQs or not? >> >>> >>> >>> Discussion Proposal: >>> >>> We’d like to share our views and solicit input on: >>> >>> -What Linux storage stack assumptions do we need to be aware of as we >>> develop these devices with drastically different performance >>> characteristics than traditional NAND? For example, what schedular or >>> device driver level changes will be needed to integrate NVMe HDDs? >> >> IO merge is often important for HDD. IO merge is usually triggered when >> .queue_rq() returns STS_RESOURCE, so far this condition won't be >> triggered for NVMe SSD. 
>> 
>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>> writeback performance regression[1][2].
>> 
>> What I am thinking is that if we need to switch to use independent IO
>> path for handling SSD and HDD. IO, given the two mediums are so
>> different from performance viewpoint.
>> 
>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
>> 
>> Thanks,
>> Ming
>> 
> 
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

Exposing an HDD through multiple queues, each with a high queue depth,
is simply asking for trouble. Commands will end up spending so much
time sitting in the queues that they will time out. This can already be
observed with the smartpqi SAS HBA, which exposes single drives as
multiqueue block devices with high queue depth. Exercising these drives
heavily leads to thousands of commands being queued and to timeouts. It
is fairly easy to trigger this without a manual change to the QD. This
has been on my to-do list of fixes for some time now (lacking time to
do it).

NVMe HDDs need to have an interface setup that matches their speed,
that is, something like a SAS interface: *single* queue pair with a max
QD of 256 or less, depending on what the drive can take. There is no
TASK_SET_FULL notification on NVMe, so throttling has to come from the
max QD of the SQ, which the drive will advertise to the host.

> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

As far as I know, there is no merging going on once requests are passed
to the driver and added to an SQ, so this is beside the point.

The current default block scheduler for NVMe SSDs is "none". This is
decided based on the number of queues of the device. NVMe drives that
have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in
their request queue can fall back to the default spinning-rust
mq-deadline elevator. That will achieve the command merging and LBA
ordering needed for good performance on HDDs.

NVMe specs will need an update to have a "NONROT" (non-rotational) bit
in the identify data for all this to fit well in the current stack.

> 
> Thanks,
> -Tim
> 

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-12 1:47 ` Damien Le Moal @ 2020-02-12 22:03 ` Ming Lei 2020-02-13 2:40 ` Damien Le Moal 2020-02-13 3:02 ` Martin K. Petersen 1 sibling, 1 reply; 32+ messages in thread From: Ming Lei @ 2020-02-12 22:03 UTC (permalink / raw) To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote: > On 2020/02/12 4:01, Tim Walker wrote: > > On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: > >> > >> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: > >>> Background: > >>> > >>> NVMe specification has hardened over the decade and now NVMe devices > >>> are well integrated into our customers’ systems. As we look forward, > >>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver > >>> stack, consolidating on a single access method for rotational and > >>> static storage technologies. PCIe-NVMe offers near-SATA interface > >>> costs, features and performance suitable for high-cap HDDs, and > >>> optimal interoperability for storage automation, tiering, and > >>> management. We will share some early conceptual results and proposed > >>> salient design goals and challenges surrounding an NVMe HDD. > >> > >> HDD. performance is very sensitive to IO order. Could you provide some > >> background info about NVMe HDD? Such as: > >> > >> - number of hw queues > >> - hw queue depth > >> - will NVMe sort/merge IO among all SQs or not? > >> > >>> > >>> > >>> Discussion Proposal: > >>> > >>> We’d like to share our views and solicit input on: > >>> > >>> -What Linux storage stack assumptions do we need to be aware of as we > >>> develop these devices with drastically different performance > >>> characteristics than traditional NAND? For example, what schedular or > >>> device driver level changes will be needed to integrate NVMe HDDs? > >> > >> IO merge is often important for HDD. 
IO merge is usually triggered when > >> .queue_rq() returns STS_RESOURCE, so far this condition won't be > >> triggered for NVMe SSD. > >> > >> Also blk-mq kills BDI queue congestion and ioc batching, and causes > >> writeback performance regression[1][2]. > >> > >> What I am thinking is that if we need to switch to use independent IO > >> path for handling SSD and HDD. IO, given the two mediums are so > >> different from performance viewpoint. > >> > >> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e= > >> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e= > >> > >> > >> Thanks, > >> Ming > >> > > > > I would expect the drive would support a reasonable number of queues > > and a relatively deep queue depth, more in line with NVMe practices > > than SAS HDD's typical 128. But it probably doesn't make sense to > > queue up thousands of commands on something as slow as an HDD, and > > many customers keep queues < 32 for latency management. > > Exposing an HDD through multiple-queues each with a high queue depth is > simply asking for troubles. Commands will end up spending so much time > sitting in the queues that they will timeout. This can already be observed > with the smartpqi SAS HBA which exposes single drives as multiqueue block > devices with high queue depth. Exercising these drives heavily leads to > thousands of commands being queued and to timeouts. It is fairly easy to > trigger this without a manual change to the QD. 
> This is on my to-do list of
> fixes for some time now (lacking time to do it).

Just wondering why smartpqi SAS won't set a proper .cmd_per_lun to
avoid the issue; it looks like the driver simply assigns .can_queue to
it, so the timeout issue isn't strange. If .can_queue is a bit big, the
HDD is easily kept saturated for too long.

> 
> NVMe HDDs need to have an interface setup that match their speed, that is,
> something like a SAS interface: *single* queue pair with a max QD of 256 or
> less depending on what the drive can take. Their is no TASK_SET_FULL
> notification on NVMe, so throttling has to come from the max QD of the SQ,
> which the drive will advertise to the host.
> 
> > Merge and elevator are important to HDD performance. I don't believe
> > NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> > within a SQ without driving large differences between SSD & HDD data
> > paths?
> 
> As far as I know, there is no merging going on once requests are passed to
> the driver and added to an SQ. So this is beside the point.
> The current default block scheduler for NVMe SSDs is "none". This is
> decided based on the number of queues of the device. For NVMe drives that
> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their
> request queue will can fallback to the default spinning rust mq-deadline
> elevator. That will achieve command merging and LBA ordering needed for
> good performance on HDDs.

mq-deadline basically won't merge IO if STS_RESOURCE isn't returned
from .queue_rq(), or if blk_mq_get_dispatch_budget() always returns
true. NVMe's .queue_rq() basically always returns STS_OK.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-12 22:03 ` Ming Lei @ 2020-02-13 2:40 ` Damien Le Moal 2020-02-13 7:53 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Damien Le Moal @ 2020-02-13 2:40 UTC (permalink / raw) To: Ming Lei; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme Ming, On 2020/02/13 7:03, Ming Lei wrote: > On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote: >> On 2020/02/12 4:01, Tim Walker wrote: >>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: >>>> >>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: >>>>> Background: >>>>> >>>>> NVMe specification has hardened over the decade and now NVMe devices >>>>> are well integrated into our customers’ systems. As we look forward, >>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver >>>>> stack, consolidating on a single access method for rotational and >>>>> static storage technologies. PCIe-NVMe offers near-SATA interface >>>>> costs, features and performance suitable for high-cap HDDs, and >>>>> optimal interoperability for storage automation, tiering, and >>>>> management. We will share some early conceptual results and proposed >>>>> salient design goals and challenges surrounding an NVMe HDD. >>>> >>>> HDD. performance is very sensitive to IO order. Could you provide some >>>> background info about NVMe HDD? Such as: >>>> >>>> - number of hw queues >>>> - hw queue depth >>>> - will NVMe sort/merge IO among all SQs or not? >>>> >>>>> >>>>> >>>>> Discussion Proposal: >>>>> >>>>> We’d like to share our views and solicit input on: >>>>> >>>>> -What Linux storage stack assumptions do we need to be aware of as we >>>>> develop these devices with drastically different performance >>>>> characteristics than traditional NAND? For example, what schedular or >>>>> device driver level changes will be needed to integrate NVMe HDDs? >>>> >>>> IO merge is often important for HDD. 
IO merge is usually triggered when >>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be >>>> triggered for NVMe SSD. >>>> >>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes >>>> writeback performance regression[1][2]. >>>> >>>> What I am thinking is that if we need to switch to use independent IO >>>> path for handling SSD and HDD. IO, given the two mediums are so >>>> different from performance viewpoint. >>>> >>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e= >>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e= >>>> >>>> >>>> Thanks, >>>> Ming >>>> >>> >>> I would expect the drive would support a reasonable number of queues >>> and a relatively deep queue depth, more in line with NVMe practices >>> than SAS HDD's typical 128. But it probably doesn't make sense to >>> queue up thousands of commands on something as slow as an HDD, and >>> many customers keep queues < 32 for latency management. >> >> Exposing an HDD through multiple-queues each with a high queue depth is >> simply asking for troubles. Commands will end up spending so much time >> sitting in the queues that they will timeout. This can already be observed >> with the smartpqi SAS HBA which exposes single drives as multiqueue block >> devices with high queue depth. Exercising these drives heavily leads to >> thousands of commands being queued and to timeouts. It is fairly easy to >> trigger this without a manual change to the QD. 
This is on my to-do list of >> fixes for some time now (lacking time to do it). > > Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for > avoiding the issue, looks the driver simply assigns .can_queue to it, > then it isn't strange to see the timeout issue. If .can_queue is a bit > big, HDD. is easily saturated too long. > >> >> NVMe HDDs need to have an interface setup that match their speed, that is, >> something like a SAS interface: *single* queue pair with a max QD of 256 or >> less depending on what the drive can take. Their is no TASK_SET_FULL >> notification on NVMe, so throttling has to come from the max QD of the SQ, >> which the drive will advertise to the host. >> >>> Merge and elevator are important to HDD performance. I don't believe >>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort >>> within a SQ without driving large differences between SSD & HDD data >>> paths? >> >> As far as I know, there is no merging going on once requests are passed to >> the driver and added to an SQ. So this is beside the point. >> The current default block scheduler for NVMe SSDs is "none". This is >> decided based on the number of queues of the device. For NVMe drives that >> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their >> request queue will can fallback to the default spinning rust mq-deadline >> elevator. That will achieve command merging and LBA ordering needed for >> good performance on HDDs. > > mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from > .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's > .queue_rq() basically always returns STS_OK. I am confused: when an elevator is set, ->queue_rq() is called for requests obtained from the elevator (with e->type->ops.dispatch_request()), after the requests went through it. And merging will happen at that stage when new requests are inserted in the elevator. 
If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE,
the request is indeed requeued, which offers more chances of further
merging, but that is not the same as no merging happening.
Am I missing your point here?

> 
> 
> Thanks,
> Ming
> 

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 2:40 ` Damien Le Moal @ 2020-02-13 7:53 ` Ming Lei 2020-02-13 8:24 ` Damien Le Moal 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2020-02-13 7:53 UTC (permalink / raw) To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote: > Ming, > > On 2020/02/13 7:03, Ming Lei wrote: > > On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote: > >> On 2020/02/12 4:01, Tim Walker wrote: > >>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: > >>>> > >>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: > >>>>> Background: > >>>>> > >>>>> NVMe specification has hardened over the decade and now NVMe devices > >>>>> are well integrated into our customers’ systems. As we look forward, > >>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver > >>>>> stack, consolidating on a single access method for rotational and > >>>>> static storage technologies. PCIe-NVMe offers near-SATA interface > >>>>> costs, features and performance suitable for high-cap HDDs, and > >>>>> optimal interoperability for storage automation, tiering, and > >>>>> management. We will share some early conceptual results and proposed > >>>>> salient design goals and challenges surrounding an NVMe HDD. > >>>> > >>>> HDD. performance is very sensitive to IO order. Could you provide some > >>>> background info about NVMe HDD? Such as: > >>>> > >>>> - number of hw queues > >>>> - hw queue depth > >>>> - will NVMe sort/merge IO among all SQs or not? > >>>> > >>>>> > >>>>> > >>>>> Discussion Proposal: > >>>>> > >>>>> We’d like to share our views and solicit input on: > >>>>> > >>>>> -What Linux storage stack assumptions do we need to be aware of as we > >>>>> develop these devices with drastically different performance > >>>>> characteristics than traditional NAND? 
For example, what schedular or > >>>>> device driver level changes will be needed to integrate NVMe HDDs? > >>>> > >>>> IO merge is often important for HDD. IO merge is usually triggered when > >>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be > >>>> triggered for NVMe SSD. > >>>> > >>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes > >>>> writeback performance regression[1][2]. > >>>> > >>>> What I am thinking is that if we need to switch to use independent IO > >>>> path for handling SSD and HDD. IO, given the two mediums are so > >>>> different from performance viewpoint. > >>>> > >>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_Pine.LNX.4.44L0.1909181213141.1507-2D100000-40iolanthe.rowland.org_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=tsnFP8bQIAq7G66B75LTe3vo4K14HbL9JJKsxl_LPAw&e= > >>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dscsi_20191226083706.GA17974-40ming.t460p_&d=DwIFaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=NW1X0yRHNNEluZ8sOGXBxCbQJZPWcIkPT0Uy3ynVsFU&m=pSnHpt_uQQ73JV4VIQg1C_PVAcLvqBBtmyxQHwWjGSM&s=GJwSxXtc_qZHKnrTqSbytUjuRrrQgZpvV3bxZYFDHe4&e= > >>>> > >>>> > >>>> Thanks, > >>>> Ming > >>>> > >>> > >>> I would expect the drive would support a reasonable number of queues > >>> and a relatively deep queue depth, more in line with NVMe practices > >>> than SAS HDD's typical 128. But it probably doesn't make sense to > >>> queue up thousands of commands on something as slow as an HDD, and > >>> many customers keep queues < 32 for latency management. > >> > >> Exposing an HDD through multiple-queues each with a high queue depth is > >> simply asking for troubles. Commands will end up spending so much time > >> sitting in the queues that they will timeout. 
This can already be observed > >> with the smartpqi SAS HBA which exposes single drives as multiqueue block > >> devices with high queue depth. Exercising these drives heavily leads to > >> thousands of commands being queued and to timeouts. It is fairly easy to > >> trigger this without a manual change to the QD. This is on my to-do list of > >> fixes for some time now (lacking time to do it). > > > > Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for > > avoiding the issue, looks the driver simply assigns .can_queue to it, > > then it isn't strange to see the timeout issue. If .can_queue is a bit > > big, HDD. is easily saturated too long. > > > >> > >> NVMe HDDs need to have an interface setup that match their speed, that is, > >> something like a SAS interface: *single* queue pair with a max QD of 256 or > >> less depending on what the drive can take. Their is no TASK_SET_FULL > >> notification on NVMe, so throttling has to come from the max QD of the SQ, > >> which the drive will advertise to the host. > >> > >>> Merge and elevator are important to HDD performance. I don't believe > >>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort > >>> within a SQ without driving large differences between SSD & HDD data > >>> paths? > >> > >> As far as I know, there is no merging going on once requests are passed to > >> the driver and added to an SQ. So this is beside the point. > >> The current default block scheduler for NVMe SSDs is "none". This is > >> decided based on the number of queues of the device. For NVMe drives that > >> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their > >> request queue will can fallback to the default spinning rust mq-deadline > >> elevator. That will achieve command merging and LBA ordering needed for > >> good performance on HDDs. > > > > mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from > > .queue_rq(), or blk_mq_get_dispatch_budget always return true. 
NVMe's > > .queue_rq() basically always returns STS_OK. > > I am confused: when an elevator is set, ->queue_rq() is called for requests > obtained from the elevator (with e->type->ops.dispatch_request()), after > the requests went through it. And merging will happen at that stage when > new requests are inserted in the elevator. When a request is queued to the LLD via .queue_rq(), it has already been removed from the scheduler queue, and IO merging is only done inside, or against, the scheduler queue. > > If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the > request is indeed requeued which offer more chances of further merging, but > that is not the same as no merging happening. > Am I missing your point here ? BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE, or getting no budget can be thought of as device saturation feedback: more requests can then gather in the scheduler queue, since we don't dequeue requests from the scheduler queue when that happens, and IO merging becomes possible. Without any device saturation feedback from the driver, the block layer just dequeues requests from the scheduler queue at the same speed they are submitted to hardware, so no IO can be merged. If you observe sequential IO on an NVMe PCI device, you will see basically no IO merging. Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 7:53 ` Ming Lei @ 2020-02-13 8:24 ` Damien Le Moal 2020-02-13 8:34 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Damien Le Moal @ 2020-02-13 8:24 UTC (permalink / raw) To: Ming Lei; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme On 2020/02/13 16:54, Ming Lei wrote: > On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote: >> Ming, >> >> On 2020/02/13 7:03, Ming Lei wrote: >>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote: >>>> On 2020/02/12 4:01, Tim Walker wrote: >>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: >>>>>> >>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: >>>>>>> Background: >>>>>>> >>>>>>> NVMe specification has hardened over the decade and now NVMe devices >>>>>>> are well integrated into our customers’ systems. As we look forward, >>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver >>>>>>> stack, consolidating on a single access method for rotational and >>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface >>>>>>> costs, features and performance suitable for high-cap HDDs, and >>>>>>> optimal interoperability for storage automation, tiering, and >>>>>>> management. We will share some early conceptual results and proposed >>>>>>> salient design goals and challenges surrounding an NVMe HDD. >>>>>> >>>>>> HDD. performance is very sensitive to IO order. Could you provide some >>>>>> background info about NVMe HDD? Such as: >>>>>> >>>>>> - number of hw queues >>>>>> - hw queue depth >>>>>> - will NVMe sort/merge IO among all SQs or not? >>>>>> >>>>>>> >>>>>>> >>>>>>> Discussion Proposal: >>>>>>> >>>>>>> We’d like to share our views and solicit input on: >>>>>>> >>>>>>> -What Linux storage stack assumptions do we need to be aware of as we >>>>>>> develop these devices with drastically different performance >>>>>>> characteristics than traditional NAND? 
For example, what schedular or >>>>>>> device driver level changes will be needed to integrate NVMe HDDs? >>>>>> >>>>>> IO merge is often important for HDD. IO merge is usually triggered when >>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be >>>>>> triggered for NVMe SSD. >>>>>> >>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes >>>>>> writeback performance regression[1][2]. >>>>>> >>>>>> What I am thinking is that if we need to switch to use independent IO >>>>>> path for handling SSD and HDD. IO, given the two mediums are so >>>>>> different from performance viewpoint. >>>>>> >>>>>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/ >>>>>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/ >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Ming >>>>>> >>>>> >>>>> I would expect the drive would support a reasonable number of queues >>>>> and a relatively deep queue depth, more in line with NVMe practices >>>>> than SAS HDD's typical 128. But it probably doesn't make sense to >>>>> queue up thousands of commands on something as slow as an HDD, and >>>>> many customers keep queues < 32 for latency management. >>>> >>>> Exposing an HDD through multiple-queues each with a high queue depth is >>>> simply asking for troubles. Commands will end up spending so much time >>>> sitting in the queues that they will timeout.
This can already be observed >>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block >>>> devices with high queue depth. Exercising these drives heavily leads to >>>> thousands of commands being queued and to timeouts. It is fairly easy to >>>> trigger this without a manual change to the QD. This is on my to-do list of >>>> fixes for some time now (lacking time to do it). >>> >>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for >>> avoiding the issue, looks the driver simply assigns .can_queue to it, >>> then it isn't strange to see the timeout issue. If .can_queue is a bit >>> big, HDD. is easily saturated too long. >>> >>>> >>>> NVMe HDDs need to have an interface setup that match their speed, that is, >>>> something like a SAS interface: *single* queue pair with a max QD of 256 or >>>> less depending on what the drive can take. Their is no TASK_SET_FULL >>>> notification on NVMe, so throttling has to come from the max QD of the SQ, >>>> which the drive will advertise to the host. >>>> >>>>> Merge and elevator are important to HDD performance. I don't believe >>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort >>>>> within a SQ without driving large differences between SSD & HDD data >>>>> paths? >>>> >>>> As far as I know, there is no merging going on once requests are passed to >>>> the driver and added to an SQ. So this is beside the point. >>>> The current default block scheduler for NVMe SSDs is "none". This is >>>> decided based on the number of queues of the device. For NVMe drives that >>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their >>>> request queue will can fallback to the default spinning rust mq-deadline >>>> elevator. That will achieve command merging and LBA ordering needed for >>>> good performance on HDDs. >>> >>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from >>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. 
NVMe's >>> .queue_rq() basically always returns STS_OK. >> >> I am confused: when an elevator is set, ->queue_rq() is called for requests >> obtained from the elevator (with e->type->ops.dispatch_request()), after >> the requests went through it. And merging will happen at that stage when >> new requests are inserted in the elevator. > > When request is queued to lld via .queue_rq(), the request has been > removed from scheduler queue. And IO merge is just done inside or > against scheduler queue. Yes, for incoming new BIOs, not for requests passed to the LLD. >> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the >> request is indeed requeued which offer more chances of further merging, but >> that is not the same as no merging happening. >> Am I missing your point here ? > > BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be > thought as device saturation feedback, then more requests can be > gathered in scheduler queue since we don't dequeue request from > scheduler queue when that happens, then IO merge is possible. > > Without any device saturation feedback from driver, block layer just > dequeues request from scheduler queue with same speed of submission to > hardware, then no IO can be merged. Got it. And since queue full will mean no more tags, submission will block on get_request() and there will be no chance in the elevator to merge anything (aside from opportunistic merging in plugs), isn't it ? So I guess NVMe HDDs will need some tuning in this area. > > If you observe sequential IO on NVMe PCI, you will see no IO merge > basically. > > > Thanks, > Ming > > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 8:24 ` Damien Le Moal @ 2020-02-13 8:34 ` Ming Lei 2020-02-13 16:30 ` Keith Busch 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2020-02-13 8:34 UTC (permalink / raw) To: Damien Le Moal; +Cc: Tim Walker, linux-block, linux-scsi, linux-nvme On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote: > On 2020/02/13 16:54, Ming Lei wrote: > > On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote: > >> Ming, > >> > >> On 2020/02/13 7:03, Ming Lei wrote: > >>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote: > >>>> On 2020/02/12 4:01, Tim Walker wrote: > >>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: > >>>>>> > >>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: > >>>>>>> Background: > >>>>>>> > >>>>>>> NVMe specification has hardened over the decade and now NVMe devices > >>>>>>> are well integrated into our customers’ systems. As we look forward, > >>>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver > >>>>>>> stack, consolidating on a single access method for rotational and > >>>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface > >>>>>>> costs, features and performance suitable for high-cap HDDs, and > >>>>>>> optimal interoperability for storage automation, tiering, and > >>>>>>> management. We will share some early conceptual results and proposed > >>>>>>> salient design goals and challenges surrounding an NVMe HDD. > >>>>>> > >>>>>> HDD. performance is very sensitive to IO order. Could you provide some > >>>>>> background info about NVMe HDD? Such as: > >>>>>> > >>>>>> - number of hw queues > >>>>>> - hw queue depth > >>>>>> - will NVMe sort/merge IO among all SQs or not? 
> >>>>>> > >>>>>>> > >>>>>>> Discussion Proposal: > >>>>>>> > >>>>>>> We’d like to share our views and solicit input on: > >>>>>>> > >>>>>>> -What Linux storage stack assumptions do we need to be aware of as we > >>>>>>> develop these devices with drastically different performance > >>>>>>> characteristics than traditional NAND? For example, what schedular or > >>>>>>> device driver level changes will be needed to integrate NVMe HDDs? > >>>>>> > >>>>>> IO merge is often important for HDD. IO merge is usually triggered when > >>>>>> .queue_rq() returns STS_RESOURCE, so far this condition won't be > >>>>>> triggered for NVMe SSD. > >>>>>> > >>>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes > >>>>>> writeback performance regression[1][2]. > >>>>>> > >>>>>> What I am thinking is that if we need to switch to use independent IO > >>>>>> path for handling SSD and HDD. IO, given the two mediums are so > >>>>>> different from performance viewpoint. > >>>>>> > >>>>>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/ > >>>>>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/ > >>>>>> > >>>>>> > >>>>>> Thanks, > >>>>>> Ming > >>>>>> > >>>>> > >>>>> I would expect the drive would support a reasonable number of queues > >>>>> and a relatively deep queue depth, more in line with NVMe practices > >>>>> than SAS HDD's typical 128.
But it probably doesn't make sense to > >>>>> queue up thousands of commands on something as slow as an HDD, and > >>>>> many customers keep queues < 32 for latency management. > >>>> > >>>> Exposing an HDD through multiple-queues each with a high queue depth is > >>>> simply asking for troubles. Commands will end up spending so much time > >>>> sitting in the queues that they will timeout. This can already be observed > >>>> with the smartpqi SAS HBA which exposes single drives as multiqueue block > >>>> devices with high queue depth. Exercising these drives heavily leads to > >>>> thousands of commands being queued and to timeouts. It is fairly easy to > >>>> trigger this without a manual change to the QD. This is on my to-do list of > >>>> fixes for some time now (lacking time to do it). > >>> > >>> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun for > >>> avoiding the issue, looks the driver simply assigns .can_queue to it, > >>> then it isn't strange to see the timeout issue. If .can_queue is a bit > >>> big, HDD. is easily saturated too long. > >>> > >>>> > >>>> NVMe HDDs need to have an interface setup that match their speed, that is, > >>>> something like a SAS interface: *single* queue pair with a max QD of 256 or > >>>> less depending on what the drive can take. Their is no TASK_SET_FULL > >>>> notification on NVMe, so throttling has to come from the max QD of the SQ, > >>>> which the drive will advertise to the host. > >>>> > >>>>> Merge and elevator are important to HDD performance. I don't believe > >>>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort > >>>>> within a SQ without driving large differences between SSD & HDD data > >>>>> paths? > >>>> > >>>> As far as I know, there is no merging going on once requests are passed to > >>>> the driver and added to an SQ. So this is beside the point. > >>>> The current default block scheduler for NVMe SSDs is "none". 
This is > >>>> decided based on the number of queues of the device. For NVMe drives that > >>>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their > >>>> request queue will can fallback to the default spinning rust mq-deadline > >>>> elevator. That will achieve command merging and LBA ordering needed for > >>>> good performance on HDDs. > >>> > >>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from > >>> .queue_rq(), or blk_mq_get_dispatch_budget always return true. NVMe's > >>> .queue_rq() basically always returns STS_OK. > >> > >> I am confused: when an elevator is set, ->queue_rq() is called for requests > >> obtained from the elevator (with e->type->ops.dispatch_request()), after > >> the requests went through it. And merging will happen at that stage when > >> new requests are inserted in the elevator. > > > > When request is queued to lld via .queue_rq(), the request has been > > removed from scheduler queue. And IO merge is just done inside or > > against scheduler queue. > > Yes, for incoming new BIOs, not for requests passed to the LLD. > > >> If the ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the > >> request is indeed requeued which offer more chances of further merging, but > >> that is not the same as no merging happening. > >> Am I missing your point here ? > > > > BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE or getting no budget can be > > thought as device saturation feedback, then more requests can be > > gathered in scheduler queue since we don't dequeue request from > > scheduler queue when that happens, then IO merge is possible. > > > > Without any device saturation feedback from driver, block layer just > > dequeues request from scheduler queue with same speed of submission to > > hardware, then no IO can be merged. > > Got it. 
And since queue full will mean no more tags, submission will block > on get_request() and there will be no chance in the elevator to merge > anything (aside from opportunistic merging in plugs), isn't it ? > So I guess NVMe HDDs will need some tuning in this area. The scheduler queue depth is usually 2 times the hw queue depth, so requests are usually plentiful enough for merging. For NVMe there is no per-namespace queue depth, such as SCSI's device queue depth; meanwhile, the hw queue depth is big enough, so there is no chance to trigger merging. Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 8:34 ` Ming Lei @ 2020-02-13 16:30 ` Keith Busch 2020-02-14 0:40 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Keith Busch @ 2020-02-13 16:30 UTC (permalink / raw) To: Ming Lei; +Cc: Damien Le Moal, Tim Walker, linux-block, linux-scsi, linux-nvme On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote: > On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote: > > Got it. And since queue full will mean no more tags, submission will block > > on get_request() and there will be no chance in the elevator to merge > > anything (aside from opportunistic merging in plugs), isn't it ? > > So I guess NVMe HDDs will need some tuning in this area. > > scheduler queue depth is usually 2 times of hw queue depth, so requests > ar usually enough for merging. > > For NVMe, there isn't ns queue depth, such as scsi's device queue depth, > meantime the hw queue depth is big enough, so no chance to trigger merge. Most NVMe devices contain a single namespace anyway, so the shared tag queue depth is effectively the ns queue depth, and an NVMe HDD should advertise queue count and depth capabilities orders of magnitude lower than what we're used to with nvme SSDs. That should get merging and BLK_STS_DEV_RESOURCE handling to occur as desired, right? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 16:30 ` Keith Busch @ 2020-02-14 0:40 ` Ming Lei 0 siblings, 0 replies; 32+ messages in thread From: Ming Lei @ 2020-02-14 0:40 UTC (permalink / raw) To: Keith Busch Cc: linux-block, Damien Le Moal, linux-nvme, Tim Walker, linux-scsi On Fri, Feb 14, 2020 at 01:30:38AM +0900, Keith Busch wrote: > On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote: > > On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote: > > > Got it. And since queue full will mean no more tags, submission will block > > > on get_request() and there will be no chance in the elevator to merge > > > anything (aside from opportunistic merging in plugs), isn't it ? > > > So I guess NVMe HDDs will need some tuning in this area. > > > > scheduler queue depth is usually 2 times of hw queue depth, so requests > > ar usually enough for merging. > > > > For NVMe, there isn't ns queue depth, such as scsi's device queue depth, > > meantime the hw queue depth is big enough, so no chance to trigger merge. > > Most NVMe devices contain a single namespace anyway, so the shared tag > queue depth is effectively the ns queue depth, and an NVMe HDD should > advertise queue count and depth capabilities orders of magnitude lower > than what we're used to with nvme SSDs. That should get merging and > BLK_STS_DEV_RESOURCE handling to occur as desired, right? Right. The advertised queue depth might serve two purposes: 1) reflect the namespace's actual queueing capability, so block layer's merging is possible 2) avoid timeout caused by too many in-flight IO Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-12 1:47 ` Damien Le Moal 2020-02-12 22:03 ` Ming Lei @ 2020-02-13 3:02 ` Martin K. Petersen 2020-02-13 3:12 ` Tim Walker 2020-02-14 0:35 ` Ming Lei 1 sibling, 2 replies; 32+ messages in thread From: Martin K. Petersen @ 2020-02-13 3:02 UTC (permalink / raw) To: Damien Le Moal; +Cc: Tim Walker, Ming Lei, linux-block, linux-scsi, linux-nvme Damien, > Exposing an HDD through multiple-queues each with a high queue depth > is simply asking for troubles. Commands will end up spending so much > time sitting in the queues that they will timeout. Yep! > This can already be observed with the smartpqi SAS HBA which exposes > single drives as multiqueue block devices with high queue depth. > Exercising these drives heavily leads to thousands of commands being > queued and to timeouts. It is fairly easy to trigger this without a > manual change to the QD. This is on my to-do list of fixes for some > time now (lacking time to do it). Controllers that queue internally are very susceptible to application or filesystem timeouts when drives are struggling to keep up. > NVMe HDDs need to have an interface setup that match their speed, that > is, something like a SAS interface: *single* queue pair with a max QD > of 256 or less depending on what the drive can take. Their is no > TASK_SET_FULL notification on NVMe, so throttling has to come from the > max QD of the SQ, which the drive will advertise to the host. At the very minimum we'll need low queue depths. But I have my doubts whether we can make this work well enough without some kind of TASK SET FULL style AER to throttle the I/O. > NVMe specs will need an update to have a "NONROT" (non-rotational) bit in > the identify data for all this to fit well in the current stack. Absolutely. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 32+ messages in thread
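The QF/TSF throttling Martin describes is essentially feedback control of the effective queue depth. Here is a minimal user-space sketch of the idea (the constants and helper names are invented for this sketch; in the SCSI midlayer the real logic lives in the queue-full tracking around scsi_track_queue_full()): back off quickly on a QUEUE FULL response, ramp back up slowly after a long run of successful completions.

```c
#include <assert.h>

#define MAX_DEPTH 64             /* toy starting queue depth */

static int qdepth = MAX_DEPTH;   /* effective device queue depth */
static int good_streak;

/* QF/TSF received: halve the depth so the drive can drain */
static void on_queue_full(void)
{
    good_streak = 0;
    if (qdepth > 1)
        qdepth /= 2;             /* back off quickly */
}

/* successful completion: after a long streak, raise depth by one */
static void on_good_completion(void)
{
    if (++good_streak >= 128 && qdepth < MAX_DEPTH) {
        qdepth++;                /* ramp up slowly */
        good_streak = 0;
    }
}
```

The asymmetry (halve on congestion, increment on success) is what keeps the in-flight count near the drive's real capability without oscillating, which is the behavior NVMe currently has no wire-level signal to drive.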
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 3:02 ` Martin K. Petersen @ 2020-02-13 3:12 ` Tim Walker 2020-02-13 4:17 ` Martin K. Petersen 2020-02-14 0:35 ` Ming Lei 1 sibling, 1 reply; 32+ messages in thread From: Tim Walker @ 2020-02-13 3:12 UTC (permalink / raw) To: Martin K. Petersen Cc: Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On Wed, Feb 12, 2020 at 10:02 PM Martin K. Petersen <martin.petersen@oracle.com> wrote: > > > Damien, > > > Exposing an HDD through multiple-queues each with a high queue depth > > is simply asking for troubles. Commands will end up spending so much > > time sitting in the queues that they will timeout. > > Yep! > > > This can already be observed with the smartpqi SAS HBA which exposes > > single drives as multiqueue block devices with high queue depth. > > Exercising these drives heavily leads to thousands of commands being > > queued and to timeouts. It is fairly easy to trigger this without a > > manual change to the QD. This is on my to-do list of fixes for some > > time now (lacking time to do it). > > Controllers that queue internally are very susceptible to application or > filesystem timeouts when drives are struggling to keep up. > > > NVMe HDDs need to have an interface setup that match their speed, that > > is, something like a SAS interface: *single* queue pair with a max QD > > of 256 or less depending on what the drive can take. Their is no > > TASK_SET_FULL notification on NVMe, so throttling has to come from the > > max QD of the SQ, which the drive will advertise to the host. > > At the very minimum we'll need low queue depths. But I have my doubts > whether we can make this work well enough without some kind of TASK SET > FULL style AER to throttle the I/O. > > > NVMe specs will need an update to have a "NONROT" (non-rotational) bit in > > the identify data for all this to fit well in the current stack. > > Absolutely. > > -- > Martin K. 
Petersen Oracle Linux Engineering Hi all- We already anticipated the need for the "spinning rust" bit, so it is already in place (on paper, at least). SAS currently supports QD256, but the general consensus is that most customers don't run anywhere near that deep. Does it help the system for the HD to report a limited (256) max queue depth, or is it really up to the system to decide how many commands to queue? Regarding the number of SQ pairs, I think an HDD would function well with only one. Some thoughts on why we would want >1: -A priority-based SQ servicing algorithm that would permit low-priority commands to be queued in a dedicated SQ. -The host may want an SQ per actuator for multi-actuator devices. There may be others that I haven't thought of, but you get the idea. At any rate, the drive can support as many queue-pairs as it wants to - we can use as few as makes sense. Since NVMe doesn't guarantee command execution order, it seems the zoned block version of an NVMe HDD would need to support zone append. Do you agree? -- Tim Walker Product Design Systems Engineering, Seagate Technology (303) 775-3770 ^ permalink raw reply [flat|nested] 32+ messages in thread
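On the zone append question: the reason it suits an unordered, multi-queue transport is that the host never chooses the LBA; the device writes at the zone's write pointer and reports back where the data landed. A toy device-side model (invented structures for illustration, not the actual ZNS definitions):

```c
#include <assert.h>

struct zone {
    long start;   /* first LBA of the zone */
    long wp;      /* current write pointer */
    long cap;     /* writable capacity in blocks */
};

/* Zone Append: the data always lands at the write pointer, and the
 * chosen LBA is returned to the host in the completion. Two appends
 * executed in either order never collide, which is why strict
 * submission ordering isn't needed for sequential zones. */
static long zone_append(struct zone *z, long nr_blocks)
{
    if (z->wp + nr_blocks > z->start + z->cap)
        return -1;               /* zone full */
    long lba = z->wp;
    z->wp += nr_blocks;
    return lba;
}
```

Contrast this with a regular write to a zoned device, where the host must target the write pointer exactly, so any reordering between queues produces an error.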
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 3:12 ` Tim Walker @ 2020-02-13 4:17 ` Martin K. Petersen 2020-02-14 7:32 ` Hannes Reinecke 0 siblings, 1 reply; 32+ messages in thread From: Martin K. Petersen @ 2020-02-13 4:17 UTC (permalink / raw) To: Tim Walker Cc: Martin K. Petersen, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme Tim, > SAS currently supports QD256, but the general consensus is that most > customers don't run anywhere near that deep. Does it help the system > for the HD to report a limited (256) max queue depth, or is it really > up to the system to decide many commands to queue? People often artificially lower the queue depth to avoid timeouts. The default timeout is 30 seconds from an I/O is queued. However, many enterprise applications set the timeout to 3-5 seconds. Which means that with deep queues you'll quickly start seeing timeouts if a drive temporarily is having issues keeping up (media errors, excessive spare track seeks, etc.). Well-behaved devices will return QF/TSF if they have transient resource starvation or exceed internal QoS limits. QF will cause the SCSI stack to reduce the number of I/Os in flight. This allows the drive to recover from its congested state and reduces the potential of application and filesystem timeouts. > Regarding number of SQ pairs, I think HDD would function well with > only one. Some thoughts on why we would want >1: > -A priority-based SQ servicing algorithm that would permit > low-priority commands to be queued in a dedicated SQ. > -The host may want an SQ per actuator for multi-actuator devices. That's fine. I think we're just saying that the common practice of allocating very deep queues for each CPU core in the system will lead to problems since the host will inevitably be able to queue much more I/O than the drive can realistically complete. > Since NVMe doesn't guarantee command execution order, it seems the > zoned block version of an NVME HDD would need to support zone append. 
> Do you agree? Absolutely! -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 4:17 ` Martin K. Petersen @ 2020-02-14 7:32 ` Hannes Reinecke 2020-02-14 14:40 ` Keith Busch 0 siblings, 1 reply; 32+ messages in thread From: Hannes Reinecke @ 2020-02-14 7:32 UTC (permalink / raw) To: Martin K. Petersen, Tim Walker Cc: Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On 2/13/20 5:17 AM, Martin K. Petersen wrote: > > Tim, > >> SAS currently supports QD256, but the general consensus is that most >> customers don't run anywhere near that deep. Does it help the system >> for the HD to report a limited (256) max queue depth, or is it really >> up to the system to decide many commands to queue? > > People often artificially lower the queue depth to avoid timeouts. The > default timeout is 30 seconds from an I/O is queued. However, many > enterprise applications set the timeout to 3-5 seconds. Which means that > with deep queues you'll quickly start seeing timeouts if a drive > temporarily is having issues keeping up (media errors, excessive spare > track seeks, etc.). > > Well-behaved devices will return QF/TSF if they have transient resource > starvation or exceed internal QoS limits. QF will cause the SCSI stack > to reduce the number of I/Os in flight. This allows the drive to recover > from its congested state and reduces the potential of application and > filesystem timeouts. > This may even be a chance to revisit QoS / queue busy handling. NVMe has this SQ head pointer mechanism which was supposed to handle this kind of situation, but to my knowledge no-one has implemented it. Might be worthwhile revisiting it; I guess NVMe HDDs would profit from that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-14 7:32 ` Hannes Reinecke @ 2020-02-14 14:40 ` Keith Busch 2020-02-14 16:04 ` Hannes Reinecke 0 siblings, 1 reply; 32+ messages in thread From: Keith Busch @ 2020-02-14 14:40 UTC (permalink / raw) To: Hannes Reinecke Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote: > On 2/13/20 5:17 AM, Martin K. Petersen wrote: > > People often artificially lower the queue depth to avoid timeouts. The > > default timeout is 30 seconds from an I/O is queued. However, many > > enterprise applications set the timeout to 3-5 seconds. Which means that > > with deep queues you'll quickly start seeing timeouts if a drive > > temporarily is having issues keeping up (media errors, excessive spare > > track seeks, etc.). > > > > Well-behaved devices will return QF/TSF if they have transient resource > > starvation or exceed internal QoS limits. QF will cause the SCSI stack > > to reduce the number of I/Os in flight. This allows the drive to recover > > from its congested state and reduces the potential of application and > > filesystem timeouts. > > > This may even be a chance to revisit QoS / queue busy handling. > NVMe has this SQ head pointer mechanism which was supposed to handle > this kind of situations, but to my knowledge no-one has been > implementing it. > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that. We don't need that because we don't allocate enough tags to potentially wrap the tail past the head. If you can allocate a tag, the queue is not full. And conversely, no tag == queue full. ^ permalink raw reply [flat|nested] 32+ messages in thread
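Keith's invariant — a tag allocation succeeding if and only if the queue has room — can be sketched as a bitmap allocator sized exactly to the queue depth (a toy model for illustration; blk-mq's real implementation is the sbitmap code):

```c
#include <stdint.h>

#define QUEUE_DEPTH 32           /* toy tag set sized to the SQ depth */

static uint64_t tag_bitmap;      /* bit i set => tag i in use */

/* Returns a free tag, or -1 when all tags are in use. Because the tag
 * set is no larger than the queue, "no tag" and "queue full" are the
 * same condition, and the SQ tail can never wrap past the head. */
static int alloc_tag(void)
{
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (!(tag_bitmap & (1ULL << i))) {
            tag_bitmap |= 1ULL << i;
            return i;
        }
    }
    return -1;                   /* no tag == queue full */
}

static void free_tag(int tag)
{
    tag_bitmap &= ~(1ULL << tag);  /* completion frees the slot */
}
```

In the kernel, a submitter that cannot get a tag simply sleeps until a completion frees one, which is why host-side accounting makes the SQ head pointer reported in completions unnecessary for flow control.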
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-14 14:40 ` Keith Busch @ 2020-02-14 16:04 ` Hannes Reinecke 2020-02-14 17:05 ` Keith Busch 0 siblings, 1 reply; 32+ messages in thread From: Hannes Reinecke @ 2020-02-14 16:04 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On 2/14/20 3:40 PM, Keith Busch wrote: > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote: >> On 2/13/20 5:17 AM, Martin K. Petersen wrote: >>> People often artificially lower the queue depth to avoid timeouts. The >>> default timeout is 30 seconds from an I/O is queued. However, many >>> enterprise applications set the timeout to 3-5 seconds. Which means that >>> with deep queues you'll quickly start seeing timeouts if a drive >>> temporarily is having issues keeping up (media errors, excessive spare >>> track seeks, etc.). >>> >>> Well-behaved devices will return QF/TSF if they have transient resource >>> starvation or exceed internal QoS limits. QF will cause the SCSI stack >>> to reduce the number of I/Os in flight. This allows the drive to recover >>> from its congested state and reduces the potential of application and >>> filesystem timeouts. >>> >> This may even be a chance to revisit QoS / queue busy handling. >> NVMe has this SQ head pointer mechanism which was supposed to handle >> this kind of situations, but to my knowledge no-one has been >> implementing it. >> Might be worthwhile revisiting it; guess NVMe HDDs would profit from that. > > We don't need that because we don't allocate enough tags to potentially > wrap the tail past the head. If you can allocate a tag, the queue is not > full. And convesely, no tag == queue full. > It's not a problem on our side. It's a problem on the target/controller side. The target/controller might have a need to throttle I/O (due to QoS settings or competing resources from other hosts), but currently no means of signalling that to the host. 
Which, incidentally, is the underlying reason for the DNR handling discussion we had; NetApp tried to model QoS by sending "Namespace not ready" without the DNR bit set, which of course is a totally different use-case from the typical 'Namespace not ready' response we get (with the DNR bit set) when a namespace was unmapped. And that is where SQ head pointer updates come in; it would allow the controller to signal back to the host that it should hold off sending I/O for a bit. So this could / might be used for NVMe HDDs, too, which also might have a need to signal back to the host that I/Os should be throttled... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-14 16:04 ` Hannes Reinecke @ 2020-02-14 17:05 ` Keith Busch 2020-02-18 15:54 ` Tim Walker 0 siblings, 1 reply; 32+ messages in thread From: Keith Busch @ 2020-02-14 17:05 UTC (permalink / raw) To: Hannes Reinecke Cc: Martin K. Petersen, Tim Walker, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote: > On 2/14/20 3:40 PM, Keith Busch wrote: > > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote: > > > On 2/13/20 5:17 AM, Martin K. Petersen wrote: > > > > People often artificially lower the queue depth to avoid timeouts. The > > > > default timeout is 30 seconds from an I/O is queued. However, many > > > > enterprise applications set the timeout to 3-5 seconds. Which means that > > > > with deep queues you'll quickly start seeing timeouts if a drive > > > > temporarily is having issues keeping up (media errors, excessive spare > > > > track seeks, etc.). > > > > > > > > Well-behaved devices will return QF/TSF if they have transient resource > > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack > > > > to reduce the number of I/Os in flight. This allows the drive to recover > > > > from its congested state and reduces the potential of application and > > > > filesystem timeouts. > > > > > > > This may even be a chance to revisit QoS / queue busy handling. > > > NVMe has this SQ head pointer mechanism which was supposed to handle > > > this kind of situations, but to my knowledge no-one has been > > > implementing it. > > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that. > > > > We don't need that because we don't allocate enough tags to potentially > > wrap the tail past the head. If you can allocate a tag, the queue is not > > full. And convesely, no tag == queue full. > > > It's not a problem on our side. > It's a problem on the target/controller side. 
> The target/controller might have a need to throttle I/O (due to QoS settings > or competing resources from other hosts), but currently no means of > signalling that to the host. > Which, incidentally, is the underlying reason for the DNR handling > discussion we had; NetApp tried to model QoS by sending "Namespace not > ready" without the DNR bit set, which of course is a totally different > use-case as the typical 'Namespace not ready' response we get (with the DNR > bit set) when a namespace was unmapped. > > And that is where SQ head pointer updates comes in; it would allow the > controller to signal back to the host that it should hold off sending I/O > for a bit. > So this could / might be used for NVMe HDDs, too, which also might have a > need to signal back to the host that I/Os should be throttled... Okay, I see. I think this needs a new nvme AER notice as Martin suggested. The desired host behavior is similar to what we do with a "firmware activation notice" where we temporarily quiesce new requests and reset IO timeouts for previously dispatched requests. Perhaps tie this to the CSTS.PP register as well. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-14 17:05 ` Keith Busch @ 2020-02-18 15:54 ` Tim Walker 2020-02-18 17:41 ` Keith Busch 0 siblings, 1 reply; 32+ messages in thread From: Tim Walker @ 2020-02-18 15:54 UTC (permalink / raw) To: Keith Busch Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On Fri, Feb 14, 2020 at 12:05 PM Keith Busch <kbusch@kernel.org> wrote: > > On Fri, Feb 14, 2020 at 05:04:25PM +0100, Hannes Reinecke wrote: > > On 2/14/20 3:40 PM, Keith Busch wrote: > > > On Fri, Feb 14, 2020 at 08:32:57AM +0100, Hannes Reinecke wrote: > > > > On 2/13/20 5:17 AM, Martin K. Petersen wrote: > > > > > People often artificially lower the queue depth to avoid timeouts. The > > > > > default timeout is 30 seconds from an I/O is queued. However, many > > > > > enterprise applications set the timeout to 3-5 seconds. Which means that > > > > > with deep queues you'll quickly start seeing timeouts if a drive > > > > > temporarily is having issues keeping up (media errors, excessive spare > > > > > track seeks, etc.). > > > > > > > > > > Well-behaved devices will return QF/TSF if they have transient resource > > > > > starvation or exceed internal QoS limits. QF will cause the SCSI stack > > > > > to reduce the number of I/Os in flight. This allows the drive to recover > > > > > from its congested state and reduces the potential of application and > > > > > filesystem timeouts. > > > > > > > > > This may even be a chance to revisit QoS / queue busy handling. > > > > NVMe has this SQ head pointer mechanism which was supposed to handle > > > > this kind of situations, but to my knowledge no-one has been > > > > implementing it. > > > > Might be worthwhile revisiting it; guess NVMe HDDs would profit from that. > > > > > > We don't need that because we don't allocate enough tags to potentially > > > wrap the tail past the head. If you can allocate a tag, the queue is not > > > full. And convesely, no tag == queue full. 
> > > > > It's not a problem on our side. > > It's a problem on the target/controller side. > > The target/controller might have a need to throttle I/O (due to QoS settings > > or competing resources from other hosts), but currently no means of > > signalling that to the host. > > Which, incidentally, is the underlying reason for the DNR handling > > discussion we had; NetApp tried to model QoS by sending "Namespace not > > ready" without the DNR bit set, which of course is a totally different > > use-case as the typical 'Namespace not ready' response we get (with the DNR > > bit set) when a namespace was unmapped. > > > > And that is where SQ head pointer updates comes in; it would allow the > > controller to signal back to the host that it should hold off sending I/O > > for a bit. > > So this could / might be used for NVMe HDDs, too, which also might have a > > need to signal back to the host that I/Os should be throttled... > > Okay, I see. I think this needs a new nvme AER notice as Martin > suggested. The desired host behavior is simiilar to what we do with a > "firmware activation notice" where we temporarily quiesce new requests > and reset IO timeouts for previously dispatched requests. Perhaps tie > this to the CSTS.PP register as well. Hi all- With regards to our discussion on queue depths, it's common knowledge that an HDD chooses commands from its internal command queue to optimize performance. The HDD looks at things like the current actuator position, current media rotational position, power constraints, command age, etc to choose the best next command to service. A large number of commands in the queue gives the HDD a better selection of commands from which to choose to maximize throughput/IOPS/etc but at the expense of the added latency due to commands sitting in the queue. 
NVMe doesn't allow us to pull commands randomly from the SQ, so the HDD should attempt to fill its internal queue from the various SQs, according to the SQ servicing policy, so it can have a large number of commands to choose from for its internal command processing optimization. It seems to me that the host would want to limit the total number of outstanding commands to an NVMe HDD for the same latency reasons they are frequently limited today. If we assume the HDD would have a relatively deep (perhaps 256) internal queue (which is deeper than most latency-sensitive customers would want to run) then the SQ would be empty most of the time. To me it seems that only when the host's number of outstanding commands fell below the threshold should the host add commands to the SQ. Since the drive internal command queue would not be full, the HDD would immediately pull the commands from the SQ and put them into its internal command queue. I can't think of any advantage to running a deep SQ in this scenario. When the host requests to delete a SQ the HDD should abort the commands it is holding in its internal queue that came from the SQ to be deleted, then delete the SQ. Best regards, -Tim -- Tim Walker Product Design Systems Engineering, Seagate Technology (303) 775-3770 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-18 15:54 ` Tim Walker @ 2020-02-18 17:41 ` Keith Busch 2020-02-18 17:52 ` James Smart 2020-02-19 1:31 ` Ming Lei 0 siblings, 2 replies; 32+ messages in thread From: Keith Busch @ 2020-02-18 17:41 UTC (permalink / raw) To: Tim Walker Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: > With regards to our discussion on queue depths, it's common knowledge > that an HDD choses commands from its internal command queue to > optimize performance. The HDD looks at things like the current > actuator position, current media rotational position, power > constraints, command age, etc to choose the best next command to > service. A large number of commands in the queue gives the HDD a > better selection of commands from which to choose to maximize > throughput/IOPS/etc but at the expense of the added latency due to > commands sitting in the queue. > > NVMe doesn't allow us to pull commands randomly from the SQ, so the > HDD should attempt to fill its internal queue from the various SQs, > according to the SQ servicing policy, so it can have a large number of > commands to choose from for its internal command processing > optimization. You don't need multiple queues for that. While the device has to fifo fetch commands from a host's submission queue, it may reorder their execution and completion however it wants, which you can do with a single queue. > It seems to me that the host would want to limit the total number of > outstanding commands to an NVMe HDD The host shouldn't have to decide on limits. NVMe lets the device report its queue count and depth. It should be the device's responsibility to report appropriate values that maximize iops within your latency limits, and the host will react accordingly. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-18 17:41 ` Keith Busch @ 2020-02-18 17:52 ` James Smart 2020-02-19 1:31 ` Ming Lei 1 sibling, 0 replies; 32+ messages in thread From: James Smart @ 2020-02-18 17:52 UTC (permalink / raw) To: Keith Busch, Tim Walker Cc: Hannes Reinecke, Martin K. Petersen, Damien Le Moal, Ming Lei, linux-block, linux-scsi, linux-nvme On 2/18/2020 9:41 AM, Keith Busch wrote: > On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: >> With regards to our discussion on queue depths, it's common knowledge >> that an HDD choses commands from its internal command queue to >> optimize performance. The HDD looks at things like the current >> actuator position, current media rotational position, power >> constraints, command age, etc to choose the best next command to >> service. A large number of commands in the queue gives the HDD a >> better selection of commands from which to choose to maximize >> throughput/IOPS/etc but at the expense of the added latency due to >> commands sitting in the queue. >> >> NVMe doesn't allow us to pull commands randomly from the SQ, so the >> HDD should attempt to fill its internal queue from the various SQs, >> according to the SQ servicing policy, so it can have a large number of >> commands to choose from for its internal command processing >> optimization. > You don't need multiple queues for that. While the device has to fifo > fetch commands from a host's submission queue, it may reorder their > executuion and completion however it wants, which you can do with a > single queue. > >> It seems to me that the host would want to limit the total number of >> outstanding commands to an NVMe HDD > The host shouldn't have to decide on limits. NVMe lets the device report > it's queue count and depth. It should the device's responsibility to > report appropriate values that maximize iops within your latency limits, > and the host will react accordingly. +1 on Keith's comments. 
Also, if a namespace depth limit needs to be introduced, it should come via the NVMe committee and then be reported back as device attributes. Many of SCSI's problems came from areas the protocol didn't solve, especially in multi-initiator environments, which forced all kinds of requirements and mish-mashes onto host stacks and target behaviors. None of that should be repeated. -- james ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-18 17:41 ` Keith Busch 2020-02-18 17:52 ` James Smart @ 2020-02-19 1:31 ` Ming Lei 2020-02-19 1:53 ` Damien Le Moal 1 sibling, 1 reply; 32+ messages in thread From: Ming Lei @ 2020-02-19 1:31 UTC (permalink / raw) To: Keith Busch Cc: Tim Walker, Hannes Reinecke, Martin K. Petersen, Damien Le Moal, linux-block, linux-scsi, linux-nvme On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: > On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: > > With regards to our discussion on queue depths, it's common knowledge > > that an HDD choses commands from its internal command queue to > > optimize performance. The HDD looks at things like the current > > actuator position, current media rotational position, power > > constraints, command age, etc to choose the best next command to > > service. A large number of commands in the queue gives the HDD a > > better selection of commands from which to choose to maximize > > throughput/IOPS/etc but at the expense of the added latency due to > > commands sitting in the queue. > > > > NVMe doesn't allow us to pull commands randomly from the SQ, so the > > HDD should attempt to fill its internal queue from the various SQs, > > according to the SQ servicing policy, so it can have a large number of > > commands to choose from for its internal command processing > > optimization. > > You don't need multiple queues for that. While the device has to fifo > fetch commands from a host's submission queue, it may reorder their > executuion and completion however it wants, which you can do with a > single queue. > > > It seems to me that the host would want to limit the total number of > > outstanding commands to an NVMe HDD > > The host shouldn't have to decide on limits. NVMe lets the device report > it's queue count and depth. It should the device's responsibility to Will NVMe HDD support multiple NS? 
If yes, this queue depth isn't enough, given all NSs share this single host queue depth. > report appropriate values that maximize iops within your latency limits, > and the host will react accordingly. Suppose an NVMe HDD just wants to support a single NS and there is a single queue: if the device just reports one host queue depth, block layer IO sort/merge can only be done when device saturation feedback is provided. So it looks like either an NS queue depth or a per-NS device saturation feedback mechanism is needed; otherwise the NVMe HDD may have to do internal IO sort/merge. Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 1:31 ` Ming Lei @ 2020-02-19 1:53 ` Damien Le Moal 2020-02-19 2:15 ` Ming Lei 0 siblings, 1 reply; 32+ messages in thread From: Damien Le Moal @ 2020-02-19 1:53 UTC (permalink / raw) To: Ming Lei, Keith Busch Cc: Tim Walker, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On 2020/02/19 10:32, Ming Lei wrote: > On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: >> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: >>> With regards to our discussion on queue depths, it's common knowledge >>> that an HDD choses commands from its internal command queue to >>> optimize performance. The HDD looks at things like the current >>> actuator position, current media rotational position, power >>> constraints, command age, etc to choose the best next command to >>> service. A large number of commands in the queue gives the HDD a >>> better selection of commands from which to choose to maximize >>> throughput/IOPS/etc but at the expense of the added latency due to >>> commands sitting in the queue. >>> >>> NVMe doesn't allow us to pull commands randomly from the SQ, so the >>> HDD should attempt to fill its internal queue from the various SQs, >>> according to the SQ servicing policy, so it can have a large number of >>> commands to choose from for its internal command processing >>> optimization. >> >> You don't need multiple queues for that. While the device has to fifo >> fetch commands from a host's submission queue, it may reorder their >> executuion and completion however it wants, which you can do with a >> single queue. >> >>> It seems to me that the host would want to limit the total number of >>> outstanding commands to an NVMe HDD >> >> The host shouldn't have to decide on limits. NVMe lets the device report >> it's queue count and depth. It should the device's responsibility to > > Will NVMe HDD support multiple NS? 
If yes, this queue depth isn't > enough, given all NSs share this single host queue depth. > >> report appropriate values that maximize iops within your latency limits, >> and the host will react accordingly. > > Suppose NVMe HDD just wants to support single NS and there is single queue, > if the device just reports one host queue depth, block layer IO sort/merge > can only be done when there is device saturation feedback provided. > > So, looks either NS queue depth or per-NS device saturation feedback > mechanism is needed, otherwise NVMe HDD may have to do internal IO > sort/merge. SAS and SATA HDDs today already do internal IO reordering and merging, a lot. That is partly why even with "none" set as the scheduler, you can see iops increasing with QD used. But yes, I think you do have a point with the saturation feedback. This may be necessary for better scheduling host-side. > > > Thanks, > Ming > > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 1:53 ` Damien Le Moal @ 2020-02-19 2:15 ` Ming Lei 2020-02-19 2:32 ` Damien Le Moal 0 siblings, 1 reply; 32+ messages in thread From: Ming Lei @ 2020-02-19 2:15 UTC (permalink / raw) To: Damien Le Moal Cc: Keith Busch, Tim Walker, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote: > On 2020/02/19 10:32, Ming Lei wrote: > > On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: > >> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: > >>> With regards to our discussion on queue depths, it's common knowledge > >>> that an HDD choses commands from its internal command queue to > >>> optimize performance. The HDD looks at things like the current > >>> actuator position, current media rotational position, power > >>> constraints, command age, etc to choose the best next command to > >>> service. A large number of commands in the queue gives the HDD a > >>> better selection of commands from which to choose to maximize > >>> throughput/IOPS/etc but at the expense of the added latency due to > >>> commands sitting in the queue. > >>> > >>> NVMe doesn't allow us to pull commands randomly from the SQ, so the > >>> HDD should attempt to fill its internal queue from the various SQs, > >>> according to the SQ servicing policy, so it can have a large number of > >>> commands to choose from for its internal command processing > >>> optimization. > >> > >> You don't need multiple queues for that. While the device has to fifo > >> fetch commands from a host's submission queue, it may reorder their > >> executuion and completion however it wants, which you can do with a > >> single queue. > >> > >>> It seems to me that the host would want to limit the total number of > >>> outstanding commands to an NVMe HDD > >> > >> The host shouldn't have to decide on limits. NVMe lets the device report > >> it's queue count and depth. 
It should the device's responsibility to > > > > Will NVMe HDD support multiple NS? If yes, this queue depth isn't > > enough, given all NSs share this single host queue depth. > > > >> report appropriate values that maximize iops within your latency limits, > >> and the host will react accordingly. > > > > Suppose NVMe HDD just wants to support single NS and there is single queue, > > if the device just reports one host queue depth, block layer IO sort/merge > > can only be done when there is device saturation feedback provided. > > > > So, looks either NS queue depth or per-NS device saturation feedback > > mechanism is needed, otherwise NVMe HDD may have to do internal IO > > sort/merge. > > SAS and SATA HDDs today already do internal IO reordering and merging, a > lot. That is partly why even with "none" set as the scheduler, you can see > iops increasing with QD used. That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs from the beginning, but Tim said no, see: https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84 It could be cheap for an NVMe HDD to do that, given all queues/requests just stay in the system's RAM. Also I guess internal IO sort/merge may not be good enough compared with a software implementation: 1) device internal queue depth is often low, so the participating requests won't be numerous enough, whereas the software scheduler's queue depth is often twice the device queue depth. 2) an HDD doesn't have context info, so when concurrent IOs are run from multiple contexts, HDD-internal reorder/merge can't work well enough. blk-mq doesn't address this case either; however, the legacy IO path did consider it via IO-context (IOC) batching. Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 2:15 ` Ming Lei @ 2020-02-19 2:32 ` Damien Le Moal 2020-02-19 2:56 ` Tim Walker 0 siblings, 1 reply; 32+ messages in thread From: Damien Le Moal @ 2020-02-19 2:32 UTC (permalink / raw) To: Ming Lei Cc: Keith Busch, Tim Walker, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On 2020/02/19 11:16, Ming Lei wrote: > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote: >> On 2020/02/19 10:32, Ming Lei wrote: >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: >>>>> With regards to our discussion on queue depths, it's common knowledge >>>>> that an HDD choses commands from its internal command queue to >>>>> optimize performance. The HDD looks at things like the current >>>>> actuator position, current media rotational position, power >>>>> constraints, command age, etc to choose the best next command to >>>>> service. A large number of commands in the queue gives the HDD a >>>>> better selection of commands from which to choose to maximize >>>>> throughput/IOPS/etc but at the expense of the added latency due to >>>>> commands sitting in the queue. >>>>> >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the >>>>> HDD should attempt to fill its internal queue from the various SQs, >>>>> according to the SQ servicing policy, so it can have a large number of >>>>> commands to choose from for its internal command processing >>>>> optimization. >>>> >>>> You don't need multiple queues for that. While the device has to fifo >>>> fetch commands from a host's submission queue, it may reorder their >>>> executuion and completion however it wants, which you can do with a >>>> single queue. >>>> >>>>> It seems to me that the host would want to limit the total number of >>>>> outstanding commands to an NVMe HDD >>>> >>>> The host shouldn't have to decide on limits. 
NVMe lets the device report >>>> it's queue count and depth. It should the device's responsibility to >>> >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't >>> enough, given all NSs share this single host queue depth. >>> >>>> report appropriate values that maximize iops within your latency limits, >>>> and the host will react accordingly. >>> >>> Suppose NVMe HDD just wants to support single NS and there is single queue, >>> if the device just reports one host queue depth, block layer IO sort/merge >>> can only be done when there is device saturation feedback provided. >>> >>> So, looks either NS queue depth or per-NS device saturation feedback >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO >>> sort/merge. >> >> SAS and SATA HDDs today already do internal IO reordering and merging, a >> lot. That is partly why even with "none" set as the scheduler, you can see >> iops increasing with QD used. > > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs > from the beginning, but Tim said no, see: > > https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84 > > It could be cheap for NVMe HDD to do that, given all queues/requests > just stay in system's RAM. Yes. Keith also commented on that. SQEs have to be removed in order from the SQ, but that does not mean that the disk has to execute them in order. So I do not think this is an issue. > Also I guess internal IO sort/merge may not be good enough compared with > SW's implementation: > > 1) device internal queue depth is often low, and the participated requests won't > be enough many, but SW's scheduler queue depth is often 2 times of > device queue depth. Drive internal QD can actually be quite large to accommodate internal housekeeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc) while simultaneously executing incoming user commands. 
These internal tasks are often one of the reasons for SAS drives to return QF at different host-seen QDs, and why in the end NVMe may need a mechanism similar to task set full notifications in SAS. > 2) HDD drive doesn't have context info, so when concurrent IOs are run from > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq > doesn't address this case too, however the legacy IO path does consider that > via IOC batch.> > > Thanks, > Ming > > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 2:32 ` Damien Le Moal @ 2020-02-19 2:56 ` Tim Walker 2020-02-19 16:28 ` Tim Walker 0 siblings, 1 reply; 32+ messages in thread From: Tim Walker @ 2020-02-19 2:56 UTC (permalink / raw) To: Damien Le Moal Cc: Ming Lei, Keith Busch, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > > On 2020/02/19 11:16, Ming Lei wrote: > > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote: > >> On 2020/02/19 10:32, Ming Lei wrote: > >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: > >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: > >>>>> With regards to our discussion on queue depths, it's common knowledge > >>>>> that an HDD choses commands from its internal command queue to > >>>>> optimize performance. The HDD looks at things like the current > >>>>> actuator position, current media rotational position, power > >>>>> constraints, command age, etc to choose the best next command to > >>>>> service. A large number of commands in the queue gives the HDD a > >>>>> better selection of commands from which to choose to maximize > >>>>> throughput/IOPS/etc but at the expense of the added latency due to > >>>>> commands sitting in the queue. > >>>>> > >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the > >>>>> HDD should attempt to fill its internal queue from the various SQs, > >>>>> according to the SQ servicing policy, so it can have a large number of > >>>>> commands to choose from for its internal command processing > >>>>> optimization. > >>>> > >>>> You don't need multiple queues for that. While the device has to fifo > >>>> fetch commands from a host's submission queue, it may reorder their > >>>> executuion and completion however it wants, which you can do with a > >>>> single queue. 
> >>>> > >>>>> It seems to me that the host would want to limit the total number of > >>>>> outstanding commands to an NVMe HDD > >>>> > >>>> The host shouldn't have to decide on limits. NVMe lets the device report > >>>> it's queue count and depth. It should the device's responsibility to > >>> > >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't > >>> enough, given all NSs share this single host queue depth. > >>> > >>>> report appropriate values that maximize iops within your latency limits, > >>>> and the host will react accordingly. > >>> > >>> Suppose NVMe HDD just wants to support single NS and there is single queue, > >>> if the device just reports one host queue depth, block layer IO sort/merge > >>> can only be done when there is device saturation feedback provided. > >>> > >>> So, looks either NS queue depth or per-NS device saturation feedback > >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO > >>> sort/merge. > >> > >> SAS and SATA HDDs today already do internal IO reordering and merging, a > >> lot. That is partly why even with "none" set as the scheduler, you can see > >> iops increasing with QD used. > > > > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs > > from the beginning, but Tim said no, see: > > > > https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84 > > > > It could be cheap for NVMe HDD to do that, given all queues/requests > > just stay in system's RAM. > > Yes. Keith also commented on that. SQEs have to be removed in order from > the SQ, but that does not mean that the disk has to execute them in order. > So I do not think this is an issue. 
> > > Also I guess internal IO sort/merge may not be good enough compared with > > SW's implementation: > > > > 1) device internal queue depth is often low, and the participating requests won't > > be numerous enough, but SW's scheduler queue depth is often 2 times of > > device queue depth. > > Drive internal QD can actually be quite large to accommodate internal > house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc) > while simultaneously executing incoming user commands. These internal tasks > are often one of the reasons for SAS drives to return QF at different > host-seen QD, and why in the end NVMe may need a mechanism similar to task > set full notifications in SAS. > > > 2) HDD drive doesn't have context info, so when concurrent IOs are run from > > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq > > doesn't address this case either, however the legacy IO path does consider that > > via IOC batch. > > > > > Thanks, > > Ming > > > > > > > -- > Damien Le Moal > Western Digital Research [sorry for the duplicate mailing - forgot about plain text!] Hi Damien- You're right. The HDD needs those commands in its internal queue to sort and merge them, because commands are pulled from the SQ strictly FIFO, which precludes any sorting or merging within the SQ. That being said, HDDs still work better with a good kernel scheduler to group commands into HDD-friendly sequences. So it would be helpful if we could devise a method to help the kernel sort/merge before loading the commands into the SQ, just as we do with SCSI today. Ming: Regarding sorting across SQs, I mean to say these two things: 1. The HDD would not try to reach up into the SQs and choose the next best command. I understand the SQs are FIFO, so that is why an NVMe HDD has to pull them into our internal queue for sorting and merging. 
Our internal queue has historically been more than adequate (SAS-256, SATA-32) to provide pretty good optimization without excessive command latencies. 2. Also, I know NVMe specifically does not imply any completion order within the SQ, but an NVMe HDD will likely honor the submission order within any single SQ, but not try to correlate across multiple SQs (if the host sets up multiple SQs). I believe this is different from SSD. I think of this as being left over from SAS/SATA where we manage overlapped commands by command order-of-arrival. Many HDD customers spend a lot of time balancing workload and queue depth to reach the IOPS/throughput targets they desire. It's not straightforward since HDD command completion time is extremely workload-sensitive. Some more sophisticated customers dynamically control queue depth to keep all the command latencies within QoS. But that requires extensive workload characterization, plus knowledge of the upcoming workload, both of which make it difficult for the HDD to auto-tune its own queue depth. I'm really interested to have this queue approach discussion at the conference - there seem to be areas where we can improve on legacy behavior. In all these scenarios, a single SQ/CQ pair is certainly more than adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only assist driver or system architects to separate traffic classes into separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will support it if desired. -Tim -- Tim Walker Product Design Systems Engineering, Seagate Technology (303) 775-3770 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 2:56 ` Tim Walker @ 2020-02-19 16:28 ` Tim Walker 2020-02-19 20:50 ` Keith Busch 0 siblings, 1 reply; 32+ messages in thread From: Tim Walker @ 2020-02-19 16:28 UTC (permalink / raw) To: Damien Le Moal Cc: Ming Lei, Keith Busch, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On Tue, Feb 18, 2020 at 9:56 PM Tim Walker <tim.t.walker@seagate.com> wrote: > > On Tue, Feb 18, 2020 at 9:32 PM Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > > > > On 2020/02/19 11:16, Ming Lei wrote: > > > On Wed, Feb 19, 2020 at 01:53:53AM +0000, Damien Le Moal wrote: > > >> On 2020/02/19 10:32, Ming Lei wrote: > > >>> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote: > > >>>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote: > > >>>>> With regards to our discussion on queue depths, it's common knowledge > > >>>>> that an HDD chooses commands from its internal command queue to > > >>>>> optimize performance. The HDD looks at things like the current > > >>>>> actuator position, current media rotational position, power > > >>>>> constraints, command age, etc to choose the best next command to > > >>>>> service. A large number of commands in the queue gives the HDD a > > >>>>> better selection of commands from which to choose to maximize > > >>>>> throughput/IOPS/etc but at the expense of the added latency due to > > >>>>> commands sitting in the queue. > > >>>>> > > >>>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the > > >>>>> HDD should attempt to fill its internal queue from the various SQs, > > >>>>> according to the SQ servicing policy, so it can have a large number of > > >>>>> commands to choose from for its internal command processing > > >>>>> optimization. > > >>>> > > >>>> You don't need multiple queues for that. 
While the device has to FIFO > > >>>> fetch commands from a host's submission queue, it may reorder their > > >>>> execution and completion however it wants, which you can do with a > > >>>> single queue. > > >>>> > > >>>>> It seems to me that the host would want to limit the total number of > > >>>>> outstanding commands to an NVMe HDD > > >>>> > > >>>> The host shouldn't have to decide on limits. NVMe lets the device report > > >>>> its queue count and depth. It should be the device's responsibility to > > >>> > > >>> Will NVMe HDD support multiple NS? If yes, this queue depth isn't > > >>> enough, given all NSs share this single host queue depth. > > >>> > > >>>> report appropriate values that maximize iops within your latency limits, > > >>>> and the host will react accordingly. > > >>> > > >>> Suppose NVMe HDD just wants to support single NS and there is single queue, > > >>> if the device just reports one host queue depth, block layer IO sort/merge > > >>> can only be done when there is device saturation feedback provided. > > >>> > > >>> So, looks either NS queue depth or per-NS device saturation feedback > > >>> mechanism is needed, otherwise NVMe HDD may have to do internal IO > > >>> sort/merge. > > >> > > >> SAS and SATA HDDs today already do internal IO reordering and merging, a > > >> lot. That is partly why even with "none" set as the scheduler, you can see > > >> iops increasing with QD used. 
> > > > > > That is why I asked if NVMe HDD will attempt to sort/merge IO among SQs > > > from the beginning, but Tim said no, see: > > > > > > https://lore.kernel.org/linux-block/20200212215251.GA25314@ming.t460p/T/#m2d0eff5ef8fcaced0f304180e571bb8fefc72e84 > > > > > > It could be cheap for NVMe HDD to do that, given all queues/requests > > > just stay in system's RAM. > > > > Yes. Keith also commented on that. SQEs have to be removed in order from > > the SQ, but that does not mean that the disk has to execute them in order. > > So I do not think this is an issue. > > > > > Also I guess internal IO sort/merge may not be good enough compared with > > > SW's implementation: > > > > > > 1) device internal queue depth is often low, and the participating requests won't > > > be numerous enough, but SW's scheduler queue depth is often 2 times of > > > device queue depth. > > > > Drive internal QD can actually be quite large to accommodate internal > > house-keeping commands (e.g. ATI/FTI refreshes, media cache flushes, etc) > > while simultaneously executing incoming user commands. These internal tasks > > are often one of the reasons for SAS drives to return QF at different > > host-seen QD, and why in the end NVMe may need a mechanism similar to task > > set full notifications in SAS. > > > > > 2) HDD drive doesn't have context info, so when concurrent IOs are run from > > > multiple contexts, HDD internal reorder/merge can't work well enough. blk-mq > > > doesn't address this case either, however the legacy IO path does consider that > > > via IOC batch. > > > > > > Thanks, > > > Ming > > > > > > > > > > > > -- > > Damien Le Moal > > Western Digital Research > [sorry for the duplicate mailing - forgot about plain text!] 
> > Hi Damien- > > You're right. The HDD needs those commands in its internal queue to > sort and merge them, because commands are pulled from the SQ strictly > FIFO, which precludes any sorting or merging within the SQ. That being > said, HDDs still work better with a good kernel scheduler to group > commands into HDD-friendly sequences. So it would be helpful if we > could devise a method to help the kernel sort/merge before loading the > commands into the SQ, just as we do with SCSI today. > > Ming: > Regarding sorting across SQs, I mean to say these two things: > 1. The HDD would not try to reach up into the SQs and choose the next > best command. I understand the SQs are FIFO, so that is why an NVMe HDD > has to pull them into our internal queue for sorting and merging. Our > internal queue has historically been more than adequate (SAS-256, > SATA-32) to provide pretty good optimization without excessive command > latencies. > > 2. Also, I know NVMe specifically does not imply any completion order > within the SQ, but an NVMe HDD will likely honor the submission order > within any single SQ, but not try to correlate across multiple SQs > (if the host sets up multiple SQs). I believe this is different from > SSD. I think of this as being left over from SAS/SATA where we manage > overlapped commands by command order-of-arrival. > > Many HDD customers spend a lot of time balancing workload and queue > depth to reach the IOPS/throughput targets they desire. It's not > straightforward since HDD command completion time is extremely > workload-sensitive. Some more sophisticated customers dynamically > control queue depth to keep all the command latencies within QoS. But > that requires extensive workload characterization, plus knowledge of > the upcoming workload, both of which make it difficult for the HDD to > auto-tune its own queue depth. 
I'm really interested to have this > queue approach discussion at the conference - there seem to be areas > where we can improve on legacy behavior. > > In all these scenarios, a single SQ/CQ pair is certainly more than > adequate to keep an HDD busy. Multiple SQ/CQ pairs probably only > assist driver or system architects to separate traffic classes into > separate SQs. At any rate, the HDD won't mandate >1 SQ, but it will > support it if desired. > > -Tim > -- > Tim Walker > Product Design Systems Engineering, Seagate Technology > (303) 775-3770 Hi Ming- >Will NVMe HDD support multiple NS? At this point it doesn't seem like an NVMe HDD could benefit from multiple namespaces. However, a multiple actuator HDD can present the actuators as independent channels that are capable of independent media access. It seems that we would want them on separate namespaces, or sets. I'd like to discuss the pros and cons of each, and which would be better for system integration. Best regards, -Tim -- Tim Walker Product Design Systems Engineering, Seagate Technology (303) 775-3770 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-19 16:28 ` Tim Walker @ 2020-02-19 20:50 ` Keith Busch 0 siblings, 0 replies; 32+ messages in thread From: Keith Busch @ 2020-02-19 20:50 UTC (permalink / raw) To: Tim Walker Cc: Damien Le Moal, Ming Lei, Hannes Reinecke, Martin K. Petersen, linux-block, linux-scsi, linux-nvme On Wed, Feb 19, 2020 at 11:28:46AM -0500, Tim Walker wrote: > Hi Ming- > > >Will NVMe HDD support multiple NS? > > At this point it doesn't seem like an NVMe HDD could benefit from > multiple namespaces. However, a multiple actuator HDD can present the > actuators as independent channels that are capable of independent > media access. It seems that we would want them on separate namespaces, > or sets. I'd like to discuss the pros and cons of each, and which > would be better for system integration. If NVM Sets are not implemented, the host is not aware of resource separation for each namespace. If you implement NVM Sets, two namespaces in different sets tell the host that the device has a backend resource partition (logical or physical) such that processing commands for one namespace will not affect the processing capabilities of the other. Sets define "noisy neighbor" domains. Dual actuators sound like you have independent resources appropriate to report as NVM Sets, but that may depend on other implementation details. The NVMe specification does not go far enough, though, since IO queues are always a shared resource. The host may implement a different IO queue policy such that they're not shared (you'd need at least one IO queue per set), but we don't currently do that. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-13 3:02 ` Martin K. Petersen 2020-02-13 3:12 ` Tim Walker @ 2020-02-14 0:35 ` Ming Lei 1 sibling, 0 replies; 32+ messages in thread From: Ming Lei @ 2020-02-14 0:35 UTC (permalink / raw) To: Martin K. Petersen Cc: Damien Le Moal, linux-block, linux-nvme, Tim Walker, linux-scsi On Wed, Feb 12, 2020 at 10:02:08PM -0500, Martin K. Petersen wrote: > > Damien, > > > Exposing an HDD through multiple-queues each with a high queue depth > > is simply asking for trouble. Commands will end up spending so much > > time sitting in the queues that they will time out. > > Yep! > > > This can already be observed with the smartpqi SAS HBA which exposes > > single drives as multiqueue block devices with high queue depth. > > Exercising these drives heavily leads to thousands of commands being > > queued and to timeouts. It is fairly easy to trigger this without a > > manual change to the QD. This has been on my to-do list of fixes for some > > time now (lacking time to do it). > > Controllers that queue internally are very susceptible to application or > filesystem timeouts when drives are struggling to keep up. > > > NVMe HDDs need to have an interface setup that matches their speed, that > > is, something like a SAS interface: *single* queue pair with a max QD > > of 256 or less depending on what the drive can take. There is no > > TASK_SET_FULL notification on NVMe, so throttling has to come from the > > max QD of the SQ, which the drive will advertise to the host. > > At the very minimum we'll need low queue depths. But I have my doubts > whether we can make this work well enough without some kind of TASK SET > FULL style AER to throttle the I/O. Looks like 32 or so works fine for HDD, and 128 is good enough for SSD. And this number should drive enough parallelism; meanwhile timeouts can be avoided most of the time if the timeout value is not set too small. But SCSI still allows adjusting the queue depth via sysfs. 
Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [LSF/MM/BPF TOPIC] NVMe HDD 2020-02-11 19:01 ` Tim Walker 2020-02-12 1:47 ` Damien Le Moal @ 2020-02-12 21:52 ` Ming Lei 1 sibling, 0 replies; 32+ messages in thread From: Ming Lei @ 2020-02-12 21:52 UTC (permalink / raw) To: Tim Walker; +Cc: linux-block, linux-scsi, linux-nvme On Tue, Feb 11, 2020 at 02:01:18PM -0500, Tim Walker wrote: > On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote: > > > Background: > > > > > > NVMe specification has hardened over the decade and now NVMe devices > > > are well integrated into our customers’ systems. As we look forward, > > > moving HDDs to the NVMe command set eliminates the SAS IOC and driver > > > stack, consolidating on a single access method for rotational and > > > static storage technologies. PCIe-NVMe offers near-SATA interface > > > costs, features and performance suitable for high-cap HDDs, and > > > optimal interoperability for storage automation, tiering, and > > > management. We will share some early conceptual results and proposed > > > salient design goals and challenges surrounding an NVMe HDD. > > > > HDD performance is very sensitive to IO order. Could you provide some > > background info about NVMe HDD? Such as: > > > > - number of hw queues > > - hw queue depth > > - will NVMe sort/merge IO among all SQs or not? > > > > > > > > > > > Discussion Proposal: > > > > > > We’d like to share our views and solicit input on: > > > > > > -What Linux storage stack assumptions do we need to be aware of as we > > > develop these devices with drastically different performance > > > characteristics than traditional NAND? For example, what scheduler or > > > device driver level changes will be needed to integrate NVMe HDDs? > > > > IO merge is often important for HDD. IO merge is usually triggered when > > .queue_rq() returns STS_RESOURCE, so far this condition won't be > > triggered for NVMe SSD. 
> > > > Also blk-mq kills BDI queue congestion and ioc batching, and causes > > writeback performance regression[1][2]. > > > > What I am thinking is whether we need to switch to an independent IO > > path for handling SSD and HDD IO, given the two mediums are so > > different from a performance viewpoint. > > > > [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/ > > [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/ > > > > > > Thanks, > > Ming > > > > I would expect the drive would support a reasonable number of queues > and a relatively deep queue depth, more in line with NVMe practices > than SAS HDD's typical 128. But it probably doesn't make sense to > queue up thousands of commands on something as slow as an HDD, and > many customers keep queues < 32 for latency management. MQ & deep queue depth will cause trouble for HDD, as Damien mentioned: IO timeouts may be caused. Then it looks like you need to add per-ns queue depth, just like what sdev->device_busy does for avoiding IO timeout. On the other hand, with per-ns queue depth, you may prevent IO from being submitted to NVMe when this ns is saturated, and then the block layer's IO merge can be triggered. > > Merge and elevator are important to HDD performance. I don't believe > NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort > within a SQ without driving large differences between SSD & HDD data > paths? If NVMe doesn't sort/merge across SQs, it may be better to just use a single queue for HDD. 
Otherwise, it is easy to break IO order & merge. Someone even complains that sequential IO becomes discontinuous on NVMe (SSD) when arbitration burst is less than IO queue depth. It is said that fio performance is hurt, but I don't understand how that can happen on SSD. Thanks, Ming ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2020-02-19 20:51 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-02-10 19:20 [LSF/MM/BPF TOPIC] NVMe HDD Tim Walker 2020-02-10 20:43 ` Keith Busch 2020-02-10 22:25 ` Finn Thain 2020-02-11 12:28 ` Ming Lei 2020-02-11 19:01 ` Tim Walker 2020-02-12 1:47 ` Damien Le Moal 2020-02-12 22:03 ` Ming Lei 2020-02-13 2:40 ` Damien Le Moal 2020-02-13 7:53 ` Ming Lei 2020-02-13 8:24 ` Damien Le Moal 2020-02-13 8:34 ` Ming Lei 2020-02-13 16:30 ` Keith Busch 2020-02-14 0:40 ` Ming Lei 2020-02-13 3:02 ` Martin K. Petersen 2020-02-13 3:12 ` Tim Walker 2020-02-13 4:17 ` Martin K. Petersen 2020-02-14 7:32 ` Hannes Reinecke 2020-02-14 14:40 ` Keith Busch 2020-02-14 16:04 ` Hannes Reinecke 2020-02-14 17:05 ` Keith Busch 2020-02-18 15:54 ` Tim Walker 2020-02-18 17:41 ` Keith Busch 2020-02-18 17:52 ` James Smart 2020-02-19 1:31 ` Ming Lei 2020-02-19 1:53 ` Damien Le Moal 2020-02-19 2:15 ` Ming Lei 2020-02-19 2:32 ` Damien Le Moal 2020-02-19 2:56 ` Tim Walker 2020-02-19 16:28 ` Tim Walker 2020-02-19 20:50 ` Keith Busch 2020-02-14 0:35 ` Ming Lei 2020-02-12 21:52 ` Ming Lei