Re: [LSF/MM/BPF TOPIC] NVMe HDD

From: Ming Lei <ming.lei@redhat.com>
To: Keith Busch <kbusch@kernel.org>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Damien Le Moal <Damien.LeMoal@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Tim Walker <tim.t.walker@seagate.com>,
	linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] NVMe HDD
Date: Fri, 14 Feb 2020 08:40:56 +0800	[thread overview]
Message-ID: <20200214004056.GC4907@ming.t460p> (raw)
In-Reply-To: <20200213163038.GB7634@redsun51.ssa.fujisawa.hgst.com>

On Fri, Feb 14, 2020 at 01:30:38AM +0900, Keith Busch wrote:
> On Thu, Feb 13, 2020 at 04:34:13PM +0800, Ming Lei wrote:
> > On Thu, Feb 13, 2020 at 08:24:36AM +0000, Damien Le Moal wrote:
> > > Got it. And since queue full will mean no more tags, submission will block
> > > on get_request() and there will be no chance in the elevator to merge
> > > anything (aside from opportunistic merging in plugs), isn't it ?
> > > So I guess NVMe HDDs will need some tuning in this area.
> > 
> > scheduler queue depth is usually 2 times of hw queue depth, so requests
> > ar usually enough for merging.
> > 
> > For NVMe, there isn't ns queue depth, such as scsi's device queue depth,
> > meantime the hw queue depth is big enough, so no chance to trigger merge.
> 
> Most NVMe devices contain a single namespace anyway, so the shared tag
> queue depth is effectively the ns queue depth, and an NVMe HDD should
> advertise queue count and depth capabilities orders of magnitude lower
> than what we're used to with nvme SSDs. That should get merging and
> BLK_STS_DEV_RESOURCE handling to occur as desired, right?

Right.

The advertised queue depth might serve two purposes:

1) reflect the namespace's actual queueing capability, so block layer's merging
is possible

2) avoid timeout caused by too many in-flight IO

Thanks,
Ming