From: Ming Lei <ming.lei@redhat.com>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>,
	"Ewan D. Milne" <emilne@redhat.com>,
	Hannes Reinecke <hare@suse.de>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	"James E . J . Bottomley" <jejb@linux.ibm.com>,
	linux-scsi@vger.kernel.org,
	Sathya Prakash <sathya.prakash@broadcom.com>,
	Chaitra P B <chaitra.basappa@broadcom.com>,
	Suganath Prabu Subramani  <suganath-prabu.subramani@broadcom.com>,
	Kashyap Desai <kashyap.desai@broadcom.com>,
	Sumit Saxena <sumit.saxena@broadcom.com>,
	Shivasharan S <shivasharan.srikanteshwara@broadcom.com>,
	Christoph Hellwig <hch@lst.de>,
	Bart Van Assche <bart.vanassche@wdc.com>
Subject: Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
Date: Fri, 22 Nov 2019 11:24:32 +0800	[thread overview]
Message-ID: <20191122032432.GB903@ming.t460p> (raw)
In-Reply-To: <yq1pnhkbopi.fsf@oracle.com>

Hi Martin,

On Thu, Nov 21, 2019 at 09:59:53PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> > for fairness among LUNs.
> 
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.

blk-mq already puts a limit on each LUN: the allowed depth is
host_queue_depth / nr_active_LUNs, see hctx_may_queue().
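
Just to illustrate what I mean, here is a simplified sketch of that
fair-share check (not the exact hctx_may_queue() code, which also deals
with the shared-tag flags and reserved tags):

/*
 * Simplified sketch of the blk-mq fair-share limit: when the host tag
 * set is shared, each active queue (roughly, each active LUN) is
 * allowed about host_depth / nr_active_queues tags, with a small floor
 * so a queue is never starved completely.
 */
static bool lun_may_queue(unsigned int host_depth,
			  unsigned int nr_active_queues,
			  unsigned int lun_inflight)
{
	unsigned int depth;

	if (!nr_active_queues)
		return true;

	depth = host_depth / nr_active_queues;
	if (depth < 4)
		depth = 4;	/* always leave a few tags per LUN */

	return lun_inflight < depth;
}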

This approach seems to work for NVMe, which is why I am trying to bypass
.device_busy for SSDs: it is too expensive on fast storage. Hannes even
wants to kill it completely.
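
For reference, the kind of change I mean looks roughly like this
(heavily simplified; the real scsi_dev_queue_ready() also handles the
blocked/starved cases, and the non-rotational test is just one way to
detect an SSD):

/*
 * Sketch only: skip the per-LUN device_busy accounting for
 * non-rotational devices, so the hot path avoids bouncing the shared
 * atomic counter between CPUs.  Rotational devices keep the existing
 * queue-depth throttling.
 */
static bool scsi_dev_queue_ready(struct request_queue *q,
				 struct scsi_device *sdev)
{
	unsigned int busy;

	if (blk_queue_nonrot(q))
		return true;	/* rely on blk-mq tag fair sharing */

	busy = atomic_inc_return(&sdev->device_busy) - 1;
	if (busy >= sdev->queue_depth) {
		atomic_dec(&sdev->device_busy);
		return false;
	}

	return true;
}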

> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.

I understand that SATA's host queue depth is set to 32.

But a SAS HBA's queue depth is often large, so do we rely on .device_busy
to throttle requests to SAS devices?

> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?

For SSDs, I guess it might be fine, given that NVMe usually sets the
per-hw-queue depth to 1023. That means the number of concurrent requests
can be as high as 1023 * nr_hw_queues for a single namespace.

> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.

OK, one conservative approach may be to bypass .device_busy for SSDs
only on certain high-end HBAs.

Or maybe we can wire sdev->queue_depth up to the block layer's scheduler
queue depth? One issue is that sdev->queue_depth may be updated at
runtime.
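
Something like the following is what I have in mind (purely a sketch:
scsi_change_queue_depth() and blk_set_queue_depth() exist today, but the
scsi_sync_sched_depth() helper and the blk-mq resize hook it would need
are made up for illustration):

/*
 * Hypothetical helper: push sdev->queue_depth into the blk-mq scheduler
 * tag depth, so the elevator never dispatches more requests to the LUN
 * than it advertises.  blk-mq would need to export a resize hook for
 * this; none exists today.
 */
static void scsi_sync_sched_depth(struct scsi_device *sdev)
{
	struct request_queue *q = sdev->request_queue;

	if (q)
		blk_mq_resize_sched_tags(q, sdev->queue_depth); /* made up */
}

int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
{
	if (depth > 0) {
		sdev->queue_depth = depth;
		wmb();
	}

	if (sdev->request_queue)
		blk_set_queue_depth(sdev->request_queue, depth);

	/* keep the scheduler depth in sync every time it changes */
	scsi_sync_sched_depth(sdev);

	return sdev->queue_depth;
}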

Thanks,
Ming


Thread overview: 38+ messages
2019-11-18 10:31 [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD Ming Lei
2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
2019-11-20  9:54   ` Hannes Reinecke
2019-11-26  3:12     ` Kashyap Desai
2019-11-26  3:37       ` Ming Lei
2019-12-05 10:32         ` Kashyap Desai
2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
2019-11-20  9:55   ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
2019-11-20  9:59   ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
2019-11-20 10:05   ` Hannes Reinecke
2019-11-20 17:00     ` Ewan D. Milne
2019-11-20 20:56       ` Bart Van Assche
2019-11-20 21:36         ` Ewan D. Milne
2019-11-22  2:25           ` Martin K. Petersen
2019-11-21  1:07         ` Ming Lei
2019-11-22  2:59           ` Martin K. Petersen
2019-11-22  3:24             ` Ming Lei [this message]
2019-11-22 16:38             ` Sumanesh Samanta
2019-11-21  0:08       ` Sumanesh Samanta
2019-11-21  0:54       ` Ming Lei
2019-11-21 19:19         ` Ewan D. Milne
2019-11-21  0:53     ` Ming Lei
2019-11-21 15:45       ` Hannes Reinecke
2019-11-22  8:09         ` Ming Lei
2019-11-22 18:14           ` Bart Van Assche
2019-11-22 18:26             ` James Smart
2019-11-22 20:46               ` Bart Van Assche
2019-11-22 22:04                 ` Ming Lei
2019-11-22 22:00             ` Ming Lei
2019-11-25 18:28             ` Ewan D. Milne
2019-11-25 22:14               ` James Smart
2019-11-22  2:18     ` Martin K. Petersen
2019-11-20 21:58 Sumanesh Samanta
2019-11-21  1:21 ` Ming Lei
2019-11-21  1:50   ` Sumanesh Samanta
2019-11-21  2:23     ` Ming Lei
