* [PATCH RFC] nvme/fc: sq flow control
@ 2020-02-25 23:59 Hannes Reinecke
  2020-02-26  0:08 ` Sagi Grimberg
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-02-25 23:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Chaitanya Kulkarni, James Smart, linux-nvme,
	Hannes Reinecke, John Meneghini

As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
should implement it to avoid the controller returning a fatal status
error, and to play nicely with controllers that use SQ flow control
to implement QoS.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a19ddb61039d..628397bd5065 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -12,6 +12,7 @@
 
 #include "nvme.h"
 #include "fabrics.h"
+#include "trace.h"
 #include <linux/nvme-fc-driver.h>
 #include <linux/nvme-fc.h>
 #include <scsi/scsi_transport_fc.h>
@@ -34,7 +35,8 @@ struct nvme_fc_queue {
 	size_t			cmnd_capsule_len;
 	u32			qnum;
 	u32			seqno;
-
+	int			sq_head;
+	int			sq_tail;
 	u64			connection_id;
 	atomic_t		csn;
 
@@ -1671,6 +1673,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 				cqe->command_id);
 			goto done;
 		}
+		WRITE_ONCE(queue->sq_head, le16_to_cpu(cqe->sq_head));
+		trace_nvme_sq(rq, cqe->sq_head, queue->sq_tail);
 		result = cqe->result;
 		status = cqe->status;
 		break;
@@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 	freq->sg_cnt = 0;
 }
 
+static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int incr)
+{
+	int old_sqtl, new_sqtl;
+
+	do {
+		old_sqtl = queue->sq_tail;
+		new_sqtl = (old_sqtl + incr) % queue->ctrl->ctrl.sqsize;
+	} while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
+		 old_sqtl);
+	return new_sqtl;
+}
+
 /*
  * In FC, the queue is a logical thing. At transport connect, the target
  * creates its "queue" and returns a handle that is to be given to the
@@ -2219,6 +2235,14 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	if (!nvme_fc_ctrl_get(ctrl))
 		return BLK_STS_IOERR;
 
+	if (!ctrl->ctrl.opts->disable_sqflow) {
+		if (nvme_fc_update_sq_tail(queue, 1) ==
+		    READ_ONCE(queue->sq_head)) {
+			nvme_fc_update_sq_tail(queue, -1);
+			return BLK_STS_RESOURCE;
+		}
+	}
+
 	/* format the FC-NVME CMD IU and fcp_req */
 	cmdiu->connection_id = cpu_to_be64(queue->connection_id);
 	cmdiu->data_len = cpu_to_be32(data_len);
@@ -2284,6 +2308,9 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 					queue->lldd_handle, &op->fcp_req);
 
 	if (ret) {
+		if (!ctrl->ctrl.opts->disable_sqflow)
+			nvme_fc_update_sq_tail(queue, -1);
+
 		/*
 		 * If the lld fails to send the command is there an issue with
 		 * the csn value?  If the command that fails is the Connect,
-- 
2.16.4


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-25 23:59 [PATCH RFC] nvme/fc: sq flow control Hannes Reinecke
@ 2020-02-26  0:08 ` Sagi Grimberg
  2020-02-26  0:14   ` Hannes Reinecke
  2020-02-26 10:44 ` Martin Wilck
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2020-02-26  0:08 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini


> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
> should implement it to avoid the controller returning a fatal status
> error, and to play nicely with controllers that use SQ flow control
> to implement QoS.

Why is your target setting SQ flow control disable in the discovery
log entry treq field? We are not supposed to do it if the NVM subsystem
doesn't support it.


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26  0:08 ` Sagi Grimberg
@ 2020-02-26  0:14   ` Hannes Reinecke
  2020-02-26  0:38     ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2020-02-26  0:14 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini

On 2/26/20 1:08 AM, Sagi Grimberg wrote:
> 
>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>> should implement it to avoid the controller returning a fatal status
>> error, and to play nicely with controllers that use SQ flow control
>> to implement QoS.
> 
> Why is your target setting SQ flow control disable in the discovery
> log entry treq field? We are not supposed to do it if the NVM subsystem
> doesn't support it.
Did I say this? I wasn't aware that I did imply that in any way.
I'm just using the flag to figure out if sq flow control is disabled or not.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26  0:14   ` Hannes Reinecke
@ 2020-02-26  0:38     ` Sagi Grimberg
  2020-02-27 11:27       ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2020-02-26  0:38 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini


>>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>>> should implement it to avoid the controller returning a fatal status
>>> error, and to play nicely with controllers that use SQ flow control
>>> to implement QoS.
>>
>> Why is your target setting SQ flow control disable in the discovery
>> log entry treq field? We are not supposed to do it if the NVM subsystem
>> doesn't support it.
> Did I say this? I wasn't aware that I did imply that in any way.
> I'm just using the flag to figure out if sq flow control is disabled or 
> not.

OK, I read it backwards.

So in essence blk-mq will just keep sending until sq_head is
incremented... How often does that happen typically?


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-25 23:59 [PATCH RFC] nvme/fc: sq flow control Hannes Reinecke
  2020-02-26  0:08 ` Sagi Grimberg
@ 2020-02-26 10:44 ` Martin Wilck
  2020-02-26 15:47   ` Hannes Reinecke
  2020-02-26 23:45 ` Sagi Grimberg
  2020-03-09 21:59 ` James Smart
  3 siblings, 1 reply; 16+ messages in thread
From: Martin Wilck @ 2020-02-26 10:44 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, John Meneghini, Sagi Grimberg, Chaitanya Kulkarni,
	James Smart

On Wed, 2020-02-26 at 00:59 +0100, Hannes Reinecke wrote:
> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
> should implement it to avoid the controller returning a fatal status
> error, and to play nicely with controllers that use SQ flow control
> to implement QoS.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index a19ddb61039d..628397bd5065 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> ...
> @@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl,
> struct request *rq,
>  	freq->sg_cnt = 0;
>  }
>  
> +static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int
> incr)
> +{
> +	int old_sqtl, new_sqtl;
> +
> +	do {
> +		old_sqtl = queue->sq_tail;
> +		new_sqtl = (old_sqtl + incr) % queue->ctrl-
> >ctrl.sqsize;
> +	} while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
> +		 old_sqtl);

Shouldn't you use READ_ONCE() to update old_sqtl, or better even, use
the return value of cmpxchg()?

Regards
Martin




* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26 10:44 ` Martin Wilck
@ 2020-02-26 15:47   ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-02-26 15:47 UTC (permalink / raw)
  To: Martin Wilck, Keith Busch
  Cc: linux-nvme, John Meneghini, Sagi Grimberg, Chaitanya Kulkarni,
	James Smart

On 2/26/20 11:44 AM, Martin Wilck wrote:
> On Wed, 2020-02-26 at 00:59 +0100, Hannes Reinecke wrote:
>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>> should implement it to avoid the controller returning a fatal status
>> error, and to play nicely with controllers that use SQ flow control
>> to implement QoS.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
>>   1 file changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>> index a19ddb61039d..628397bd5065 100644
>> --- a/drivers/nvme/host/fc.c
>> +++ b/drivers/nvme/host/fc.c
>> ...
>> @@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl,
>> struct request *rq,
>>   	freq->sg_cnt = 0;
>>   }
>>   
>> +static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int
>> incr)
>> +{
>> +	int old_sqtl, new_sqtl;
>> +
>> +	do {
>> +		old_sqtl = queue->sq_tail;
>> +		new_sqtl = (old_sqtl + incr) % queue->ctrl-
>>> ctrl.sqsize;
>> +	} while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
>> +		 old_sqtl);
> 
> Shouldn't you use READ_ONCE() to update old_sqtl, or better even, use
> the return value of cmpxchg()?
> 
Good point.
Will be doing so.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-25 23:59 [PATCH RFC] nvme/fc: sq flow control Hannes Reinecke
  2020-02-26  0:08 ` Sagi Grimberg
  2020-02-26 10:44 ` Martin Wilck
@ 2020-02-26 23:45 ` Sagi Grimberg
  2020-02-27  1:46   ` James Smart
  2020-02-28 10:39   ` Hannes Reinecke
  2020-03-09 21:59 ` James Smart
  3 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2020-02-26 23:45 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini


> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
> should implement it to avoid the controller returning a fatal status
> error, and to play nicely with controllers that use SQ flow control
> to implement QoS.

Hannes,

Can you please clarify why the individual transports aren't sufficient
for this QoS feature you are talking about?

If we look at the transports landscape, each transport has a credit
mechanism that can throttle bulk data transfers. In FC exchanges the
target is in control pulling data from the host with xfer_ready,
In RDMA the target decides when to issue rdma_read, and in TCP the
target decides when to issue R2T.

These are all credits that give the control to the target to
back-pressure the host. Now if the target doesn't want the host to send
more commands, it can throttle sending completions thus controlling the
pace.

I must say that returning BLK_STS_RESOURCE for host-managed SQ_HEAD is a
bit awkward in my mind, but that's just one opinion; what do others have
to say?


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26 23:45 ` Sagi Grimberg
@ 2020-02-27  1:46   ` James Smart
  2020-02-27  3:52     ` Sagi Grimberg
  2020-02-28 10:39   ` Hannes Reinecke
  1 sibling, 1 reply; 16+ messages in thread
From: James Smart @ 2020-02-27  1:46 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Keith Busch
  Cc: James Smart, linux-nvme, Chaitanya Kulkarni, John Meneghini



On 2/26/2020 3:45 PM, Sagi Grimberg wrote:
>
>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>> should implement it to avoid the controller returning a fatal status
>> error, and to play nicely with controllers that use SQ flow control
>> to implement QoS.
>
> Hannes,
>
> Can you please clarify why the individual transports aren't sufficient
> for this QoS feature you are talking about?


It's all about allocation of resources - whether mandated or not. The
current nvme/nvmeof model requires memory for the entire sq size (well,
maybe sqsize - 1) so that the host can send a sqe and there's a place to
put it on the target.  For a transport like FC, where there isn't
dedicated memory allocated for the sq and each sqe arrives
independently, there's a desire to guarantee only a small number of
elements, then let the controller adjust sqhd to open more of a window
and allow more commands from a particular host.  When the subsystem is
managing multiple controllers for perhaps many hosts, they may all be
sharing the same set of receive buffers. The controller would like to
control the ebb and flow of which controllers/which queues use the
available resources at any particular point in time, based on
host/controller QoS settings.

As most arrays are limited in the amount of memory they can dedicate to
receive buffers/queues, and as the rules to date require guaranteed
allocation of sq size, the controllers have to pick how much they scale:
how many associations/controllers vs how many queues per controller vs
sq size per queue that they allow.  The disablement of sq flow control
doesn't help the "guarantee"; it actually confirms it. There are a
couple of issues in the nvmeof spec that conflict with removing the
guarantee - namely: can a value other than 1 (or 0) be returned from the
connect fabric command for sqhd?  It can be argued that our
implementation today, which returns 1 and thus assumes an increment
after connect at idx 0, is actually incorrect per the spec, as the queue
is to be created as part of connect and is to be "empty" with
sqhd=sqtail=0.  A poor man's implementation, which ignores this initial
connect response issue in the spec and tries to avoid dedicated
resources, would live with allowing a storm of one sqsize's worth per
queue, expecting that it won't really happen in a complete burst; sqhd
can then be controlled going back to the host such that further io can't
be sent beyond one sqsize's worth unless the controller moves sqhd.


>
> If we look at the transports landscape, each transport has a credit
> mechanism that can throttle bulk data transfers. In FC exchanges the
> target is in control pulling data from the host with xfer_ready,
> In RDMA the target decides when to issue rdma_read, and in TCP the
> target decides when to issue R2T.

It's not data flow, it's reception of io commands vs the sq.

>
> These are all credits that give the control to the target to
> back-pressure the host. Now if the target doesn't want the host to send
> more commands, it can throttle sending completions thus controlling the
> pace.

That doesn't help the system. It only makes the io look like it takes a
lot longer to complete. At some point, the completion has to go back.

>
> I must say that returning BLK_STS_RESOURCE for host-managed SQ_HEAD is a
> bit awkward in my mind, but that's just one opinion; what do others have
> to say?

It isn't a host-managed SQ_HEAD, it's a real implementation of
acknowledging the sq flow control that was originally spec'd.  We've
always ignored this in lieu of cheating and setting the blk-mq request
count, and hoped that, as we never sent less than a queue's worth, there
was never a reason for the controller to need to not increment sqhd on a
1-by-1 basis.  But the other issue with this is that we're wasting lots
of sq space on the controller. We have 4, 10, 100 sqs but only send 1
sq's worth of io?  Why would a controller ever want to support a high
queue count except for a back-to-back attachment to a fixed number of
hosts (1?).  That's not reasonable for a real SAN device - so we're
forcing them into small queue counts and small sq sizes so they don't
waste memory yet can handle bunches of associations.  So in addition to
the sqhd tracking that Hannes was proposing, for FC at least, as sqs are
logical, I was also going to look into supporting more than 1 sq's worth
of ios.

-- james




* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-27  1:46   ` James Smart
@ 2020-02-27  3:52     ` Sagi Grimberg
  2020-02-27 21:46       ` Meneghini, John
  2020-02-28 16:35       ` James Smart
  0 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2020-02-27  3:52 UTC (permalink / raw)
  To: James Smart, Hannes Reinecke, Keith Busch
  Cc: Chaitanya Kulkarni, linux-nvme, John Meneghini


>>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>>> should implement it to avoid the controller returning a fatal status
>>> error, and to play nicely with controllers that use SQ flow control
>>> to implement QoS.
>>
>> Hannes,
>>
>> Can you please clarify why the individual transports aren't sufficient
>> for this QoS feature you are talking about?
> 
> 
> It's all about allocation of resources - whether mandated or not.

See Hannes' patch description, he talks about QoS, not a resource
problem. I am arguing that rate-limiting a specific host can be done
today.

> Current nvme/nvmeof model requires memory for entire sq size (well maybe 
> sqsize -1) so that the host can send a sqe and there's a place to put it 
> on the target.

That is not the case for either TCP or RDMA. TCP can not read from the
socket, and RDMA can not post a receive buffer, which will cause the
HW to issue rnr nak (receive not ready), and the host would be
effectively throttled. So there is no such memory requirement (not for
PCIe either).

> For a transport like FC, where there isn't dedicated 
> memory allocated for the sq and each sqe arrives independently, there's 
> a desire to guarantee only a small number of elements, then let the 
> controller adjust sqhd up to open more of a window to allow more 
> commands from a particular host.

I can have shared resources with both RDMA, TCP (and PCIe) and
implementing what you are saying is perfectly achievable.

> When the subsystem is managing 
> multiple controllers to perhaps many hosts, they may be all sharing the 
> same set of receive buffers.

This is not specific to FC, see nvmet-rdma usage of srq for example, and
nvmet-tcp can have the same concept.

This is becoming a resource constrained issue which is different from
the patch description.

> The controller would like to control the 
> ebb and flow of what controllers/what queues use the available resources 
> at any particular point in time based on host/controller QOS settings.
> 
> As most arrays are limited in the amount of memory they can dedicate to 
> receive buffers/queues and as the rules to date require guaranteed 
> allocation of sq size, the controllers have to pick how much they scale 
> - how many associations/controllers vs how many queues per controller vs 
> sq size per queue that they allow.

They should do that regardless; there is no point allowing lots of
queues if effectively the array supports very little, right?

> The disablement of sqflow control 
> doesn't help the "guarantee", it actually confirms it. There are a 
> couple of issues in the nvmeof spec that conflict with removing the 
> guarantee - namely - can a value other than 1 (or 0) be returned from 
> the connect fabric command for sqhd ?  It can be argued that our 
> implementation today, which returns 1, which assumes an increment after 
> connect at idx 0 is actually incorrect to the spec as the queue is to be 
> created as part of connect and it is to be "empty" with 
> sqhd=sqtail=0.     A poor mans implementation, which ignores this 
> initial connect response issue in the spec, and tries to avoid dedicated 
> resources, would live with allowing a storm of 1 sqsize worth per queue, 
> expecting that it won't really happen in a complete burst, and sqhd can 
> be controlled going back to the host such that further io can't be sent 
> beyond 1 sqsizes's worth unless the controller moves sqhd.

You lost me a bit; is the issue supporting QD=1 from a large number of
queues?

>> If we look at the transports landscape, each transport has a credit
>> mechanism that can throttle bulk data transfers. In FC exchanges the
>> target is in control pulling data from the host with xfer_ready,
>> In RDMA the target decides when to issue rdma_read, and in TCP the
>> target decides when to issue R2T.
> 
> it's not data flow, it's reception of io commands vs sq.
> 
>>
>> These are all credits that give the control to the target to
>> back-pressure the host. Now if the target doesn't want the host to send
>> more commands, it can throttle sending completions thus controlling the
>> pace.
> 
> that doesn't help the system. It only makes the io look like it takes a 
> lot longer to complete.

Exactly, they will take longer to complete, meaning the host won't issue
more. Even if the host is not issuing the command, it is still
effectively taking the same time to complete.

> At some point, the completion has to go back.

Yes.

>> I must say that returning BLK_STS_RESOURCE for host-managed SQ_HEAD is a
>> bit awkward in my mind, but that's just one opinion; what do others have
>> to say?
> 
> it isn't a host-managed SQ_HEAD. it's a real implementation of 
> acknowledging the sq flow control that was originally spec'd.  We've 
> always ignored this in lieu of cheating and setting the blk-mq request 
> count, and hoped as we never sent less than a queue's worth there was 
> never a reason for the controller to need to not increment sqhd on a 
> 1by1 basis.    But the other issue with this is - we're wasting lots of 
> sq space on the controller. We have 4, 10, 100 sqs but only send 1 sq's 
> worth of io ?  why would a controller ever want to support a high queue 
> count except for a back-to-back attachment to a fixed number of hosts (1 
> ?).

I agree, the controller can tell the host how many queues and what max
queue depth to use; it can prevent the host from allocating lots of long
queues.

> That's not reasonable for a real SAN device - so we're forcing them 
> into small queue counts and small sq sizes so they don't waste memory 
> yet can handle bunches of associations.

Sounds like this makes sense.

> So in addition to this sqhd 
> tracking that Hannes was proposing, for FC at least, as sq's are 
> logical,

You describe this sqhd throttling as something that will happen a lot,
and at normal operating points, so returning BLK_STS_RESOURCE and having
blk-mq retry over and over doesn't seem like an adequate solution
in my mind...

> I was also going to look into supporting more than 1 sq's worth 
> of ios.

What do you mean? where is more than 1 sq's worth of ios not supported?
You talking about for a single queue? Maybe I'm missing something...


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26  0:38     ` Sagi Grimberg
@ 2020-02-27 11:27       ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-02-27 11:27 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini

On 2/26/20 1:38 AM, Sagi Grimberg wrote:
> 
>>>> As per the NVMe-oF spec, SQ flow control is actually mandatory, and we
>>>> should implement it to avoid the controller returning a fatal status
>>>> error, and to play nicely with controllers that use SQ flow control
>>>> to implement QoS.
>>>
>>> Why is your target setting SQ flow control disable in the discovery
>>> log entry treq field? We are not supposed to do it if the NVM subsystem
>>> doesn't support it.
>> Did I say this? I wasn't aware that I did imply that in any way.
>> I'm just using the flag to figure out if sq flow control is disabled 
>> or not.
> 
> OK, I read it backwards.
> 
> So in essence blk-mq will just keep sending until sq_head is
> incremented.. How often does that happen typically?

... keep sending until sq_head matches sq_tail.
But yes, that was the plan.
The frequency with which it'll be called depends on the controller;
when it's used by the controller for QoS and the controller is under
load then it might happen quite often indeed.
But then, in these scenarios we _want_ a slowdown, so performance
arguments do not really apply ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-27  3:52     ` Sagi Grimberg
@ 2020-02-27 21:46       ` Meneghini, John
  2020-02-28 16:35       ` James Smart
  1 sibling, 0 replies; 16+ messages in thread
From: Meneghini, John @ 2020-02-27 21:46 UTC (permalink / raw)
  To: Sagi Grimberg, James Smart, Hannes Reinecke, Keith Busch
  Cc: Knight, Frederick, Chaitanya Kulkarni, linux-nvme, Meneghini, John

> There are a couple of issues in the nvmeof spec that conflict with removing the
> guarantee - namely - can a value other than 1 (or 0) be returned from
> the connect fabric command for sqhd ?  

I think the Connect command response tells the host exactly whether it's 1 or 0.

   Figure 21: Connect Response
   SQ Head Pointer (SQHD): Indicates the current Submission Queue Head pointer for the associated Submission Queue.
  [Fabrics 1.0]

>  It can be argued that our implementation today, which returns 1, which assumes an increment after
>  connect at idx 0 is actually incorrect to the spec as the queue is to be created as part of connect and
>  it is to be "empty" with sqhd=sqtail=0.

I agree that this is an assumption, and I think a number of problems arise from the fact that what we have
today did not implement the spec when it comes to the SQ head pointer.

> A poor mans implementation, which ignores this initial connect response issue in the spec, and tries to avoid dedicated
> resources, would live with allowing a storm of 1 sqsize worth per queue, expecting that it won't really happen in a complete 
> burst, and sqhd can be controlled going back to the host such that further io can't be sent beyond 1 sqsizes's worth unless
> the controller moves sqhd.

Yes, NVMe allows the controller to support a logical queue depth (MAXCMD) and a physical
queue depth (CAP.MQES).  So if proper support for the SQHD updates can be implemented, it
will solve a number of "resource problems" with both Fabrics and PCIe devices.  There is no need
for the controller to allocate a huge MQES simply to support a large logical queue depth.  In fact,
the minimum IO queue depth (MQES) is 2.

So, as James describes above, it should be possible for controllers out there to reduce the resources
needed to provision NVMe queues, without reducing their actual queue depth.  But it all depends
upon properly supporting the SQHD field in the CQE.

>> See Hannes' patch description, he talks about QoS, not a resource
>>  problem.

So, the patch description may be a misnomer.  This is an RFC, and I think the real
issue here is our current treatment of the SQHD.  QoS is just one use case that may
benefit from properly supporting the queue mechanics specified by the NVMe specification.

We argued about this at Vault, and there needs to be some clarification w.r.t. the NVMe spec, but
assuming the assertion holds that the implementation we have today is not compliant with the NVMe
spec, I think this RFC is an important patch.
  
>>  I am arguing that rate-limiting a specific host can be done today.

I disagree. It can't, because it was never implemented.  I know this was the idea, at least with FC,
but there is no end-to-end credit management system that can be used to solve this problem
in any of the transports.  So this means all of the IOs, and all of the problems, end up in the controller.

>>> Current nvme/nvmeof model requires memory for entire sq size (well maybe
>>> sqsize -1) so that the host can send a sqe and there's a place to put it
>>> on the target.

Actually, I don't think this is true.  As mentioned above, the current NVMe
spec allows the controller to support a huge logical queue size with a small
sq size. 
 
>> That is not the case for either TCP or RDMA. TCP can not read from the
>> socket, 

Right, and that's exactly the kind of hack that leads to IO timeouts and
controller resets.  This is not an effective way to support rate limiting or QoS.

>> and  RDMA can not post a receive buffer, which will cause the
>> HW to issue rnr nak (receive not ready), and the host would be
>> effectively throttled. 

I disagree.  This RNR/NAK solution assumes SRQ, and it is not
a viable solution for QoS. It suffers from the same problems that
dropping IP packets on the floor does.  There is no way for
the host to figure out what the problem is, and you just end up
with IO timeouts and controller resets.

>> So there are no such memory requirement (not for PCIe either).

If you are saying that the PCI controller can run with a send queue
of 2, I agree.  But to make that solution work, the host needs to 
support SQHD update correctly.

     Figure 69: Offset 0h: CAP – Controller Capabilities
     Maximum Queue Entries Supported (MQES): This field indicates the maximum individual queue size that the controller supports.
     For NVMe over PCIe implementations, this value applies to the I/O Submission Queues and I/O Completion Queues that the host creates.
     For NVMe over Fabrics implementations, this value applies to only the I/O Submission Queues that the host creates. 
     This is a 0’s based value. The minimum value is 1h, indicating two entries.

     4.1 Submission Queue & Completion Queue Definition
     The controller uses the SQ Head Pointer (SQHD) field in Completion Queue entries to communicate new values of the Submission Queue Head 
     Pointer to the host. A new SQHD value indicates that Submission Queue entries have been consumed, but does not indicate either execution or
     completion of any command. Refer to section 4.6.
     [NVMe 1.4]

Ultimately this all boils down to whose queue is going to hold the IO?   The assertion is that it's better to
hold the IOs in the host's queue than in the controller's queue, and that the NVMe
protocol provides a mechanism to do exactly that.  So we want to use it.
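To make the head/tail arithmetic concrete, here is a minimal userspace sketch of the host-side bookkeeping the spec implies (illustrative only, not the kernel implementation; function names are mine): the host may only submit when advancing the tail would not catch up with the last SQHD value the controller reported.

```c
#include <assert.h>

/* Illustrative model of NVMe SQ flow control (not kernel code).
 * sqsize is the number of slots; one slot is always left unused so
 * that a full queue can be distinguished from an empty one.
 */
static int sq_can_submit(int sq_head, int sq_tail, int sqsize)
{
	/* queue is full when advancing the tail would reach the head */
	return (sq_tail + 1) % sqsize != sq_head;
}

static int sq_free_slots(int sq_head, int sq_tail, int sqsize)
{
	/* number of SQEs the host may still send */
	return (sq_head - sq_tail - 1 + sqsize) % sqsize;
}
```

With head == tail the queue is empty and sqsize-1 commands may be in flight; once the controller stops advancing SQHD, sq_can_submit() goes false and the host has to hold the IO on its side rather than in the controller.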

Otherwise NVMe is really no better than SCSI, SATA, SAS or any other protocol we've worked on over
the last 20 years.

/John

On 2/26/20, 7:52 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote:    
    
    >>> As per NVMe-oF spec sq flow control is actually mandatory, and we should
    >>> be implementing it to avoid the controller to return a fatal status
    >>> error, and try to play nicely with controllers using sq flow control
    >>> to implement QoS.
    >>
    >> Hannes,
    >>
    >> Can you please clarify why the individual transports aren't sufficient
    >> for this QoS feature you are talking about?
    >
    >
    > It's all about allocation of resources - whether mandated or not.
    
    See Hannes' patch description, he talks about QoS, not a resource
    problem. I am arguing that rate-limiting a specific host can be done
    today.
    
    > Current nvme/nvmeof model requires memory for entire sq size (well maybe
    > sqsize -1) so that the host can send a sqe and there's a place to put it
    > on the target.
    
    That is not the case for either TCP or RDMA. TCP can not read from the
    socket, and RDMA can not post a receive buffer, which will cause the
    HW to issue rnr nak (receive not ready), and the host would be
    effectively throttled. So there are no such memory requirement (not for
    PCIe either).
    
    > For a transport like FC, where there isn't dedicated
    > memory allocated for the sq and each sqe arrives independently, there's
    > a desire to guarantee only a small number of elements, then let the
    > controller adjust sqhd up to open more of a window to allow more
    > commands from a particular host.
    
    I can have shared resources with both RDMA, TCP (and PCIe) and
    implementing what you are saying is perfectly achievable.
    
    > When the subsystem is managing
    > multiple controllers to perhaps many hosts, they may be all sharing the
    > same set of receive buffers.
    
    This is not specific to FC, see nvmet-rdma usage of srq for example, and
    nvmet-tcp can have the same concept.
    
    This is becoming a resource constrained issue which is different from
    the patch description.
    
    > The controller would like to control the
    > ebb and flow of what controllers/what queues use the available resources
    > at any particular point in time based on host/controller QOS settings.
    >
    > As most arrays are limited in the amount of memory they can dedicate to
    > receive buffers/queues and as the rules to date require guaranteed
    > allocation of sq size, the controllers have to pick how much they scale
    > - how many associations/controllers vs how many queues per controller vs
    > sq size per queue that they allow.
    
    They should do that regardless, there is no point allowing lots of
    queues if effectively the array supports very little right?
    
    > The disablement of sqflow control
    > doesn't help the "guarantee", it actually confirms it. There are a
    > couple of issues in the nvmeof spec that conflict with removing the
    > guarantee - namely - can a value other than 1 (or 0) be returned from
    > the connect fabric command for sqhd ?  It can be argued that our
    > implementation today, which returns 1, which assumes an increment after
    > connect at idx 0 is actually incorrect to the spec as the queue is to be
    > created as part of connect and it is to be "empty" with
    > sqhd=sqtail=0.     A poor mans implementation, which ignores this
    > initial connect response issue in the spec, and tries to avoid dedicated
    > resources, would live with allowing a storm of 1 sqsize worth per queue,
    > expecting that it won't really happen in a complete burst, and sqhd can
    > be controlled going back to the host such that further io can't be sent
    > beyond 1 sqsizes's worth unless the controller moves sqhd.
    
    You lost me a bit, is the issue supporting QD=1 from a large number of
    queues?
    
    >> If we look at the transports landscape, each transport has a credit
    >> mechanism that can throttle bulk data transfers. In FC exchanges the
    >> target is in control pulling data from the host with xfer_ready,
    >> In RDMA the target decides when to issue rdma_read, and in TCP the
    >> target decides when to issue R2T.
    >
    > it's not data flow, it's reception of io commands vs sq.
    >
    >>
    >> These are all credits that give the control to the target to
    >> back-pressure the host. Now if the target doesn't want the host to send
    >> more commands, it can throttle sending completions thus controlling the
    >> pace.
    >
    > that doesn't help the system. It only makes the io look like it takes a
    > lot longer to complete.
    
    Exactly, they will take longer to complete, meaning the host won't issue
    more. Even if the host is not issuing the command, it is still
    effectively taking the same time to complete.
    
    > At some point, the completion has to go back.
    
    Yes.
    
    >> I must say that returning BLK_STS_RESOURCE for host managed SQ_HEAD is a
    >> bit awkward in my mind, but that just one's opinion, what do others have
    >> to say?
    >
    > it isn't a host-managed SQ_HEAD. it's a real implementation of
    > acknowledging the sq flow control that was originally spec'd.  We've
    > always ignored this in lieu of cheating and setting the blk-mq request
    > count, and hoped as we never sent less than a queue's worth there was
    > never a reason for the controller to need to not increment sqhd on a
    > 1by1 basis.    But the other issue with this is - we're wasting lots of
    > sq space on the controller. We have 4, 10, 100 sqs but only send 1 sq's
    > worth of io ?  why would a controller ever want to support a high queue
    > count except for a back-to-back attachment to a fixed number of hosts (1
    > ?).
    
    I agree, the controller can tell the host how many queues and max
    queue depth, it can prevent the host from allocating lots of long
    queues.
    
    > That's not reasonable for a real SAN device - so we're forcing them
    > into small queue counts and small sq sizes so they don't waste memory
    > yet can handle bunches of associations.
    
    Sounds like this makes sense.
    
    > So in addition to this sqhd
    > tracking that Hannes was proposing, for FC at least, as sq's are
    > logical,
    
    You describe this sqhd throttling as something that will happen a lot,
    and in normal working points, so returning BLK_STS_RESOURCE having
    blk-mq retrying over and over doesn't seem like an adequate solution
    in my mind...
    
    > I was also going to look into supporting more than 1 sq's worth
    > of ios.
    
    What do you mean? where is more than 1 sq's worth of ios not supported?
    You talking about for a single queue? Maybe I'm missing something...
    

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-26 23:45 ` Sagi Grimberg
  2020-02-27  1:46   ` James Smart
@ 2020-02-28 10:39   ` Hannes Reinecke
  1 sibling, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-02-28 10:39 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: linux-nvme, James Smart, Chaitanya Kulkarni, John Meneghini

On 2/27/20 12:45 AM, Sagi Grimberg wrote:
> 
>> As per NVMe-oF spec sq flow control is actually mandatory, and we should
>> be implementing it to avoid the controller to return a fatal status
>> error, and try to play nicely with controllers using sq flow control
>> to implement QoS.
> 
> Hannes,
> 
> Can you please clarify why the individual transports aren't sufficient
> for this QoS feature you are talking about?
> 
> If we look at the transports landscape, each transport has a credit
> mechanism that can throttle bulk data transfers. In FC exchanges the
> target is in control pulling data from the host with xfer_ready,
> In RDMA the target decides when to issue rdma_read, and in TCP the
> target decides when to issue R2T.
> 
> These are all credits that give the control to the target to
> back-pressure the host. Now if the target doesn't want the host to send
> more commands, it can throttle sending completions thus controlling the
> pace.
> 
Yes, that's true. However, this mechanism requires the target to 
allocate resources up front to hold the first part of the transfer, 
i.e. the command structure itself has to be allocated at the target.
So with this method we'll run into issues when the target goes out of 
memory under high traffic, as not all drivers are coded carefully 
enough to avoid memory allocation in the I/O path.
The other problem we have on FC is that the SAN carries an internal 
timeout (R_A_TOV, resource allocation timeout) within which a 
transaction needs to be completed. So any delay in sending RTS etc. 
cannot exceed this value.
However, the real problem is that we're unable to detect a conformant 
implementation. By default we do not disable flow control, and we do 
not look for SQ Head updates. So if we run against a conformant target 
which decides to throttle I/O by withholding SQ Head updates, the 
controller will eventually terminate the transport connection with no 
indication as to why the reset happened.
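As a sketch of that failure mode (a userspace model, with all names and details assumed for illustration): a conformant controller accounts for queue occupancy itself, so a host that ignores SQHD will eventually submit into a full queue, and the controller's only recourse is a fatal teardown with no further indication to the host.

```c
#include <assert.h>

/* Controller-side model (illustrative only, not nvmet code): the
 * controller tracks the occupancy implied by the SQHD values it has
 * reported. A host that keeps submitting after the queue is full
 * violates the protocol; the controller tears down the transport
 * connection.
 */
struct sq_state {
	int head;	/* last SQHD reported to the host */
	int tail;	/* next free slot, advanced on received SQEs */
	int sqsize;
};

static int ctrl_recv_sqe(struct sq_state *sq)
{
	if ((sq->tail + 1) % sq->sqsize == sq->head)
		return -1;	/* overflow: fatal, connection reset */
	sq->tail = (sq->tail + 1) % sq->sqsize;
	return 0;
}
```

In this model a queue of sqsize 4 accepts three commands without a head update; the fourth is the protocol violation the host never sees coming.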

> I must say that returning BLK_STS_RESOURCE for host managed SQ_HEAD is a
> bit awkward in my mind, but that just one's opinion, what do others have
> to say?

What would be the alternative?
Reducing the queue size for the hardware queue seems a bit excessive, 
but I'm open to having a different return code here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-27  3:52     ` Sagi Grimberg
  2020-02-27 21:46       ` Meneghini, John
@ 2020-02-28 16:35       ` James Smart
  1 sibling, 0 replies; 16+ messages in thread
From: James Smart @ 2020-02-28 16:35 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Keith Busch
  Cc: Chaitanya Kulkarni, linux-nvme, John Meneghini



On 2/26/2020 7:52 PM, Sagi Grimberg wrote:
>
>>>> As per NVMe-oF spec sq flow control is actually mandatory, and we 
>>>> should
>>>> be implementing it to avoid the controller to return a fatal status
>>>> error, and try to play nicely with controllers using sq flow control
>>>> to implement QoS.
>>>
>>> Hannes,
>>>
>>> Can you please clarify why the individual transports aren't sufficient
>>> for this QoS feature you are talking about?
>>
>>
>> It's all about allocation of resources - whether mandated or not.
>
> See Hannes' patch description, he talks about QoS, not a resource
> problem. I am arguing that rate-limiting a specific host can be done
> today.

QoS is an overloaded term. It's not just rate limiting: whether effort 
is expended on doing an io is also considered a type of QoS, with most 
devices assuming "if I have it, I'll work on it" - not just park it. I 
don't want to waste memory on parking lots.

>
>> Current nvme/nvmeof model requires memory for entire sq size (well 
>> maybe sqsize -1) so that the host can send a sqe and there's a place 
>> to put it on the target.
>
> That is not the case for either TCP or RDMA. TCP can not read from the
> socket, and RDMA can not post a receive buffer, which will cause the
> HW to issue rnr nak (receive not ready), and the host would be
> effectively throttled. So there are no such memory requirement (not for
> PCIe either).

yep - so a difference vs FC which doesn't have real "connections".

>
>> For a transport like FC, where there isn't dedicated memory allocated 
>> for the sq and each sqe arrives independently, there's a desire to 
>> guarantee only a small number of elements, then let the controller 
>> adjust sqhd up to open more of a window to allow more commands from a 
>> particular host.
>
> I can have shared resources with both RDMA, TCP (and PCIe) and
> implementing what you are saying is perfectly achievable.
>
>> When the subsystem is managing multiple controllers to perhaps many 
>> hosts, they may be all sharing the same set of receive buffers.
>
> This is not specific to FC, see nvmet-rdma usage of srq for example, and
> nvmet-tcp can have the same concept.
>
> This is becoming a resource constrained issue which is different from
> the patch description.

the only disagreement is terms - many consider resource allocation an 
element of QoS.

>
>> The controller would like to control the ebb and flow of what 
>> controllers/what queues use the available resources at any particular 
>> point in time based on host/controller QOS settings.
>>
>> As most arrays are limited in the amount of memory they can dedicate 
>> to receive buffers/queues and as the rules to date require guaranteed 
>> allocation of sq size, the controllers have to pick how much they 
>> scale - how many associations/controllers vs how many queues per 
>> controller vs sq size per queue that they allow.
>
> They should do that regardless, there is no point allowing lots of
> queues if effectively the array supports very little right?

Agree - but in some cases the scale that has to be reached pushes 
things to absurdly small levels.  Given that most traffic is bursty, it 
would be better to give each host enough to recognize when it starts 
pushing load, then, based on a priority scheme or some other metric, 
dole out the resources of the subsystem.  People view that policy 
choice as QoS.

>
>> The disablement of sqflow control doesn't help the "guarantee", it 
>> actually confirms it. There are a couple of issues in the nvmeof spec 
>> that conflict with removing the guarantee - namely - can a value 
>> other than 1 (or 0) be returned from the connect fabric command for 
>> sqhd ?  It can be argued that our implementation today, which returns 
>> 1, which assumes an increment after connect at idx 0 is actually 
>> incorrect to the spec as the queue is to be created as part of 
>> connect and it is to be "empty" with sqhd=sqtail=0.     A poor mans 
>> implementation, which ignores this initial connect response issue in 
>> the spec, and tries to avoid dedicated resources, would live with 
>> allowing a storm of 1 sqsize worth per queue, expecting that it won't 
>> really happen in a complete burst, and sqhd can be controlled going 
>> back to the host such that further io can't be sent beyond 1 
>> sqsizes's worth unless the controller moves sqhd.
>
> You lost me a bit, is the issue supporting QD=1 from a large number of
> queues?

No.  What I'm trying to say is: to use sqhd for flow control, there are 
2 needs: a) the ability to control when sqhd increments, not inherently 
1:1 with a completion; and b) control of the initial value for sqhd.
For (b), if you can't control it, then you are always in a startup 
condition where the host may send a full sqsize-1 worth of sqes.  What 
the device would rather have done is send an sqhd value at Connect 
completion such that the sqtail/sqhd computation shows a few "free" 
slots, not the full sqsize-1's worth.  The spec doesn't necessarily say 
that the device couldn't do that today, although many people, based on 
the other "initialization" and "empty" queue statements, believe it 
must be 0...  and that is actually at odds with what we do today, as we 
actually return 1 due to the "send it on the queue it creates" issue.  
This should-it-be-0 or should-it-be-1 needs to be cleared up, as well 
as whether the controller can return other values. I proposed errata 
for an ECN over a year ago, but it has been constantly back-burnered.

So the latter half of the paragraph was stating: as long as the 
controller has (a), it could live without (b), just assuming that the 
initial sqsize-1 credits' worth won't all arrive in one storm, that its 
management of sqhd will resolve things within the first iteration of 
the sq, and that it can manage solely with (a).
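A sketch of what (b) would buy (values and the helper name are hypothetical, not from the spec or the driver): if the Connect response could carry an arbitrary initial SQHD, the controller could open a window of only a few credits instead of the full sqsize-1.

```c
#include <assert.h>

/* Illustrative credit computation: the number of SQEs the host may
 * still send, derived from the SQHD the controller last reported and
 * the host's own tail (one slot stays unused to mark "full").
 */
static int sq_credits(int sqhd, int sqtail, int sqsize)
{
	return (sqhd - sqtail - 1 + sqsize) % sqsize;
}
```

Today, after Connect (slot 0 consumed, sqhd returned as 1, tail at 1), the computation yields the full sqsize-1 window; a controller that could instead return an sqhd opening, say, only four credits would not have to absorb a full-queue storm at startup.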

>
>>> If we look at the transports landscape, each transport has a credit
>>> mechanism that can throttle bulk data transfers. In FC exchanges the
>>> target is in control pulling data from the host with xfer_ready,
>>> In RDMA the target decides when to issue rdma_read, and in TCP the
>>> target decides when to issue R2T.
>>
>> it's not data flow, it's reception of io commands vs sq.
>>
>>>
>>> These are all credits that give the control to the target to
>>> back-pressure the host. Now if the target doesn't want the host to send
>>> more commands, it can throttle sending completions thus controlling the
>>> pace.
>>
>> that doesn't help the system. It only makes the io look like it takes 
>> a lot longer to complete.
>
> Exactly, they will take longer to complete, meaning the host won't issue
> more. Even if the host is not issuing the command, it is still
> effectively taking the same time to complete.

Well, an io that moves from 10us latency to 100us will definitely 
affect the application that issued it, which is why I don't agree with 
the approach.

>
>> At some point, the completion has to go back.
>
> Yes.
>
>>> I must say that returning BLK_STS_RESOURCE for host managed SQ_HEAD 
>>> is a
>>> bit awkward in my mind, but that just one's opinion, what do others 
>>> have
>>> to say?
>>
>> it isn't a host-managed SQ_HEAD. it's a real implementation of 
>> acknowledging the sq flow control that was originally spec'd. We've 
>> always ignored this in lieu of cheating and setting the blk-mq 
>> request count, and hoped as we never sent less than a queue's worth 
>> there was never a reason for the controller to need to not increment 
>> sqhd on a 1by1 basis.    But the other issue with this is - we're 
>> wasting lots of sq space on the controller. We have 4, 10, 100 sqs 
>> but only send 1 sq's worth of io ?  why would a controller ever want 
>> to support a high queue count except for a back-to-back attachment to 
>> a fixed number of hosts (1 ?).
>
> I agree, the controller can tell the host how many queues and max
> queue depth, it can prevent the host from allocating lots of long
> queues.
>
>> That's not reasonable for a real SAN device - so we're forcing them 
>> into small queue counts and small sq sizes so they don't waste memory 
>> yet can handle bunches of associations.
>
> Sounds like this makes sense.
>
>> So in addition to this sqhd tracking that Hannes was proposing, for 
>> FC at least, as sq's are logical,
>
> You describe this sqhd throttling as something that will happen a lot,
> and in normal working points, so returning BLK_STS_RESOURCE having
> blk-mq retrying over and over doesn't seem like an adequate solution
> in my mind...

Shrug - I don't know how you'd do otherwise. Retrying, usually with a 
small interval before the retry, will typically work well as most 
things come in bursts.
Certainly open to other ideas that may influence blk-mq to avoid all 
the retries.

>
>> I was also going to look into supporting more than 1 sq's worth of ios.
>
> What do you mean? where is more than 1 sq's worth of ios not supported?
> You talking about for a single queue? Maybe I'm missing something...

Sorry my mistake. I was under the impression that tag_set.queue_depth 
was an absolute cap, spread across the hw queues, not a depth per hw queue.

-- james




* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-02-25 23:59 [PATCH RFC] nvme/fc: sq flow control Hannes Reinecke
                   ` (2 preceding siblings ...)
  2020-02-26 23:45 ` Sagi Grimberg
@ 2020-03-09 21:59 ` James Smart
  2020-03-10  6:55   ` Hannes Reinecke
  3 siblings, 1 reply; 16+ messages in thread
From: James Smart @ 2020-03-09 21:59 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, Sagi Grimberg, Chaitanya Kulkarni, John Meneghini



On 2/25/2020 3:59 PM, Hannes Reinecke wrote:
> As per NVMe-oF spec sq flow control is actually mandatory, and we should
> be implementing it to avoid the controller to return a fatal status
> error, and try to play nicely with controllers using sq flow control
> to implement QoS.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
>   1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index a19ddb61039d..628397bd5065 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -12,6 +12,7 @@
>   
>   #include "nvme.h"
>   #include "fabrics.h"
> +#include "trace.h"
>   #include <linux/nvme-fc-driver.h>
>   #include <linux/nvme-fc.h>
>   #include <scsi/scsi_transport_fc.h>
> @@ -34,7 +35,8 @@ struct nvme_fc_queue {
>   	size_t			cmnd_capsule_len;
>   	u32			qnum;
>   	u32			seqno;
> -
> +	int			sq_head;
> +	int			sq_tail;
>   	u64			connection_id;
>   	atomic_t		csn;
>   
> @@ -1671,6 +1673,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>   				cqe->command_id);
>   			goto done;
>   		}
> +		WRITE_ONCE(queue->sq_head, cpu_to_le16(cqe->sq_head));
> +		trace_nvme_sq(rq, cqe->sq_head, queue->sq_tail);
>   		result = cqe->result;
>   		status = cqe->status;
>   		break;
> @@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
>   	freq->sg_cnt = 0;
>   }
>   
> +static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int incr)
> +{
> +	int old_sqtl, new_sqtl;
> +
> +	do {
> +		old_sqtl = queue->sq_tail;
> +		new_sqtl = (old_sqtl + incr) % queue->ctrl->ctrl.sqsize;
> +	} while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
> +		 old_sqtl);
> +	return new_sqtl;
> +}
> +
>   /*
>    * In FC, the queue is a logical thing. At transport connect, the target
>    * creates its "queue" and returns a handle that is to be given to the
> @@ -2219,6 +2235,14 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
>   	if (!nvme_fc_ctrl_get(ctrl))
>   		return BLK_STS_IOERR;
>   
> +	if (!ctrl->ctrl.opts->disable_sqflow) {
> +		if (nvme_fc_update_sq_tail(queue, 1) ==
> +		    READ_ONCE(queue->sq_head)) {
> +			nvme_fc_update_sq_tail(queue, -1);
> +			return BLK_STS_RESOURCE;
> +		}
> +	}
> +
>   	/* format the FC-NVME CMD IU and fcp_req */
>   	cmdiu->connection_id = cpu_to_be64(queue->connection_id);
>   	cmdiu->data_len = cpu_to_be32(data_len);
> @@ -2284,6 +2308,9 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
>   					queue->lldd_handle, &op->fcp_req);
>   
>   	if (ret) {
> +		if (ctrl->ctrl.opts->disable_sqflow)
> +			nvme_fc_update_sq_tail(queue, -1);
> +
>   		/*
>   		 * If the lld fails to send the command is there an issue with
>   		 * the csn value?  If the command that fails is the Connect,

fyi - I NAK'd this patch mainly as it wasn't ordering ERSP's before 
processing SQHD.

-- james



* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-03-09 21:59 ` James Smart
@ 2020-03-10  6:55   ` Hannes Reinecke
  2020-03-10 16:44     ` James Smart
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2020-03-10  6:55 UTC (permalink / raw)
  To: James Smart, Keith Busch
  Cc: linux-nvme, Sagi Grimberg, Chaitanya Kulkarni, John Meneghini

On 3/9/20 10:59 PM, James Smart wrote:
> 
> 
> On 2/25/2020 3:59 PM, Hannes Reinecke wrote:
>> As per NVMe-oF spec sq flow control is actually mandatory, and we should
>> be implementing it to avoid the controller to return a fatal status
>> error, and try to play nicely with controllers using sq flow control
>> to implement QoS.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
>>   1 file changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>> index a19ddb61039d..628397bd5065 100644
>> --- a/drivers/nvme/host/fc.c
>> +++ b/drivers/nvme/host/fc.c
>> @@ -12,6 +12,7 @@
>>     #include "nvme.h"
>>   #include "fabrics.h"
>> +#include "trace.h"
>>   #include <linux/nvme-fc-driver.h>
>>   #include <linux/nvme-fc.h>
>>   #include <scsi/scsi_transport_fc.h>
>> @@ -34,7 +35,8 @@ struct nvme_fc_queue {
>>       size_t            cmnd_capsule_len;
>>       u32            qnum;
>>       u32            seqno;
>> -
>> +    int            sq_head;
>> +    int            sq_tail;
>>       u64            connection_id;
>>       atomic_t        csn;
>>   @@ -1671,6 +1673,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>>                   cqe->command_id);
>>               goto done;
>>           }
>> +        WRITE_ONCE(queue->sq_head, cpu_to_le16(cqe->sq_head));
>> +        trace_nvme_sq(rq, cqe->sq_head, queue->sq_tail);
>>           result = cqe->result;
>>           status = cqe->status;
>>           break;
>> @@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl,
>> struct request *rq,
>>       freq->sg_cnt = 0;
>>   }
>>   +static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int
>> incr)
>> +{
>> +    int old_sqtl, new_sqtl;
>> +
>> +    do {
>> +        old_sqtl = queue->sq_tail;
>> +        new_sqtl = (old_sqtl + incr) % queue->ctrl->ctrl.sqsize;
>> +    } while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
>> +         old_sqtl);
>> +    return new_sqtl;
>> +}
>> +
>>   /*
>>    * In FC, the queue is a logical thing. At transport connect, the
>> target
>>    * creates its "queue" and returns a handle that is to be given to the
>> @@ -2219,6 +2235,14 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl,
>> struct nvme_fc_queue *queue,
>>       if (!nvme_fc_ctrl_get(ctrl))
>>           return BLK_STS_IOERR;
>>   +    if (!ctrl->ctrl.opts->disable_sqflow) {
>> +        if (nvme_fc_update_sq_tail(queue, 1) ==
>> +            READ_ONCE(queue->sq_head)) {
>> +            nvme_fc_update_sq_tail(queue, -1);
>> +            return BLK_STS_RESOURCE;
>> +        }
>> +    }
>> +
>>       /* format the FC-NVME CMD IU and fcp_req */
>>       cmdiu->connection_id = cpu_to_be64(queue->connection_id);
>>       cmdiu->data_len = cpu_to_be32(data_len);
>> @@ -2284,6 +2308,9 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl,
>> struct nvme_fc_queue *queue,
>>                       queue->lldd_handle, &op->fcp_req);
>>         if (ret) {
>> +        if (ctrl->ctrl.opts->disable_sqflow)
>> +            nvme_fc_update_sq_tail(queue, -1);
>> +
>>           /*
>>            * If the lld fails to send the command is there an issue with
>>            * the csn value?  If the command that fails is the Connect,
> 
> fyi - I NAK'd this patch mainly as it wasn't ordering ERSP's before
> processing SQHD.
> 
I know.

I'm working on V2, as I've found that I'll need to modify nvmet, too
(which currently doesn't send SQHD updates at all).
And the 'P' bit handling is missing completely.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH RFC] nvme/fc: sq flow control
  2020-03-10  6:55   ` Hannes Reinecke
@ 2020-03-10 16:44     ` James Smart
  0 siblings, 0 replies; 16+ messages in thread
From: James Smart @ 2020-03-10 16:44 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: linux-nvme, Sagi Grimberg, Chaitanya Kulkarni, John Meneghini



On 3/9/2020 11:55 PM, Hannes Reinecke wrote:
> On 3/9/20 10:59 PM, James Smart wrote:
>>
>> On 2/25/2020 3:59 PM, Hannes Reinecke wrote:
>>> As per NVMe-oF spec sq flow control is actually mandatory, and we should
>>> be implementing it to avoid the controller to return a fatal status
>>> error, and try to play nicely with controllers using sq flow control
>>> to implement QoS.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>    drivers/nvme/host/fc.c | 29 ++++++++++++++++++++++++++++-
>>>    1 file changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>>> index a19ddb61039d..628397bd5065 100644
>>> --- a/drivers/nvme/host/fc.c
>>> +++ b/drivers/nvme/host/fc.c
>>> @@ -12,6 +12,7 @@
>>>      #include "nvme.h"
>>>    #include "fabrics.h"
>>> +#include "trace.h"
>>>    #include <linux/nvme-fc-driver.h>
>>>    #include <linux/nvme-fc.h>
>>>    #include <scsi/scsi_transport_fc.h>
>>> @@ -34,7 +35,8 @@ struct nvme_fc_queue {
>>>        size_t            cmnd_capsule_len;
>>>        u32            qnum;
>>>        u32            seqno;
>>> -
>>> +    int            sq_head;
>>> +    int            sq_tail;
>>>        u64            connection_id;
>>>        atomic_t        csn;
>>>    @@ -1671,6 +1673,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>>>                    cqe->command_id);
>>>                goto done;
>>>            }
>>> +        WRITE_ONCE(queue->sq_head, cpu_to_le16(cqe->sq_head));
>>> +        trace_nvme_sq(rq, cqe->sq_head, queue->sq_tail);
>>>            result = cqe->result;
>>>            status = cqe->status;
>>>            break;
>>> @@ -2177,6 +2181,18 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl,
>>> struct request *rq,
>>>        freq->sg_cnt = 0;
>>>    }
>>>    +static int nvme_fc_update_sq_tail(struct nvme_fc_queue *queue, int
>>> incr)
>>> +{
>>> +    int old_sqtl, new_sqtl;
>>> +
>>> +    do {
>>> +        old_sqtl = queue->sq_tail;
>>> +        new_sqtl = (old_sqtl + incr) % queue->ctrl->ctrl.sqsize;
>>> +    } while (cmpxchg(&queue->sq_tail, old_sqtl, new_sqtl) !=
>>> +         old_sqtl);
>>> +    return new_sqtl;
>>> +}
>>> +
>>>    /*
>>>     * In FC, the queue is a logical thing. At transport connect, the
>>> target
>>>     * creates its "queue" and returns a handle that is to be given to the
>>> @@ -2219,6 +2235,14 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl,
>>> struct nvme_fc_queue *queue,
>>>        if (!nvme_fc_ctrl_get(ctrl))
>>>            return BLK_STS_IOERR;
>>>    +    if (!ctrl->ctrl.opts->disable_sqflow) {
>>> +        if (nvme_fc_update_sq_tail(queue, 1) ==
>>> +            READ_ONCE(queue->sq_head)) {
>>> +            nvme_fc_update_sq_tail(queue, -1);
>>> +            return BLK_STS_RESOURCE;
>>> +        }
>>> +    }
>>> +
>>>        /* format the FC-NVME CMD IU and fcp_req */
>>>        cmdiu->connection_id = cpu_to_be64(queue->connection_id);
>>>        cmdiu->data_len = cpu_to_be32(data_len);
>>> @@ -2284,6 +2308,9 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
>>>                        queue->lldd_handle, &op->fcp_req);
>>>          if (ret) {
>>> +        if (!ctrl->ctrl.opts->disable_sqflow)
>>> +            nvme_fc_update_sq_tail(queue, -1);
>>> +
>>>            /*
>>>             * If the lld fails to send the command is there an issue with
>>>             * the csn value?  If the command that fails is the Connect,
>> fyi - I NAK'd this patch mainly as it wasn't ordering ERSPs before
>> processing SQHD.
>>
> I know.
>
> I'm working on V2, as I've found that I'll need to modify nvmet, too
> (which currently doesn't send SQHD updates at all).
> And the 'P' bit handling is missing completely.
>
> Cheers,
>
> Hannes

that shouldn't be true. I added the sqhd updates in commit
f9cf2a64912d6 ("nvmet: synchronize sqhd update").

And "SQ Identifier and Phase Tag fields are reserved because they are 
not used in NVMe over Fabrics."

-- james
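
[Editor's note] The tail-advance and queue-full check from the patch can be modeled in plain userspace C. This is a simplified sketch, not the kernel code: the names (sq_model, sq_update_tail, sq_try_reserve) are illustrative, C11 atomics stand in for the kernel's cmpxchg(), and an extra + sqsize term keeps the modulo non-negative when the tail is rolled back with incr == -1.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified userspace model of the patch's SQ flow-control logic. */
struct sq_model {
	_Atomic int sq_tail;   /* next free submission slot */
	int sq_head;           /* last SQHD reported by the controller */
	int sqsize;            /* number of slots in the submission queue */
};

/* Advance (or roll back, with incr == -1) the tail: a compare-exchange
 * loop with modular arithmetic, mirroring nvme_fc_update_sq_tail().
 * The "+ q->sqsize" keeps the C modulo result non-negative. */
static int sq_update_tail(struct sq_model *q, int incr)
{
	int old_sqtl, new_sqtl;

	do {
		old_sqtl = atomic_load(&q->sq_tail);
		new_sqtl = (old_sqtl + incr + q->sqsize) % q->sqsize;
	} while (!atomic_compare_exchange_weak(&q->sq_tail,
					       &old_sqtl, new_sqtl));
	return new_sqtl;
}

/* Returns true if a slot was reserved; false if the queue is full,
 * i.e. advancing the tail would collide with the head, in which case
 * the tail is rolled back (as the BLK_STS_RESOURCE path does). */
static bool sq_try_reserve(struct sq_model *q)
{
	if (sq_update_tail(q, 1) == q->sq_head) {
		sq_update_tail(q, -1);
		return false;
	}
	return true;
}
```

With sqsize slots, at most sqsize - 1 commands can be outstanding, since tail == head is reserved to mean "empty"; the full check fires when the advanced tail would meet the head value carried back in the last completion.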



Thread overview: 16+ messages
2020-02-25 23:59 [PATCH RFC] nvme/fc: sq flow control Hannes Reinecke
2020-02-26  0:08 ` Sagi Grimberg
2020-02-26  0:14   ` Hannes Reinecke
2020-02-26  0:38     ` Sagi Grimberg
2020-02-27 11:27       ` Hannes Reinecke
2020-02-26 10:44 ` Martin Wilck
2020-02-26 15:47   ` Hannes Reinecke
2020-02-26 23:45 ` Sagi Grimberg
2020-02-27  1:46   ` James Smart
2020-02-27  3:52     ` Sagi Grimberg
2020-02-27 21:46       ` Meneghini, John
2020-02-28 16:35       ` James Smart
2020-02-28 10:39   ` Hannes Reinecke
2020-03-09 21:59 ` James Smart
2020-03-10  6:55   ` Hannes Reinecke
2020-03-10 16:44     ` James Smart
