* [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
@ 2022-03-08 15:16 Mingbao Sun
  2022-03-08 15:36 ` Mingbao Sun
  2022-03-09  6:14 ` Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Mingbao Sun @ 2022-03-08 15:16 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

From: Mingbao Sun <tyler.sun@dell.com>

Congestion control can have a noticeable impact on the performance
of TCP-based communications. This is of course true for NVMe/TCP as
well.

Different congestion-control algorithms (e.g., cubic, dctcp) suit
different scenarios. Adopting the proper one can benefit
performance, while a poor fit can severely degrade it.

Though we can specify the congestion control for NVMe/TCP by
writing '/proc/sys/net/ipv4/tcp_congestion_control', this also
changes the congestion control of all future TCP sockets that have
not been explicitly assigned one, thus potentially impacting their
performance.
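
For example (algorithm names below are illustrative), this global
knob affects every TCP socket that does not set TCP_CONGESTION
explicitly:

    # cat /proc/sys/net/ipv4/tcp_congestion_control
    cubic
    # echo dctcp > /proc/sys/net/ipv4/tcp_congestion_control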

So it makes sense for NVMe/TCP to support specifying the congestion
control per connection. This commit addresses the host side.

Implementation approach:
a new option named 'tcp_congestion' is added to the fabrics
opt_tokens so that the 'nvme connect' command can pass in the
user-specified congestion control. Later, in nvme_tcp_alloc_queue,
the specified congestion control is applied to the relevant sockets
of the host side.
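
For example (address, port and NQN below are placeholders), the new
option travels in the comma-separated option string that nvme-cli
writes to /dev/nvme-fabrics:

    transport=tcp,traddr=192.168.1.100,trsvcid=4420,nqn=nqn.2014-08.org.example:testnqn,tcp_congestion=dctcp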

Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
 drivers/nvme/host/fabrics.c | 12 ++++++++++++
 drivers/nvme/host/fabrics.h |  2 ++
 drivers/nvme/host/tcp.c     | 20 +++++++++++++++++++-
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ee79a6d639b4..79d5f0dbafd3 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -548,6 +548,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_TOS,			"tos=%d"		},
 	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_DISCOVERY,		"discovery"		},
+	{ NVMF_OPT_TCP_CONGESTION,	"tcp_congestion=%s"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -829,6 +830,16 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_DISCOVERY:
 			opts->discovery_nqn = true;
 			break;
+		case NVMF_OPT_TCP_CONGESTION:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			kfree(opts->tcp_congestion);
+			opts->tcp_congestion = p;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
@@ -947,6 +958,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 	kfree(opts->subsysnqn);
 	kfree(opts->host_traddr);
 	kfree(opts->host_iface);
+	kfree(opts->tcp_congestion);
 	kfree(opts);
 }
 EXPORT_SYMBOL_GPL(nvmf_free_options);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index c3203ff1c654..25fdc169949d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -68,6 +68,7 @@ enum {
 	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 	NVMF_OPT_HOST_IFACE	= 1 << 21,
 	NVMF_OPT_DISCOVERY	= 1 << 22,
+	NVMF_OPT_TCP_CONGESTION	= 1 << 23,
 };
 
 /**
@@ -117,6 +118,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_io_queues;
 	unsigned int		reconnect_delay;
 	bool			discovery_nqn;
+	const char		*tcp_congestion;
 	bool			duplicate_connect;
 	unsigned int		kato;
 	struct nvmf_host	*host;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index babbc14a4b76..3415e178a78b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1403,6 +1403,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	char ca_name[TCP_CA_NAME_MAX];
+	sockptr_t optval;
 	int ret, rcv_pdu_size;
 
 	mutex_init(&queue->queue_lock);
@@ -1447,6 +1449,21 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	if (nctrl->opts->tos >= 0)
 		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
 
+	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
+		strncpy(ca_name, nctrl->opts->tcp_congestion,
+			TCP_CA_NAME_MAX-1);
+		optval = KERNEL_SOCKPTR(ca_name);
+		ret = sock_common_setsockopt(queue->sock, IPPROTO_TCP,
+					     TCP_CONGESTION, optval,
+					     strlen(ca_name));
+		if (ret) {
+			dev_err(nctrl->device,
+				"failed to set TCP congestion to %s: %d\n",
+				ca_name, ret);
+			goto err_sock;
+		}
+	}
+
 	/* Set 10 seconds timeout for icresp recvmsg */
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
@@ -2610,7 +2627,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
 			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
 			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
-			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE |
+			  NVMF_OPT_TCP_CONGESTION,
 	.create_ctrl	= nvme_tcp_create_ctrl,
 };
 
-- 
2.26.2



* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-08 15:16 [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control Mingbao Sun
@ 2022-03-08 15:36 ` Mingbao Sun
  2022-03-09  6:14 ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Mingbao Sun @ 2022-03-08 15:36 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

Per the comments from Christoph Hellwig, the calls to networking
APIs in nvme-fabrics.ko were deleted.

Since the tcp_congestion passed in from user space still gets
validated later by sock_common_setsockopt in nvme_tcp_alloc_queue,
this deletion brings no downside to the 'nvme connect' command.


* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-08 15:16 [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control Mingbao Sun
  2022-03-08 15:36 ` Mingbao Sun
@ 2022-03-09  6:14 ` Christoph Hellwig
  2022-03-09  7:32   ` Mingbao Sun
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2022-03-09  6:14 UTC (permalink / raw)
  To: Mingbao Sun
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel, tyler.sun,
	ping.gan, yanxiu.cai, libin.zhang, ao.sun

On Tue, Mar 08, 2022 at 11:16:06PM +0800, Mingbao Sun wrote:
> [...]
> @@ -1447,6 +1449,21 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>  	if (nctrl->opts->tos >= 0)
>  		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
>  
> +	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
> +		strncpy(ca_name, nctrl->opts->tcp_congestion,
> +			TCP_CA_NAME_MAX-1);
> +		optval = KERNEL_SOCKPTR(ca_name);
> +		ret = sock_common_setsockopt(queue->sock, IPPROTO_TCP,
> +					     TCP_CONGESTION, optval,
> +					     strlen(ca_name));

This needs to use kernel_setsockopt.  I also can see absolutely no
need for the optval local variable, and I also don't really see why
we need ca_name either - if we need to limit the length and terminate
it (but why?) that can be done during option parsing.


* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-09  6:14 ` Christoph Hellwig
@ 2022-03-09  7:32   ` Mingbao Sun
  2022-03-09 13:41     ` Mingbao Sun
  0 siblings, 1 reply; 7+ messages in thread
From: Mingbao Sun @ 2022-03-09  7:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	linux-nvme, linux-kernel, tyler.sun, ping.gan, yanxiu.cai,
	libin.zhang, ao.sun

> > @@ -1447,6 +1449,21 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
> >  	if (nctrl->opts->tos >= 0)
> >  		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
> >  
> > +	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
> > +		strncpy(ca_name, nctrl->opts->tcp_congestion,
> > +			TCP_CA_NAME_MAX-1);
> > +		optval = KERNEL_SOCKPTR(ca_name);
> > +		ret = sock_common_setsockopt(queue->sock, IPPROTO_TCP,
> > +					     TCP_CONGESTION, optval,
> > +					     strlen(ca_name));  
> 
> This needs to use kernel_setsockopt.  I also can see absolutely no
> need for the optval local variable, and I also don't really see why
> we need ca_name either - if we need to limit the length and terminate
> it (but why?) that can be done during option parsing.

Well, actually I did use kernel_setsockopt at the beginning, but the
compilation failed.

I then found that the kernel_setsockopt API was removed in kernel
v5.8, so I used sock_common_setsockopt instead. The ca_name and
optval variables exist only to satisfy the prototype of
sock_common_setsockopt, which takes a sockptr_t and a length.

But now, having thought over your comments and investigated the
story behind the removal of kernel_setsockopt, I feel I should use
tcp_set_congestion_control instead of sock_common_setsockopt, just
as target/tcp.c replaced kernel_setsockopt with ip_sock_set_tos.
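
A minimal sketch of how that replacement might look in
nvme_tcp_alloc_queue (assuming the current four-argument prototype
of tcp_set_congestion_control; the TCP_CONGESTION setsockopt path
takes the socket lock around this call, so the sketch does the same):

	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
		lock_sock(queue->sock->sk);
		/* load=true: allow loading the CA module if needed */
		ret = tcp_set_congestion_control(queue->sock->sk,
				nctrl->opts->tcp_congestion, true, true);
		release_sock(queue->sock->sk);
		if (ret) {
			dev_err(nctrl->device,
				"failed to set TCP congestion to %s: %d\n",
				nctrl->opts->tcp_congestion, ret);
			goto err_sock;
		}
	}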

As for the length limitation on the congestion-control name, I will
(if required) move that check into the option-parsing phase in the
next version.
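
For instance, the parsing-time check could be a sketch like the
following, extending the existing NVMF_OPT_TCP_CONGESTION case in
nvmf_parse_options:

		case NVMF_OPT_TCP_CONGESTION:
			p = match_strdup(args);
			if (!p) {
				ret = -ENOMEM;
				goto out;
			}
			/* reject names too long for a CA name buffer */
			if (strlen(p) >= TCP_CA_NAME_MAX) {
				ret = -EINVAL;
				kfree(p);
				goto out;
			}

			kfree(opts->tcp_congestion);
			opts->tcp_congestion = p;
			break;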


* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-09  7:32   ` Mingbao Sun
@ 2022-03-09 13:41     ` Mingbao Sun
  2022-03-10  8:19       ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Mingbao Sun @ 2022-03-09 13:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	linux-nvme, linux-kernel, tyler.sun, ping.gan, yanxiu.cai,
	libin.zhang, ao.sun

On Wed, 9 Mar 2022 15:32:33 +0800
Mingbao Sun <sunmingbao@tom.com> wrote:

> > > [...]
> > 
> > This needs to use kernel_setsockopt.  I also can see absolutely no
> > need for the optval local variable, and I also don't really see why
> > we need ca_name either - if we need to limit the length and terminate
> > it (but why?) that can be done during option parsing.  

Regarding the replacement of 'sock_common_setsockopt':

Per the history of the removal of 'kernel_setsockopt', users of that
API should switch to small helper functions that set a specific
sockopt directly.

So I tried 'tcp_set_congestion_control', but found that this symbol
is not exported yet. After adding
'EXPORT_SYMBOL_GPL(tcp_set_congestion_control);' to my local source,
it works well in testing.
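
(For reference, the local change is just a one-liner in
net/ipv4/tcp_cong.c, right after the function body:)

	EXPORT_SYMBOL_GPL(tcp_set_congestion_control);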

Then what should I do with this?



* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-09 13:41     ` Mingbao Sun
@ 2022-03-10  8:19       ` Christoph Hellwig
  2022-03-10  8:31         ` Mingbao Sun
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2022-03-10  8:19 UTC (permalink / raw)
  To: Mingbao Sun
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel, tyler.sun,
	ping.gan, yanxiu.cai, libin.zhang, ao.sun

On Wed, Mar 09, 2022 at 09:41:59PM +0800, Mingbao Sun wrote:
> So I tried 'tcp_set_congestion_control', but found that this symbol
> is not exported yet. After adding
> 'EXPORT_SYMBOL_GPL(tcp_set_congestion_control);' to my local source,
> it works well in testing.
> 
> Then what should I do with this?

Add the export in a separate, clearly documented patch, and Cc the
netdev list and maintainers to get their opinion on it.


* Re: [PATCH v2 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-10  8:19       ` Christoph Hellwig
@ 2022-03-10  8:31         ` Mingbao Sun
  0 siblings, 0 replies; 7+ messages in thread
From: Mingbao Sun @ 2022-03-10  8:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	linux-nvme, linux-kernel, tyler.sun, ping.gan, yanxiu.cai,
	libin.zhang, ao.sun

On Thu, 10 Mar 2022 09:19:08 +0100
Christoph Hellwig <hch@lst.de> wrote:

> On Wed, Mar 09, 2022 at 09:41:59PM +0800, Mingbao Sun wrote:
> > So I tried 'tcp_set_congestion_control', but found that this
> > symbol is not exported yet. After adding
> > 'EXPORT_SYMBOL_GPL(tcp_set_congestion_control);' to my local
> > source, it works well in testing.
> > 
> > Then what should I do with this?
> 
> Add the export in a separate, clearly documented patch, and Cc the
> netdev list and maintainers to get their opinion on it.

Got it.
Will do that soon.

