* [PATCH 0/2] NVMe_over_TCP: support specifying the congestion-control
@ 2022-03-04  9:27 Mingbao Sun
  2022-03-04  9:27 ` [PATCH 1/2] nvmet-tcp: " Mingbao Sun
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-04  9:27 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

From: Mingbao Sun <tyler.sun@dell.com>

Hi all,

Congestion control can have a noticeable impact on the performance of
TCP-based communications, and this is of course true for NVMe over TCP.

Different congestion-control algorithms (e.g., cubic, dctcp) are
suitable for different scenarios. Choosing a proper one benefits
performance, whereas a poor choice can severely degrade it.

Although we can specify the congestion control for NVMe over TCP by
writing '/proc/sys/net/ipv4/tcp_congestion_control', this also changes
the congestion control of all future TCP sockets that have not been
explicitly assigned one, potentially impacting their performance.

So it makes sense to make NVMe over TCP support specifying the
congestion control.
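
For reference, the per-socket counterpart of that global sysctl is the
TCP_CONGESTION socket option, which is also what this series uses
underneath. A minimal user-space sketch, for illustration only (it
assumes the dctcp module is available on the running kernel):

/*
 * Illustrative sketch: set the congestion-control algorithm on one
 * TCP socket only, instead of changing the system-wide default via
 * /proc/sys/net/ipv4/tcp_congestion_control.  "dctcp" is an example
 * and must be built into or loaded by the running kernel.
 */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	const char ca[] = "dctcp";
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* Only this socket is affected; other sockets keep the default. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, ca, strlen(ca))) {
		perror("setsockopt(TCP_CONGESTION)");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}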

The first commit addresses the target side, and the second one
addresses the host side.

Mingbao Sun (2):
  nvmet-tcp: support specifying the congestion-control
  nvme-tcp: support specifying the congestion-control

 drivers/nvme/host/fabrics.c    | 24 ++++++++++++++++
 drivers/nvme/host/fabrics.h    |  2 ++
 drivers/nvme/host/tcp.c        | 20 ++++++++++++-
 drivers/nvme/target/configfs.c | 52 ++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h    |  1 +
 drivers/nvme/target/tcp.c      | 27 ++++++++++++++++++
 6 files changed, 125 insertions(+), 1 deletion(-)

-- 
2.26.2



* [PATCH 1/2] nvmet-tcp: support specifying the congestion-control
  2022-03-04  9:27 [PATCH 0/2] NVMe_over_TCP: support specifying the congestion-control Mingbao Sun
@ 2022-03-04  9:27 ` Mingbao Sun
  2022-03-04  9:27 ` [PATCH 2/2] nvme-tcp: " Mingbao Sun
  2022-03-08 13:03 ` [PATCH 0/2] NVMe_over_TCP: " Mingbao Sun
  2 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-04  9:27 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

From: Mingbao Sun <tyler.sun@dell.com>

Congestion control can have a noticeable impact on the performance of
TCP-based communications, and this is of course true for NVMe over TCP.

Different congestion-control algorithms (e.g., cubic, dctcp) are
suitable for different scenarios. Choosing a proper one benefits
performance, whereas a poor choice can severely degrade it.

Although we can specify the congestion control for NVMe over TCP by
writing '/proc/sys/net/ipv4/tcp_congestion_control', this also changes
the congestion control of all future TCP sockets that have not been
explicitly assigned one, potentially impacting their performance.

So it makes sense to make NVMe over TCP support specifying the
congestion control. This commit addresses the target side.

Implementation approach:
a new configfs entry is created for the user to specify the
congestion control of each nvmet port:
'/sys/kernel/config/nvmet/ports/X/tcp_congestion'
Later, in nvmet_tcp_add_port, the specified congestion control is
applied to the listening socket of the nvmet port.
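
For illustration only (not part of this patch): a minimal user-space
sketch of selecting dctcp for an example nvmet port 1 through this new
entry, assuming the port directory has already been created and the
port is not yet enabled:

/*
 * Illustrative sketch: write the algorithm name into the new configfs
 * entry of nvmet port 1 (example port number) before enabling the port.
 * Assumes /sys/kernel/config/nvmet/ports/1 already exists.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/config/nvmet/ports/1/tcp_congestion";
	const char *alg = "dctcp\n";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, alg, strlen(alg)) < 0)
		perror("write");
	close(fd);
	return 0;
}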

Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
 drivers/nvme/target/configfs.c | 52 ++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h    |  1 +
 drivers/nvme/target/tcp.c      | 27 ++++++++++++++++++
 3 files changed, 80 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 091a0ca16361..fcf01f2b8045 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -11,6 +11,7 @@
 #include <linux/ctype.h>
 #include <linux/pci.h>
 #include <linux/pci-p2pdma.h>
+#include <net/tcp.h>
 
 #include "nvmet.h"
 
@@ -222,6 +223,55 @@ static ssize_t nvmet_addr_trsvcid_store(struct config_item *item,
 
 CONFIGFS_ATTR(nvmet_, addr_trsvcid);
 
+static ssize_t nvmet_tcp_congestion_show(struct config_item *item,
+		char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	return snprintf(page, PAGE_SIZE, "%s\n",
+			port->tcp_congestion ? port->tcp_congestion : "");
+}
+
+static ssize_t nvmet_tcp_congestion_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	int len;
+	bool ecn_ca;
+	u32 key;
+
+	len = strcspn(page, "\n");
+	if (!len)
+		return -EINVAL;
+
+	if (len >= TCP_CA_NAME_MAX) {
+		pr_err("name of TCP congestion control can not exceed %d bytes.\n",
+		       TCP_CA_NAME_MAX);
+		return -EINVAL;
+	}
+
+	if (nvmet_is_port_enabled(port, __func__))
+		return -EACCES;
+
+	kfree(port->tcp_congestion);
+	port->tcp_congestion = kmemdup_nul(page, len, GFP_KERNEL);
+	if (!port->tcp_congestion)
+		return -ENOMEM;
+
+	key = tcp_ca_get_key_by_name(NULL, port->tcp_congestion, &ecn_ca);
+	if (key == TCP_CA_UNSPEC) {
+		pr_err("congestion control %s not found.\n",
+		       port->tcp_congestion);
+		kfree(port->tcp_congestion);
+		port->tcp_congestion = NULL;
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_, tcp_congestion);
+
 static ssize_t nvmet_param_inline_data_size_show(struct config_item *item,
 		char *page)
 {
@@ -1597,6 +1647,7 @@ static void nvmet_port_release(struct config_item *item)
 	list_del(&port->global_entry);
 
 	kfree(port->ana_state);
+	kfree(port->tcp_congestion);
 	kfree(port);
 }
 
@@ -1605,6 +1656,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
+	&nvmet_attr_tcp_congestion,
 	&nvmet_attr_addr_trtype,
 	&nvmet_attr_param_inline_data_size,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 69637bf8f8e1..76a57c4c3456 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -145,6 +145,7 @@ struct nvmet_port {
 	struct config_group		ana_groups_group;
 	struct nvmet_ana_group		ana_default_group;
 	enum nvme_ana_state		*ana_state;
+	const char			*tcp_congestion;
 	void				*priv;
 	bool				enabled;
 	int				inline_data_size;
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 83ca577f72be..3b72e782c901 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1657,8 +1657,10 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
 	struct nvmet_tcp_port *port =
 		container_of(w, struct nvmet_tcp_port, accept_work);
 	struct socket *newsock;
+	struct inet_connection_sock *icsk, *icsk_new;
 	int ret;
 
+	icsk = inet_csk(port->sock->sk);
 	while (true) {
 		ret = kernel_accept(port->sock, &newsock, O_NONBLOCK);
 		if (ret < 0) {
@@ -1666,6 +1668,16 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
 				pr_warn("failed to accept err=%d\n", ret);
 			return;
 		}
+
+		if (port->nport->tcp_congestion) {
+			icsk_new = inet_csk(newsock->sk);
+			if (icsk_new->icsk_ca_ops != icsk->icsk_ca_ops) {
+				pr_warn("congestion abnormal: expected %s, actual %s.\n",
+					icsk->icsk_ca_ops->name,
+					icsk_new->icsk_ca_ops->name);
+			}
+		}
+
 		ret = nvmet_tcp_alloc_queue(port, newsock);
 		if (ret) {
 			pr_err("failed to allocate queue\n");
@@ -1693,6 +1705,8 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 {
 	struct nvmet_tcp_port *port;
 	__kernel_sa_family_t af;
+	char ca_name[TCP_CA_NAME_MAX];
+	sockptr_t optval;
 	int ret;
 
 	port = kzalloc(sizeof(*port), GFP_KERNEL);
@@ -1741,6 +1755,19 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 	if (so_priority > 0)
 		sock_set_priority(port->sock->sk, so_priority);
 
+	if (nport->tcp_congestion) {
+		strncpy(ca_name, nport->tcp_congestion, TCP_CA_NAME_MAX-1);
+		optval = KERNEL_SOCKPTR(ca_name);
+		ret = sock_common_setsockopt(port->sock, IPPROTO_TCP,
+					     TCP_CONGESTION, optval,
+					     strlen(ca_name));
+		if (ret) {
+			pr_err("failed to set port socket's congestion to %s: %d\n",
+			       ca_name, ret);
+			goto err_sock;
+		}
+	}
+
 	ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
 			sizeof(port->addr));
 	if (ret) {
-- 
2.26.2



* [PATCH 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-04  9:27 [PATCH 0/2] NVMe_over_TCP: support specifying the congestion-control Mingbao Sun
  2022-03-04  9:27 ` [PATCH 1/2] nvmet-tcp: " Mingbao Sun
@ 2022-03-04  9:27 ` Mingbao Sun
  2022-03-04 16:20   ` Christoph Hellwig
  2022-03-08 13:03 ` [PATCH 0/2] NVMe_over_TCP: " Mingbao Sun
  2 siblings, 1 reply; 8+ messages in thread
From: Mingbao Sun @ 2022-03-04  9:27 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

From: Mingbao Sun <tyler.sun@dell.com>

Congestion control can have a noticeable impact on the performance of
TCP-based communications, and this is of course true for NVMe over TCP.

Different congestion-control algorithms (e.g., cubic, dctcp) are
suitable for different scenarios. Choosing a proper one benefits
performance, whereas a poor choice can severely degrade it.

Although we can specify the congestion control for NVMe over TCP by
writing '/proc/sys/net/ipv4/tcp_congestion_control', this also changes
the congestion control of all future TCP sockets that have not been
explicitly assigned one, potentially impacting their performance.

So it makes sense to make NVMe over TCP support specifying the
congestion control. This commit addresses the host side.

Implementation approach:
a new option named 'tcp_congestion' is added to the fabrics opt_tokens
so that the 'nvme connect' command can pass in the congestion control
specified by the user.
Later, in nvme_tcp_alloc_queue, the specified congestion control is
applied to the relevant sockets on the host side.
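
For illustration only (not part of this patch): a minimal user-space
sketch of submitting such an option string directly to
/dev/nvme-fabrics, which is the interface nvme-cli uses underneath.
The nqn/traddr/trsvcid values are made-up examples, and a real
'nvme connect' passes additional options:

/*
 * Illustrative sketch: write a connect-option string containing the new
 * 'tcp_congestion' token to /dev/nvme-fabrics, where nvmf_parse_options
 * parses it.  The nqn/traddr/trsvcid values are made-up examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *opts = "nqn=testnqn,transport=tcp,traddr=192.168.1.10,"
			   "trsvcid=4420,tcp_congestion=dctcp";
	int fd = open("/dev/nvme-fabrics", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, opts, strlen(opts)) < 0)
		perror("write");
	close(fd);
	return 0;
}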

Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
 drivers/nvme/host/fabrics.c | 24 ++++++++++++++++++++++++
 drivers/nvme/host/fabrics.h |  2 ++
 drivers/nvme/host/tcp.c     | 20 +++++++++++++++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ee79a6d639b4..6d946f758372 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -10,6 +10,7 @@
 #include <linux/mutex.h>
 #include <linux/parser.h>
 #include <linux/seq_file.h>
+#include <net/tcp.h>
 #include "nvme.h"
 #include "fabrics.h"
 
@@ -548,6 +549,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_TOS,			"tos=%d"		},
 	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_DISCOVERY,		"discovery"		},
+	{ NVMF_OPT_TCP_CONGESTION,	"tcp_congestion=%s"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -560,6 +562,8 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	size_t nqnlen  = 0;
 	int ctrl_loss_tmo = NVMF_DEF_CTRL_LOSS_TMO;
 	uuid_t hostid;
+	bool ecn_ca;
+	u32 key;
 
 	/* Set defaults */
 	opts->queue_size = NVMF_DEF_QUEUE_SIZE;
@@ -829,6 +833,25 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_DISCOVERY:
 			opts->discovery_nqn = true;
 			break;
+		case NVMF_OPT_TCP_CONGESTION:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			key = tcp_ca_get_key_by_name(NULL, p, &ecn_ca);
+			if (key == TCP_CA_UNSPEC) {
+				pr_err("congestion control %s not found.\n",
+				       p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+
+			kfree(opts->tcp_congestion);
+			opts->tcp_congestion = p;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
@@ -947,6 +970,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 	kfree(opts->subsysnqn);
 	kfree(opts->host_traddr);
 	kfree(opts->host_iface);
+	kfree(opts->tcp_congestion);
 	kfree(opts);
 }
 EXPORT_SYMBOL_GPL(nvmf_free_options);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index c3203ff1c654..25fdc169949d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -68,6 +68,7 @@ enum {
 	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 	NVMF_OPT_HOST_IFACE	= 1 << 21,
 	NVMF_OPT_DISCOVERY	= 1 << 22,
+	NVMF_OPT_TCP_CONGESTION	= 1 << 23,
 };
 
 /**
@@ -117,6 +118,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_io_queues;
 	unsigned int		reconnect_delay;
 	bool			discovery_nqn;
+	const char		*tcp_congestion;
 	bool			duplicate_connect;
 	unsigned int		kato;
 	struct nvmf_host	*host;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 6cbcc8b4daaf..cb2c7d7371d4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1403,6 +1403,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	char ca_name[TCP_CA_NAME_MAX];
+	sockptr_t optval;
 	int ret, rcv_pdu_size;
 
 	mutex_init(&queue->queue_lock);
@@ -1447,6 +1449,21 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	if (nctrl->opts->tos >= 0)
 		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
 
+	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
+		strncpy(ca_name, nctrl->opts->tcp_congestion,
+			TCP_CA_NAME_MAX-1);
+		optval = KERNEL_SOCKPTR(ca_name);
+		ret = sock_common_setsockopt(queue->sock, IPPROTO_TCP,
+					     TCP_CONGESTION, optval,
+					     strlen(ca_name));
+		if (ret) {
+			dev_err(nctrl->device,
+				"failed to set TCP congestion to %s: %d\n",
+				ca_name, ret);
+			goto err_sock;
+		}
+	}
+
 	/* Set 10 seconds timeout for icresp recvmsg */
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
@@ -2611,7 +2628,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
 			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
 			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
-			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE |
+			  NVMF_OPT_TCP_CONGESTION,
 	.create_ctrl	= nvme_tcp_create_ctrl,
 };
 
-- 
2.26.2



* Re: [PATCH 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-04  9:27 ` [PATCH 2/2] nvme-tcp: " Mingbao Sun
@ 2022-03-04 16:20   ` Christoph Hellwig
  2022-03-05  7:09     ` Mingbao Sun
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2022-03-04 16:20 UTC (permalink / raw)
  To: Mingbao Sun
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel, tyler.sun,
	ping.gan, yanxiu.cai, libin.zhang, ao.sun

I'll let the NVMe/TCP maintainer comment on the actual functionality,
but:

> +			p = match_strdup(args);
> +			if (!p) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +
> +			key = tcp_ca_get_key_by_name(NULL, p, &ecn_ca);
> +			if (key == TCP_CA_UNSPEC) {
> +				pr_err("congestion control %s not found.\n",
> +				       p);
> +				ret = -EINVAL;
> +				kfree(p);
> +				goto out;
> +			}

We can't just call networking code from nvme-fabrics.ko


* Re: [PATCH 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-04 16:20   ` Christoph Hellwig
@ 2022-03-05  7:09     ` Mingbao Sun
  2022-03-08  7:12       ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Mingbao Sun @ 2022-03-05  7:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	linux-nvme, linux-kernel, tyler.sun, ping.gan, yanxiu.cai,
	libin.zhang, ao.sun

On Fri, 4 Mar 2022 17:20:32 +0100
Christoph Hellwig <hch@lst.de> wrote:

> I'll let the NVMe/TCP maintainer comment on the actual functionality,
> but:
> 
> > +			p = match_strdup(args);
> > +			if (!p) {
> > +				ret = -ENOMEM;
> > +				goto out;
> > +			}
> > +
> > +			key = tcp_ca_get_key_by_name(NULL, p, &ecn_ca);
> > +			if (key == TCP_CA_UNSPEC) {
> > +				pr_err("congestion control %s not found.\n",
> > +				       p);
> > +				ret = -EINVAL;
> > +				kfree(p);
> > +				goto out;
> > +			}
> 
> We can't just call networking code from nvme-fabrics.ko

Well, I did actually consider whether calling a networking API here is
proper, since I found that there are no calls to PCI/RDMA/TCP APIs in
fabrics.c.

But I hope the following can serve as a defense for it:

Anyway, we need to validate the tcp_congestion passed in from
user-space, right?
And it's reasonable to validate it via a networking API, right?

The role of nvmf_parse_options is similar to that of
drivers/nvme/target/configfs.c on the target side.
And neither of them can avoid handling options specific to the
sub-classes (e.g., NVMF_OPT_HDR_DIGEST, NVMF_OPT_TOS, NVMF_OPT_KATO).

Given that configfs.c already contains some RDMA-specific code and
calls the PCI-specific APIs pci_p2pdma_enable_store and
pci_p2pdma_enable_show, I added calls to networking APIs in configfs.c
to validate the tcp_congestion specified by the user.

So I feel this is also acceptable for nvme-fabrics.ko.


* Re: [PATCH 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-05  7:09     ` Mingbao Sun
@ 2022-03-08  7:12       ` Christoph Hellwig
  2022-03-08  7:57         ` Mingbao Sun
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2022-03-08  7:12 UTC (permalink / raw)
  To: Mingbao Sun
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel, tyler.sun,
	ping.gan, yanxiu.cai, libin.zhang, ao.sun

On Sat, Mar 05, 2022 at 03:09:15PM +0800, Mingbao Sun wrote:
> Well, I did actually consider whether calling a networking API here is
> proper, since I found that there are no calls to PCI/RDMA/TCP APIs in
> fabrics.c.

Yes - for a good reason.  Without networking support your patch won't
even compile (both the host and target side).

> But I hope the following can serve as a defense for it:
> 
> Anyway, we need to validate the tcp_congestion passed in from
> user-space, right?

Do we?  It seems like no one else really calls this routine to verify
things.  In fact it has no modular users at all in the current tree.

> The role of nvmf_parse_options is similar to that of
> drivers/nvme/target/configfs.c on the target side.
> And neither of them can avoid handling options specific to the
> sub-classes (e.g., NVMF_OPT_HDR_DIGEST, NVMF_OPT_TOS, NVMF_OPT_KATO).

NVMF_OPT_KATO is completely generic, but yes, the other two are
transport specific.  None of them calls out into other modules
that would need dependencies, though.

I'm also a little concerned that no other in-kernel user like iSCSI,
NBD or NFS has any code like this.


* Re: [PATCH 2/2] nvme-tcp: support specifying the congestion-control
  2022-03-08  7:12       ` Christoph Hellwig
@ 2022-03-08  7:57         ` Mingbao Sun
  0 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-08  7:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	linux-nvme, linux-kernel, tyler.sun, ping.gan, yanxiu.cai,
	libin.zhang, ao.sun

On Tue, 8 Mar 2022 08:12:27 +0100
Christoph Hellwig <hch@lst.de> wrote:

> On Sat, Mar 05, 2022 at 03:09:15PM +0800, Mingbao Sun wrote:
> > Well, I did actually consider whether calling a networking API here is
> > proper, since I found that there are no calls to PCI/RDMA/TCP APIs in
> > fabrics.c.
> 
> Yes - for a good reason.  Without networking support your patch won't
> even compile (both the host and target side).

Accepted.

I will remove the calls to networking APIs in the next version.
On investigation, I found that the tcp_congestion also gets checked
later within sock_common_setsockopt in nvme_tcp_alloc_queue, so this
makes no difference to the 'nvme connect' command.

> 
> > But I hope the following can serve as a defense for it:
> > 
> > Anyway, we need to validate the tcp_congestion passed in from
> > user-space, right?
> 
> Do we?  It seems like no one else really calls this routine to verify
> things.  In fact it has no modular users at all in the current tree.

OK. Got it.

> 
> > The role of nvmf_parse_options is similar to that of
> > drivers/nvme/target/configfs.c on the target side.
> > And neither of them can avoid handling options specific to the
> > sub-classes (e.g., NVMF_OPT_HDR_DIGEST, NVMF_OPT_TOS, NVMF_OPT_KATO).
> 
> NVMF_OPT_KATO is completely generic, but yes, the other two are
> transport specific.  None of them calls out into other modules
> that would need dependencies, though.

Yeah, NVMF_OPT_KATO is generic. Sorry for the mistake.

> 
> I'm also a little concerned that no other in-kernel user like iSCSI,
> NBD or NFS has any code like this.

Well, at least I can first remove the calls to networking APIs on the
host side, and that brings no downside.



* Re: [PATCH 0/2] NVMe_over_TCP: support specifying the congestion-control
  2022-03-04  9:27 [PATCH 0/2] NVMe_over_TCP: support specifying the congestion-control Mingbao Sun
  2022-03-04  9:27 ` [PATCH 1/2] nvmet-tcp: " Mingbao Sun
  2022-03-04  9:27 ` [PATCH 2/2] nvme-tcp: " Mingbao Sun
@ 2022-03-08 13:03 ` Mingbao Sun
  2 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-08 13:03 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, linux-nvme, linux-kernel
  Cc: tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun

I feel I should explain this a little more to convey the motivation
behind this feature.

As you know, InfiniBand/RoCE provides NVMe-oF with a lossless network
environment (that is, zero packet loss), which is a great advantage
for performance.

In contrast, 'TCP/IP + Ethernet' is often a lossy network environment
in which packets are frequently dropped.
Once a packet is dropped, timeout-based retransmission is triggered,
and that greatly damages performance.

So although NVMe/TCP may have bandwidth competitive with that of
NVMe/RDMA, the packet dropping of the former is a flaw in its
performance.

However, with the combination of the following conditions, NVMe/TCP
can be almost as competitive as NVMe/RDMA in the data center:

  - Ethernet NICs supporting QoS configuration (mapping the TOS/DSCP
    field of the IP header into a priority, and supporting PFC)

  - Ethernet switches supporting ECN marking and per-priority
    buffer-size adjustment

  - NVMe/TCP supports specifying the tos for its TCP traffic
    (already implemented)

  - NVMe/TCP supports specifying dctcp as the congestion-control of its
    TCP sockets (the work of this feature)

So this feature is the last piece, from the software side, needed to
complete the above combination.



