* [RFC net-next] net/smc:introduce 1RTT to SMC
@ 2022-05-24  6:52 D. Wythe
  2022-05-24  7:49 ` Tony Lu
  2022-05-24 12:40 ` kernel test robot
  0 siblings, 2 replies; 13+ messages in thread
From: D. Wythe @ 2022-05-24  6:52 UTC
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

Hi Karsten,

We are promoting SMC-R in the field of cloud computing. Due to the
particularity of business on the cloud, the scale and types of
customer applications are unpredictable. As a participant in SMC-R, we
also hope that SMC-R can cover more application scenarios, and many
connection problems have been exposed along the way. There are two
main issues: one is that establishing a single connection takes
longer than with TCP, and the other is that the degree of concurrency
is low under multi-connection processing. This patch set mainly
addresses the first issue; follow-up work on the second issue
will be posted in the future.

In terms of the communication process, under the current implementation
a TCP three-way handshake needs only 1-RTT, while SMC-R currently
requires 4-RTT: 2-RTT over IP (TCP handshake, SMC proposal & accept)
and 2-RTT over IB (two RKEY exchanges). This is currently the most
influential factor in connection establishment time.

We have noticed that a single network interface card is mainstream on
the cloud, due to cloud deployment cost advantages and the cloud's
own disaster recovery support. In addition, the emergence of RoCE
LAG technology means we no longer need to deal with multiple RDMA
network interface cards ourselves, just as with NIC bonding. At
Alibaba, RoCE LAG is widely used for RDMA.

In that case, SMC-R has only a single link. The RKEY LLC messages,
which exchange information across all links, are then no longer
needed, because the SMC proposal & accept have already completed the
exchange of all the information needed. So we think we can remove the
RKEY exchange in that case, which saves 2-RTT over IB. We call this
SMC-R 2-RTT.
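
As an illustration only (not part of the patch), the single-link
decision can be modelled in user space as below. The names
`link_group_model`, `MAX_LINKS`, `count_active_links()` and
`needs_confirm_rkey()` are hypothetical; the actual change is made in
smcr_lgr_reg_rmbs() in the diff that follows.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 3 /* hypothetical bound, chosen for illustration */

/* Hypothetical user-space model of a link group: one flag per link slot. */
struct link_group_model {
	bool link_active[MAX_LINKS];
};

/* Count the active links, mirroring the lnk counter added by the patch. */
static size_t count_active_links(const struct link_group_model *lgr)
{
	size_t i, n = 0;

	for (i = 0; i < MAX_LINKS; i++)
		if (lgr->link_active[i])
			n++;
	return n;
}

/* The CONFIRM_RKEY exchange is only performed with more than one link. */
static bool needs_confirm_rkey(const struct link_group_model *lgr)
{
	return count_active_links(lgr) > 1;
}
```

With one active link the helper returns false, so the two RKEY round
trips over IB are skipped; with two or more links the exchange is kept.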

On the other hand, we can use TCP fast open to carry the SMC proposal
data in the TCP SYN message, reducing the time that SMC waits for the
TCP connection to be established. This saves another 1-RTT over IP.
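
For readers less familiar with TCP fast open, here is a minimal
user-space sketch of the same idea using the standard socket API. It is
not the kernel code of this patch, and the helper names
(`tfo_enable_listener`, `tfo_prepare_msg`, `tfo_send_first`) are
hypothetical. Whether the payload really travels in the SYN depends on
the net.ipv4.tcp_fastopen setting and on a fast-open cookie obtained
from an earlier connection.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Server side: accept data carried in the SYN. qlen bounds the number of
 * pending fast-open requests on the listening socket.
 */
static int tfo_enable_listener(int listen_fd, int qlen)
{
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
			  &qlen, sizeof(qlen));
}

/* Client side, step 1: describe the peer and the first payload.
 * No connect() is issued beforehand.
 */
static void tfo_prepare_msg(struct msghdr *msg, struct iovec *iov,
			    struct sockaddr_in *peer, void *buf, size_t len)
{
	memset(msg, 0, sizeof(*msg));
	iov->iov_base = buf;
	iov->iov_len = len;
	msg->msg_name = peer;
	msg->msg_namelen = sizeof(*peer);
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
}

/* Client side, step 2: connect and send in a single call. */
static ssize_t tfo_send_first(int fd, const struct msghdr *msg)
{
	return sendmsg(fd, msg, MSG_FASTOPEN);
}
```

The patch does the equivalent inside the kernel: it sets MSG_FASTOPEN
together with msg_name and msg_namelen before calling kernel_sendmsg()
on the clcsock.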

Based on the above two points, in this scenario we can compress the
communication process of SMC-R into 1-RTT over IP, so that we can
theoretically obtain a time close to that of TCP connection
establishment. We call this SMC-R 1-RTT. Of course, the actual results
will also depend on the implementation.
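
The round-trip accounting described above can be written down as a small
table. This is only a bookkeeping sketch; the `hs_variant` enum, the
`phase` struct and `total_rtts()` are hypothetical names, and the
per-phase numbers are the ones stated in this mail.

```c
#include <stddef.h>

enum hs_variant { HS_BASELINE, HS_SMC_2RTT, HS_SMC_1RTT, HS_NVARIANTS };

/* One handshake phase and the round trips it costs in each variant. */
struct phase {
	const char *name;
	int rtts[HS_NVARIANTS];
};

static const struct phase phases[] = {
	/* With TCP fast open the proposal rides on the SYN, so the TCP
	 * handshake adds no round trip of its own in the 1-RTT variant.
	 */
	{ "TCP three-way handshake (IP)", { 1, 1, 0 } },
	{ "SMC proposal & accept (IP)",   { 1, 1, 1 } },
	/* Two RKEY exchanges, skipped when the link group has one link. */
	{ "RKEY exchange x2 (IB)",        { 2, 0, 0 } },
};

/* Sum the round trips of all phases for the given variant. */
static int total_rtts(enum hs_variant v)
{
	size_t i;
	int sum = 0;

	for (i = 0; i < sizeof(phases) / sizeof(phases[0]); i++)
		sum += phases[i].rtts[v];
	return sum;
}
```

Summing the columns gives 4 round trips for the current implementation,
2 for SMC-R 2-RTT and 1 for SMC-R 1-RTT.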

In our test environment, we host two VMs on the same host for wrk/nginx
tests, and use a script similar to the following to perform the test:

Client.sh

conn=$1
thread=$2

wrk -H 'Connection: Close' -c ${conn} -t ${thread} -d 10

Server.sh

sysctl -w net.ipv4.tcp_fastopen=3
smc_run nginx

The statistics show that:

+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|type|args  |   -c1 -t1     |   -c2 -t1     |   -c5 -t1      |  -c10 -t1    |   -c200 -t1    |  -c200 -t4    |  -c2000 -t8   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|net-next   |   4188.5qps   |   5942.04qps  |   7621.81qps   |  7678.62qps  |   8204.94qps   |  8457.57qps   |  5687.60qps   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|SMC-2RTT   |   4730.17qps  |   7394.85qps  |   11532.78qps  |  12016.22qps |   11520.81qps  |  11391.36qps  |  10364.41qps  |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|SMC-1RTT   |   5702.77qps  |   9645.18qps  |   11899.20qps  |  12005.16qps |   11536.67qps  |  11420.87qps  |  10392.4qps   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|TCP        |   6415.74qps  |   11034.10qps |   16716.21qps  |  22217.06qps |   35926.74qps  |  117460.qps   |  120291.16qps |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+

It can clearly be seen that:

1. In the serial short-connection scenario (-c1 -t1), SMC-R after
optimization reaches 88% of TCP. Many implementation details can still
be optimized; we hope to bring the performance of SMC in this scenario
to 90% of TCP.

2. The problem is very serious in multi-threaded, multi-connection
scenarios: the worst case is only about 10% of TCP. Although SMC-1RTT
brings some improvement in this scenario, it is clear that the
bottleneck lies elsewhere. We are prototyping a solution and hope to
reach 60% of TCP in multi-threaded, multi-connection scenarios;
SMC-1RTT is an important prerequisite that raises the upper limit of
subsequent optimization.
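
The percentages quoted in the two points can be rechecked from the table
above with a few lines of C. This is only a sketch: the array and
function names are hypothetical, and the TCP value for -c200 -t4 is
truncated in the table ("117460."), so it is taken as 117460.0 here.

```c
#include <stddef.h>

#define NCOLS 7

/* qps per column, in table order: -c1 -t1 ... -c2000 -t8. */
static const double smc_1rtt_qps[NCOLS] = {
	5702.77, 9645.18, 11899.20, 12005.16, 11536.67, 11420.87, 10392.4
};
/* The sixth value is truncated in the table; 117460.0 is an assumption. */
static const double tcp_qps[NCOLS] = {
	6415.74, 11034.10, 16716.21, 22217.06, 35926.74, 117460.0, 120291.16
};

/* SMC-1RTT throughput as a fraction of TCP throughput for one column. */
static double ratio_to_tcp(size_t col)
{
	return smc_1rtt_qps[col] / tcp_qps[col];
}

/* Index of the column where SMC-1RTT is furthest behind TCP. */
static size_t worst_column(void)
{
	size_t i, worst = 0;

	for (i = 1; i < NCOLS; i++)
		if (ratio_to_tcp(i) < ratio_to_tcp(worst))
			worst = i;
	return worst;
}
```

With these inputs, the first column (-c1 -t1) gives a ratio a little
under 0.89, and the lowest ratio falls in the last column (-c2000 -t8),
below 0.10.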

In this patch set, we have only completed a simple prototype that makes
sure SMC-1RTT works.

We sincerely look forward to your comments; please
let us know if you have any suggestions.

Thanks.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   | 72 ++++++++++++++++++++++++++++++++++++++++++------------
 net/smc/smc.h      |  8 ++++++
 net/smc/smc_clc.c  | 32 ++++++++++++++++++++----
 net/smc/smc_core.c |  2 ++
 net/smc/smc_pnet.c |  4 +--
 net/smc/smc_pnet.h |  3 +++
 6 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 1a556f4..bf646d1 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -492,7 +492,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 			     struct smc_buf_desc *rmb_desc)
 {
 	struct smc_link_group *lgr = link->lgr;
-	int i, rc = 0;
+	int i, lnk = 0, rc = 0;
 
 	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 	if (rc)
@@ -507,14 +507,20 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 		rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc);
 		if (rc)
 			goto out;
+		/* available link count inc */
+		lnk++;
 	}
 
-	/* exchange confirm_rkey msg with peer */
-	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
-	if (rc) {
-		rc = -EFAULT;
-		goto out;
+	/* do not exchange confirm_rkey msg when there is only one link */
+	if (lnk > 1) {
+		/* exchange confirm_rkey msg with peer */
+		rc = smc_llc_do_confirm_rkey(link, rmb_desc);
+		if (rc) {
+			rc = -EFAULT;
+			goto out;
+		}
 	}
+
 	rmb_desc->is_conf_rkey = true;
 out:
 	mutex_unlock(&lgr->llc_conf_mutex);
@@ -932,6 +938,31 @@ static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini)
 	return 0;
 }
 
+/* prototype code only:
+ * TCP connect has not happened yet, so use the route for smc_pnet_find_roce_by_pnetid()
+ */
+static int smc_find_rdma_device_with_dst(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct sock *tsk = smc->clcsock->sk;
+	struct rtable *rt;
+
+	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
+			     0, 0, 0);
+
+	if (IS_ERR(rt))
+		return -ECONNRESET;
+
+	smc_pnet_find_roce_by_pnetid(rt->dst.dev, ini);
+	__builtin_prefetch(&ini->ib_dev->mac[ini->ib_port - 1]);
+
+	if (!ini->check_smcrv2 && !ini->ib_dev)
+		return SMC_CLC_DECL_NOSMCRDEV;
+	if (ini->check_smcrv2 && !ini->smcrv2.ib_dev_v2)
+		return SMC_CLC_DECL_NOSMCRDEV;
+
+	return 0;
+}
+
 /* check if there is an ISM device available for this connection. */
 /* called for connect and listen */
 static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
@@ -1019,13 +1050,17 @@ static int smc_find_proposal_devices(struct smc_sock *smc,
 
 	/* check if there is an rdma device available */
 	if (!(ini->smcr_version & SMC_V1) ||
-	    smc_find_rdma_device(smc, ini))
+	    smc_find_rdma_device_with_dst(smc, ini))
 		ini->smcr_version &= ~SMC_V1;
 	/* else RDMA is supported for this connection */
 
 	ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1,
 					      ini->smcr_version & SMC_V1);
 
+	/* prototype only: skip the v2 checks for simplicity */
+	ini->smc_type_v2 = SMC_TYPE_N;
+	return rc;
+
 	/* check if there is an ism v2 device available */
 	if (!(ini->smcd_version & SMC_V2) ||
 	    !smc_ism_is_v2_capable() ||
@@ -1492,11 +1527,7 @@ static void smc_connect_work(struct work_struct *work)
 		smc->sk.sk_err = smc->clcsock->sk->sk_err;
 	} else if ((1 << smc->clcsock->sk->sk_state) &
 					(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
-		rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo);
-		if ((rc == -EPIPE) &&
-		    ((1 << smc->clcsock->sk->sk_state) &
-					(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)))
-			rc = 0;
+		rc = 0;
 	}
 	release_sock(smc->clcsock->sk);
 	lock_sock(&smc->sk);
@@ -1580,9 +1611,10 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
 		rc = -EALREADY;
 		goto out;
 	}
-	rc = kernel_connect(smc->clcsock, addr, alen, flags);
-	if (rc && rc != -EINPROGRESS)
-		goto out;
+
+	/* save the remote address; the connect happens later via MSG_FASTOPEN */
+	memcpy(&smc->remote_address.ss, addr, alen);
+	rc = -EINPROGRESS;
 
 	if (smc->use_fallback) {
 		sock->state = rc ? SS_CONNECTING : SS_CONNECTED;
@@ -2452,9 +2484,17 @@ static int smc_listen(struct socket *sock, int backlog)
 {
 	struct sock *sk = sock->sk;
 	struct smc_sock *smc;
-	int rc;
+	int rc, val;
 
 	smc = smc_sk(sk);
+
+	/* enable TCP fast open on the server clcsock.
+	 * prototype code only; the queue length 5 is an arbitrary value
+	 */
+	val = 5;
+	smc->clcsock->ops->setsockopt(smc->clcsock, SOL_TCP,
+				      TCP_FASTOPEN, KERNEL_SOCKPTR(&val), sizeof(val));
+
 	lock_sock(sk);
 
 	rc = -EINVAL;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 5ed765e..ef18894 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -261,6 +261,14 @@ struct smc_sock {				/* smc sock container */
 	int			fallback_rsn;	/* reason for fallback */
 	u32			peer_diagnosis; /* decline reason from peer */
 	atomic_t                queued_smc_hs;  /* queued smc handshakes */
+
+	union {
+		struct sockaddr		addr;
+		struct sockaddr_in	v4;
+		struct sockaddr_in6	v6;
+		struct sockaddr_storage ss;
+	} remote_address;
+
 	struct inet_connection_sock_af_ops		af_ops;
 	const struct inet_connection_sock_af_ops	*ori_af_ops;
 						/* original af ops */
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index f9f3f59..f944c67 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -20,6 +20,7 @@
 #include <net/addrconf.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/route.h>
 
 #include "smc.h"
 #include "smc_core.h"
@@ -486,8 +487,7 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4,
 		return -ENODEV;
 
 	in_dev_for_each_ifa_rcu(ifa, in_dev) {
-		if (!inet_ifa_match(ipv4, ifa))
-			continue;
+		/* address match check removed for simplicity; prototype code only */
 		prop->prefix_len = inet_mask_len(ifa->ifa_mask);
 		prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask;
 		/* prop->ipv6_prefixes_cnt = 0; already done by memset before */
@@ -528,10 +528,10 @@ static int smc_clc_prfx_set6_rcu(struct dst_entry *dst,
 
 /* retrieve and set prefixes in CLC proposal msg */
 static int smc_clc_prfx_set(struct socket *clcsock,
+			    struct dst_entry *dst,
 			    struct smc_clc_msg_proposal_prefix *prop,
 			    struct smc_clc_ipv6_prefix *ipv6_prfx)
 {
-	struct dst_entry *dst = sk_dst_get(clcsock->sk);
 	struct sockaddr_storage addrs;
 	struct sockaddr_in6 *addr6;
 	struct sockaddr_in *addr;
@@ -802,7 +802,8 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version)
 }
 
 /* send CLC PROPOSAL message across internal TCP socket */
-int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
+int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
+				       struct dst_entry *dst, struct smc_init_info *ini)
 {
 	struct smc_clc_smcd_v2_extension *smcd_v2_ext;
 	struct smc_clc_msg_proposal_prefix *pclc_prfx;
@@ -838,7 +839,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 
 	/* retrieve ip prefixes for CLC proposal msg */
 	if (ini->smc_type_v1 != SMC_TYPE_N) {
-		rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx);
+		rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx);
 		if (rc) {
 			if (ini->smc_type_v2 == SMC_TYPE_N) {
 				kfree(pclc);
@@ -961,6 +962,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 	}
 	vec[i].iov_base = trl;
 	vec[i++].iov_len = sizeof(*trl);
+
+	msg.msg_flags	|= MSG_FASTOPEN;
+	msg.msg_name	= &smc->remote_address.addr;
+	msg.msg_namelen = sizeof(struct sockaddr_in);
+
 	/* due to the few bytes needed for clc-handshake this cannot block */
 	len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen);
 	if (len < 0) {
@@ -975,6 +981,22 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 	return reason_code;
 }
 
+int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct sock *tsk = smc->clcsock->sk;
+	struct rtable *rt;
+	int rc;
+
+	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
+			     0, 0, 0);
+
+	if (IS_ERR(rt))
+		return -ECONNRESET;
+
+	rc = smc_clc_send_proposal_with_nexthop(smc, &rt->dst, ini);
+	return rc;
+}
+
 /* build and send CLC CONFIRM / ACCEPT message */
 static int smc_clc_send_confirm_accept(struct smc_sock *smc,
 				       struct smc_clc_msg_accept_confirm_v2 *clc_v2,
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index f40f6ed..ef5e5411 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1765,6 +1765,8 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini)
 	int rc = 0;
 
 	ini->vlan_id = 0;
+	/* prototype code only: skip the VLAN lookup for simplicity */
+	return 0;
 	if (!dst) {
 		rc = -ENOTCONN;
 		goto out;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 7055ed1..6aa3304 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -1064,8 +1064,8 @@ static void smc_pnet_find_rdma_dev(struct net_device *netdev,
  * If nothing found, check pnetid table.
  * If nothing found, try to use handshake device
  */
-static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
-					 struct smc_init_info *ini)
+void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
+				  struct smc_init_info *ini)
 {
 	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
 	struct net *net;
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index 80a88ee..2ffaf22 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -67,4 +67,7 @@ void smc_pnet_find_alt_roce(struct smc_link_group *lgr,
 			    struct smc_ib_device *known_dev);
 bool smc_pnet_is_ndev_pnetid(struct net *net, u8 *pnetid);
 bool smc_pnet_is_pnetid_set(u8 *pnetid);
+
+void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
+				  struct smc_init_info *ini);
 #endif
-- 
1.8.3.1


* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-24  6:52 [RFC net-next] net/smc:introduce 1RTT to SMC D. Wythe
@ 2022-05-24  7:49 ` Tony Lu
  2022-05-25 13:42   ` Alexandra Winter
  2022-05-24 12:40 ` kernel test robot
  1 sibling, 1 reply; 13+ messages in thread
From: Tony Lu @ 2022-05-24  7:49 UTC
  To: D. Wythe; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Hi Karsten,
> 
> We are promoting SMC-R in the field of cloud computing. Due to the
> particularity of business on the cloud, the scale and types of
> customer applications are unpredictable. As a participant in SMC-R, we
> also hope that SMC-R can cover more application scenarios, and many
> connection problems have been exposed along the way. There are two
> main issues: one is that establishing a single connection takes
> longer than with TCP, and the other is that the degree of concurrency
> is low under multi-connection processing. This patch set mainly
> addresses the first issue; follow-up work on the second issue
> will be posted in the future.
> 
> In terms of the communication process, under the current implementation
> a TCP three-way handshake needs only 1-RTT, while SMC-R currently
> requires 4-RTT: 2-RTT over IP (TCP handshake, SMC proposal & accept)
> and 2-RTT over IB (two RKEY exchanges). This is currently the most
> influential factor in connection establishment time.
> 
> We have noticed that a single network interface card is mainstream on
> the cloud, due to cloud deployment cost advantages and the cloud's
> own disaster recovery support. In addition, the emergence of RoCE
> LAG technology means we no longer need to deal with multiple RDMA
> network interface cards ourselves, just as with NIC bonding. At
> Alibaba, RoCE LAG is widely used for RDMA.

I think it is an interesting question whether we need SMC-level link
redundancy. I agree that RoCE LAG and the RDMA offerings of cloud
vendors handle redundancy and failover in the lower layer,
transparently to SMC.

So, moving on: if an RDMA device has redundancy capability, we could
make SMC simpler by providing an option, set from user space or based
on the device capability (if we have such a flag). This would let the
lower layer ensure the reliability of the link group.

As RFC 7609 mentions, we should do some extra work for reliability by
adding a link. This should be optional if the device is capable of
redundancy, which would make the link group simpler and faster (the
so-called SMC-2RTT in this RFC).

I also notice that RFC 7609 was released in August 2015, which is
earlier than RoCE LAG. RoCE LAG was provided starting with
ConnectX-3/ConnectX-3 Pro in kernel 4.0, and became available in 2017.
The RFC also predates cloud vendors' RDMA adapters, such as the Alibaba
Elastic RDMA adapter in [1].

Given that, I propose making the second link optional in newly created
link groups. Also, if possible, RFC 7609 could be updated or extended
to cover this present-day case.

Looking forward to hearing from you, Karsten, D. Wythe, and folks.

[1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/

Thanks,
Tony Lu
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-24  6:52 [RFC net-next] net/smc:introduce 1RTT to SMC D. Wythe
  2022-05-24  7:49 ` Tony Lu
@ 2022-05-24 12:40 ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-05-24 12:40 UTC (permalink / raw)
  To: D. Wythe; +Cc: llvm, kbuild-all

Hi Wythe,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/D-Wythe/net-smc-introduce-1RTT-to-SMC/20220524-145404
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 677fb7525331375ba2f90f4bc94a80b9b6e697a3
config: mips-randconfig-r014-20220524 (https://download.01.org/0day-ci/archive/20220524/202205242056.SUbsT3nP-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 10c9ecce9f6096e18222a331c5e7d085bd813f75)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/4658b14876ebb1577289bd0bc4c2b7fccd1af486
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review D-Wythe/net-smc-introduce-1RTT-to-SMC/20220524-145404
        git checkout 4658b14876ebb1577289bd0bc4c2b7fccd1af486
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash net/smc/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/smc/smc_clc.c:805:5: warning: no previous prototype for function 'smc_clc_send_proposal_with_nexthop' [-Wmissing-prototypes]
   int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
       ^
   net/smc/smc_clc.c:805:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
   ^
   static 
   1 warning generated.
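
The warning above fires because a non-static function is defined with no declaration in scope. As a minimal, self-contained sketch of the two usual remedies (the names send_with_nexthop and checked_payload are hypothetical stand-ins, not the SMC functions; in the kernel the prototype would presumably go into a shared header, which is an assumption about the eventual fix):

```c
#include <assert.h>

/* Remedy 1: make a prototype visible before the definition. In real code
 * this declaration would sit in a header included by all users. */
int send_with_nexthop(int dst, int payload);

/* Remedy 2: a helper used only inside this translation unit is declared
 * static, so no external prototype is expected. */
static int checked_payload(int payload)
{
	return payload < 0 ? 0 : payload;
}

/* Hypothetical stand-in: returns dst plus the clamped payload. */
int send_with_nexthop(int dst, int payload)
{
	assert(dst >= 0);
	return dst + checked_payload(payload);
}
```

Either form should silence -Wmissing-prototypes in a W=1 build; which one applies depends on whether the function is called from other files.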


vim +/smc_clc_send_proposal_with_nexthop +805 net/smc/smc_clc.c

   803	
   804	/* send CLC PROPOSAL message across internal TCP socket */
 > 805	int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
   806					       struct dst_entry *dst, struct smc_init_info *ini)
   807	{
   808		struct smc_clc_smcd_v2_extension *smcd_v2_ext;
   809		struct smc_clc_msg_proposal_prefix *pclc_prfx;
   810		struct smc_clc_msg_proposal *pclc_base;
   811		struct smc_clc_smcd_gid_chid *gidchids;
   812		struct smc_clc_msg_proposal_area *pclc;
   813		struct smc_clc_ipv6_prefix *ipv6_prfx;
   814		struct smc_clc_v2_extension *v2_ext;
   815		struct smc_clc_msg_smcd *pclc_smcd;
   816		struct smc_clc_msg_trail *trl;
   817		int len, i, plen, rc;
   818		int reason_code = 0;
   819		struct kvec vec[8];
   820		struct msghdr msg;
   821	
   822		pclc = kzalloc(sizeof(*pclc), GFP_KERNEL);
   823		if (!pclc)
   824			return -ENOMEM;
   825	
   826		pclc_base = &pclc->pclc_base;
   827		pclc_smcd = &pclc->pclc_smcd;
   828		pclc_prfx = &pclc->pclc_prfx;
   829		ipv6_prfx = pclc->pclc_prfx_ipv6;
   830		v2_ext = &pclc->pclc_v2_ext;
   831		smcd_v2_ext = &pclc->pclc_smcd_v2_ext;
   832		gidchids = pclc->pclc_gidchids;
   833		trl = &pclc->pclc_trl;
   834	
   835		pclc_base->hdr.version = SMC_V2;
   836		pclc_base->hdr.typev1 = ini->smc_type_v1;
   837		pclc_base->hdr.typev2 = ini->smc_type_v2;
   838		plen = sizeof(*pclc_base) + sizeof(*pclc_smcd) + sizeof(*trl);
   839	
   840		/* retrieve ip prefixes for CLC proposal msg */
   841		if (ini->smc_type_v1 != SMC_TYPE_N) {
   842			rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx);
   843			if (rc) {
   844				if (ini->smc_type_v2 == SMC_TYPE_N) {
   845					kfree(pclc);
   846					return SMC_CLC_DECL_CNFERR;
   847				}
   848				pclc_base->hdr.typev1 = SMC_TYPE_N;
   849			} else {
   850				pclc_base->iparea_offset = htons(sizeof(*pclc_smcd));
   851				plen += sizeof(*pclc_prfx) +
   852						pclc_prfx->ipv6_prefixes_cnt *
   853						sizeof(ipv6_prfx[0]);
   854			}
   855		}
   856	
   857		/* build SMC Proposal CLC message */
   858		memcpy(pclc_base->hdr.eyecatcher, SMC_EYECATCHER,
   859		       sizeof(SMC_EYECATCHER));
   860		pclc_base->hdr.type = SMC_CLC_PROPOSAL;
   861		if (smcr_indicated(ini->smc_type_v1)) {
   862			/* add SMC-R specifics */
   863			memcpy(pclc_base->lcl.id_for_peer, local_systemid,
   864			       sizeof(local_systemid));
   865			memcpy(pclc_base->lcl.gid, ini->ib_gid, SMC_GID_SIZE);
   866			memcpy(pclc_base->lcl.mac, &ini->ib_dev->mac[ini->ib_port - 1],
   867			       ETH_ALEN);
   868		}
   869		if (smcd_indicated(ini->smc_type_v1)) {
   870			/* add SMC-D specifics */
   871			if (ini->ism_dev[0]) {
   872				pclc_smcd->ism.gid = htonll(ini->ism_dev[0]->local_gid);
   873				pclc_smcd->ism.chid =
   874					htons(smc_ism_get_chid(ini->ism_dev[0]));
   875			}
   876		}
   877		if (ini->smc_type_v2 == SMC_TYPE_N) {
   878			pclc_smcd->v2_ext_offset = 0;
   879		} else {
   880			struct smc_clc_eid_entry *ueident;
   881			u16 v2_ext_offset;
   882	
   883			v2_ext->hdr.flag.release = SMC_RELEASE;
   884			v2_ext_offset = sizeof(*pclc_smcd) -
   885				offsetofend(struct smc_clc_msg_smcd, v2_ext_offset);
   886			if (ini->smc_type_v1 != SMC_TYPE_N)
   887				v2_ext_offset += sizeof(*pclc_prfx) +
   888							pclc_prfx->ipv6_prefixes_cnt *
   889							sizeof(ipv6_prfx[0]);
   890			pclc_smcd->v2_ext_offset = htons(v2_ext_offset);
   891			plen += sizeof(*v2_ext);
   892	
   893			read_lock(&smc_clc_eid_table.lock);
   894			v2_ext->hdr.eid_cnt = smc_clc_eid_table.ueid_cnt;
   895			plen += smc_clc_eid_table.ueid_cnt * SMC_MAX_EID_LEN;
   896			i = 0;
   897			list_for_each_entry(ueident, &smc_clc_eid_table.list, list) {
   898				memcpy(v2_ext->user_eids[i++], ueident->eid,
   899				       sizeof(ueident->eid));
   900			}
   901			read_unlock(&smc_clc_eid_table.lock);
   902		}
   903		if (smcd_indicated(ini->smc_type_v2)) {
   904			u8 *eid = NULL;
   905	
   906			v2_ext->hdr.flag.seid = smc_clc_eid_table.seid_enabled;
   907			v2_ext->hdr.ism_gid_cnt = ini->ism_offered_cnt;
   908			v2_ext->hdr.smcd_v2_ext_offset = htons(sizeof(*v2_ext) -
   909					offsetofend(struct smc_clnt_opts_area_hdr,
   910						    smcd_v2_ext_offset) +
   911					v2_ext->hdr.eid_cnt * SMC_MAX_EID_LEN);
   912			smc_ism_get_system_eid(&eid);
   913			if (eid && v2_ext->hdr.flag.seid)
   914				memcpy(smcd_v2_ext->system_eid, eid, SMC_MAX_EID_LEN);
   915			plen += sizeof(*smcd_v2_ext);
   916			if (ini->ism_offered_cnt) {
   917				for (i = 1; i <= ini->ism_offered_cnt; i++) {
   918					gidchids[i - 1].gid =
   919						htonll(ini->ism_dev[i]->local_gid);
   920					gidchids[i - 1].chid =
   921						htons(smc_ism_get_chid(ini->ism_dev[i]));
   922				}
   923				plen += ini->ism_offered_cnt *
   924					sizeof(struct smc_clc_smcd_gid_chid);
   925			}
   926		}
   927		if (smcr_indicated(ini->smc_type_v2))
   928			memcpy(v2_ext->roce, ini->smcrv2.ib_gid_v2, SMC_GID_SIZE);
   929	
   930		pclc_base->hdr.length = htons(plen);
   931		memcpy(trl->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
   932	
   933		/* send SMC Proposal CLC message */
   934		memset(&msg, 0, sizeof(msg));
   935		i = 0;
   936		vec[i].iov_base = pclc_base;
   937		vec[i++].iov_len = sizeof(*pclc_base);
   938		vec[i].iov_base = pclc_smcd;
   939		vec[i++].iov_len = sizeof(*pclc_smcd);
   940		if (ini->smc_type_v1 != SMC_TYPE_N) {
   941			vec[i].iov_base = pclc_prfx;
   942			vec[i++].iov_len = sizeof(*pclc_prfx);
   943			if (pclc_prfx->ipv6_prefixes_cnt > 0) {
   944				vec[i].iov_base = ipv6_prfx;
   945				vec[i++].iov_len = pclc_prfx->ipv6_prefixes_cnt *
   946						   sizeof(ipv6_prfx[0]);
   947			}
   948		}
   949		if (ini->smc_type_v2 != SMC_TYPE_N) {
   950			vec[i].iov_base = v2_ext;
   951			vec[i++].iov_len = sizeof(*v2_ext) +
   952					   (v2_ext->hdr.eid_cnt * SMC_MAX_EID_LEN);
   953			if (smcd_indicated(ini->smc_type_v2)) {
   954				vec[i].iov_base = smcd_v2_ext;
   955				vec[i++].iov_len = sizeof(*smcd_v2_ext);
   956				if (ini->ism_offered_cnt) {
   957					vec[i].iov_base = gidchids;
   958					vec[i++].iov_len = ini->ism_offered_cnt *
   959						sizeof(struct smc_clc_smcd_gid_chid);
   960				}
   961			}
   962		}
   963		vec[i].iov_base = trl;
   964		vec[i++].iov_len = sizeof(*trl);
   965	
   966		msg.msg_flags	|= MSG_FASTOPEN;
   967		msg.msg_name	= &smc->remote_address.addr;
   968		msg.msg_namelen = sizeof(struct sockaddr_in);
   969	
   970		/* due to the few bytes needed for clc-handshake this cannot block */
   971		len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen);
   972		if (len < 0) {
   973			smc->sk.sk_err = smc->clcsock->sk->sk_err;
   974			reason_code = -smc->sk.sk_err;
   975		} else if (len < ntohs(pclc_base->hdr.length)) {
   976			reason_code = -ENETUNREACH;
   977			smc->sk.sk_err = -reason_code;
   978		}
   979	
   980		kfree(pclc);
   981		return reason_code;
   982	}
   983	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-24  7:49 ` Tony Lu
@ 2022-05-25 13:42   ` Alexandra Winter
  2022-05-26  3:47     ` D. Wythe
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Alexandra Winter @ 2022-05-25 13:42 UTC (permalink / raw)
  To: Tony Lu, D. Wythe; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma



On 24.05.22 09:49, Tony Lu wrote:
> On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> Hi Karsten,
>>
>> We are promoting SMC-R to the field of cloud computing, dues to the
>> particularity of business on the cloud, the scale and the types of
>> customer applications are unpredictable. As a participant of SMC-R, we
>> also hope that SMC-R can cover more application scenarios. Therefore,
>> many connection problems are exposed during this time. There are two
>> main issue, one is that the establishment of a single connection takes
>> longer than that of the TCP, another is that the degree of concurrency
>> is low under multi-connection processing. This patch set is mainly
>> optimized for the first issue, and the follow-up of the second issue
>> will be synchronized in the future.
>>
>> In terms of communication process, under current implement, a TCP
>> three-way handshake only needs 1-RTT time, while SMC-R currently
>> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC
>> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which
>> is most influential factor affecting connection established time at the
>> moment.
>>
>> We have noticed that single network interface card is mainstream on the
>> cloud, dues to the advantages of cloud deployment costs and the cloud's
>> own disaster recovery support. On the other hand, the emergence of RoCE
>> LAG technology makes us no longer need to deal with multiple RDMA
>> network interface cards by ourselves,  just like NIC bonding does. In
>> Alibaba, Roce LAG is widely used for RDMA.
> 
> I think this is an interesting topic whether we need SMC-level link
> redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle
> redundancy and failover in the lower layer, and do it transparently for
> SMC.
> 
> So let's move on, if a RDMA device has redundancy ability, we could make
> SMC simpler by give an option for user-space or based on the device
> capability (if we have this flag). This allows under layer to ensure the
> reliability of link group.
> 
> As RFC 7609 mentioned, we should do some extra work for reliability to
> add link. It should be an optional work if the device have capability
> for redundancy, and make link group simpler and faster (for the
> so-called SMC-2RTT in this RFC).
> 
> I also notice that RFC 7609 is released on August 2015, which is earlier
> than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in
> kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters,
> such as Alibaba Elastic RDMA adapter in [1].
> 
> Given that, I propose whether the second link can be used as an option
> in newly created link group. Also, if it is possible, RFC 7609 can be
> updated or extend it for this nowadays case.
> 
> Looking forward for your message, Karsten, D. Wythe and folks.
> 
> [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/
> 
> Thanks,
> Tony Lu
>  
Thank you D. Wythe for your proposals, the prototype and measurements.
They sound quite promising to us.

We need to carefully evaluate them and make sure everything is compatible
with the existing implementations of SMC-D and SMC-R v1 and v2. In the
typical s390 environment RoCE LAG is probably not good enough, as the card
is still a single point of failure. So your ideas need to be compatible
with link redundancy. We also need to consider that the extension of the
protocol does not block other desirable extensions.

Your prototype is very helpful for the understanding. Before submitting any
code patches to net-next, we should agree on the details of the protocol
extension. Maybe you could formulate your proposal in plain text, so we can
discuss it here? 

We also need to inform you that several public holidays are upcoming in the
next weeks and several of our team will be out for summer vacation, so please
allow for longer response times.

Kind regards
Alexandra Winter

>> In that case, SMC-R have only one single link, if so, the RKEY LLC
>> messages that to perform information exchange in all links are no longer
>> needed, the SMC Proposal & accept has already complete the exchange of
>> all information needed. So we think that we can remove the RKEY exchange
>> in that case, which will save us 2-RTT over IB. We call it as SMC-R 2-RTT.
>>
>> On the other hand, we can use TCP fast open, carry the SMC proposal data
>> by TCP SYN message, reduce the time that the SMC waits for the TCP
>> connection to be established. This will save us another 1-RTT over IP.
>>
>> Based on the above two viewpoints, in this scenario, we can compress the
>> communication process of SMC-R into 1-RTT over IP, so that we can
>> theoretically obtain a time close to that of TCP connection
>> establishment. We call it as SMC-R 1-RTT. Of course, the specific results
>> will also be affected by the implementation.
>>
>> In our test environment, we host two VMs on the same host for wrk/nginx
>> tests, used a script similar to the following to performing test:
>>
>> Client.sh
>>
>> conn=$1
>> thread=$2
>>
>> wrk -H 'Connection: Close' -c ${conn} -t ${thread} -d 10
>>
>> Server.sh
>>
>> sysctl -w net.ipv4.tcp_fastopen=3
>> smc_run nginx
>>
>> Statistic shows that:
>>
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>> |type|args  |   -c1 -t1     |   -c2 -t1     |   -c5 -t1      |  -c10 -t1    |   -c200 -t1    |  -c200 -t4    |  -c2000 -t8   |
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>> |net-next   |   4188.5qps   |   5942.04qps  |   7621.81qps   |  7678.62qps  |   8204.94qps   |  8457.57qps   |  5687.60qps   |
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>> |SMC-2RTT   |   4730.17qps  |   7394.85qps  |   11532.78qps  |  12016.22qps |   11520.81qps  |  11391.36qps  |  10364.41qps  |
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>> |SMC-1RTT   |   5702.77qps  |   9645.18qps  |   11899.20qps  |  12005.16qps |   11536.67qps  |  11420.87qps  |  10392.4qps   |
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>> |TCP        |   6415.74qps  |   11034.10qps |   16716.21qps  |  22217.06qps |   35926.74qps  |  117460.qps   |  120291.16qps |
>> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
>>
>> It can clearly be seen that:
>>
>> 1. In step by step short-link scenarios ( -c1 -t1 ), SMC-R after
>> optimization can reach 88% of TCP. There are still many implementation
>> details that can be optimized, we hope to optimize the performance of
>> SMC in this scenario to 90% of TCP.
>>
>> 2. The problem is very serious in the scenario of multi-threading and
>> multi-connection, the worst case is only 10% of TCP. Even though the
>> SMC-1RTT has certain optimizations for this scenario, it is clear that
>> the bottleneck is not here. We are doing some prototyping to solve this,
>> we hope to reach 60% of TCP in multi-threading and multi-connection
>> scenarios, and SMC-1RTT is the important prerequisite for upper limit of
>> subsequent optimization.
>>
>> In this patch set, we had only completed a simple prototype, only make
>> sure SMC-1RTT can works.
>>
>> Sincerely, we are looking forward for you comments, please
>> let us know if you have any suggestions.
>>
>> Thanks.
>>
>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>> ---
--------8<  snip  >8-------- 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-25 13:42   ` Alexandra Winter
@ 2022-05-26  3:47     ` D. Wythe
  2022-05-26  7:04     ` Tony Lu
  2022-06-01  6:33     ` D. Wythe
  2 siblings, 0 replies; 13+ messages in thread
From: D. Wythe @ 2022-05-26  3:47 UTC (permalink / raw)
  To: Alexandra Winter, Tony Lu
  Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma



On 2022/5/25 9:42 PM, Alexandra Winter wrote:

> Thank you D. Wythe for your proposals, the prototype and measurements.
> They sound quite promising to us.
> 
> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment RoCE LAG is probably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here?

I am very pleased to hear that your team is interested in this
proposal, and thanks a lot for your advice. We really appreciate your
point of view on compatibility. In fact, we are working on a written
draft in which compatibility is quite an important part, and it will
be shared here soon.

> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.

Thanks for letting us know, that's totally okay with us. May your holidays
be full of warmth and cheer.


> Kind regards
> Alexandra Winter
> 


D. Wythe
Thanks.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-25 13:42   ` Alexandra Winter
  2022-05-26  3:47     ` D. Wythe
@ 2022-05-26  7:04     ` Tony Lu
  2022-06-01  6:33     ` D. Wythe
  2 siblings, 0 replies; 13+ messages in thread
From: Tony Lu @ 2022-05-26  7:04 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: D. Wythe, kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Wed, May 25, 2022 at 03:42:28PM +0200, Alexandra Winter wrote:
> 
> 
> On 24.05.22 09:49, Tony Lu wrote:
> > On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
> >> From: "D. Wythe" <alibuda@linux.alibaba.com>
> >>
> >> Hi Karsten,
> >>
> >> We are promoting SMC-R to the field of cloud computing, dues to the
> >> particularity of business on the cloud, the scale and the types of
> >> customer applications are unpredictable. As a participant of SMC-R, we
> >> also hope that SMC-R can cover more application scenarios. Therefore,
> >> many connection problems are exposed during this time. There are two
> >> main issue, one is that the establishment of a single connection takes
> >> longer than that of the TCP, another is that the degree of concurrency
> >> is low under multi-connection processing. This patch set is mainly
> >> optimized for the first issue, and the follow-up of the second issue
> >> will be synchronized in the future.
> >>
> >> In terms of communication process, under current implement, a TCP
> >> three-way handshake only needs 1-RTT time, while SMC-R currently
> >> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC
> >> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which
> >> is most influential factor affecting connection established time at the
> >> moment.
> >>
> >> We have noticed that single network interface card is mainstream on the
> >> cloud, dues to the advantages of cloud deployment costs and the cloud's
> >> own disaster recovery support. On the other hand, the emergence of RoCE
> >> LAG technology makes us no longer need to deal with multiple RDMA
> >> network interface cards by ourselves,  just like NIC bonding does. In
> >> Alibaba, Roce LAG is widely used for RDMA.
> > 
> > I think this is an interesting topic whether we need SMC-level link
> > redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle
> > redundancy and failover in the lower layer, and do it transparently for
> > SMC.
> > 
> > So let's move on, if a RDMA device has redundancy ability, we could make
> > SMC simpler by give an option for user-space or based on the device
> > capability (if we have this flag). This allows under layer to ensure the
> > reliability of link group.
> > 
> > As RFC 7609 mentioned, we should do some extra work for reliability to
> > add link. It should be an optional work if the device have capability
> > for redundancy, and make link group simpler and faster (for the
> > so-called SMC-2RTT in this RFC).
> > 
> > I also notice that RFC 7609 is released on August 2015, which is earlier
> > than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in
> > kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters,
> > such as Alibaba Elastic RDMA adapter in [1].
> > 
> > Given that, I propose whether the second link can be used as an option
> > in newly created link group. Also, if it is possible, RFC 7609 can be
> > updated or extend it for this nowadays case.
> > 
> > Looking forward for your message, Karsten, D. Wythe and folks.
> > 
> > [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/
> > 
> > Thanks,
> > Tony Lu
> >  
> Thank you D. Wythe for your proposals, the prototype and measurements.
> They sound quite promising to us.
> 
> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment RoCE LAG is probably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here? 
> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.
> 
> Kind regards
> Alexandra Winter
> 
We are glad to hear this. It gives us a lot of confidence to keep
working on it, thank you.

Cheers,
Tony Lu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-05-25 13:42   ` Alexandra Winter
  2022-05-26  3:47     ` D. Wythe
  2022-05-26  7:04     ` Tony Lu
@ 2022-06-01  6:33     ` D. Wythe
  2022-06-01  9:24       ` Tony Lu
  2022-06-16 13:49       ` D. Wythe
  2 siblings, 2 replies; 13+ messages in thread
From: D. Wythe @ 2022-06-01  6:33 UTC (permalink / raw)
  To: Alexandra Winter, Karsten Graul, Tony Lu
  Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma


On 2022/5/25 9:42 PM, Alexandra Winter wrote:

> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment RoCE LAG is probably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here?
> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.
> 
> Kind regards
> Alexandra Winter
> 

Hi all,

In order to achieve single-link compatibility, we must
complete at least one negotiation. We also wish to provide
better extensibility while supporting this feature. There are
a few ways to achieve this.

1. Use the available reserved bits. According to
the SMC v2 protocol, there are at least 28 reserved octets
in the PROPOSAL MESSAGE and at least 10 reserved octets in
the ACCEPT MESSAGE. We can define a feature area there that
works like a bitmap. Considering future scalability, we MAY
use at least 2 reserved octets, which can support
negotiation of at least 16 features.

2. Unify all the areas named extension in the current
SMC v2 protocol spec, without reinterpreting any existing field
or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
the ability to grow dynamically as needs expand. This scheme will use
at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4
reserved octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we
only need to use reserved fields, and the current reserved fields are
sufficient. We can then easily add a new extension named SINGLE
LINK. Due to limited space, the details will be elaborated after the scheme
is finalized.
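
To illustrate option 1, here is a minimal sketch of a 16-bit feature area carried in 2 reserved octets. The bit position of the single-link feature and all names (smc_feat_set, smc_feat_negotiate, smc_feat_test) are assumptions made for this example only; nothing here is defined by the SMC v2 spec or the prototype patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature numbering; only single-link is discussed here. */
enum smc_feature_bit {
	SMC_FEAT_SINGLE_LINK = 0,	/* assumed bit position */
	SMC_FEAT_MAX = 16		/* 2 reserved octets give 16 bits */
};

/* Sender side: mark one feature as offered in the feature area. */
static uint16_t smc_feat_set(uint16_t mask, unsigned int bit)
{
	assert(bit < SMC_FEAT_MAX);
	return (uint16_t)(mask | (1u << bit));
}

/* Receiver side: keep only features that were offered AND are supported
 * locally. A peer that leaves the reserved octets zero yields 0 here. */
static uint16_t smc_feat_negotiate(uint16_t offered, uint16_t supported)
{
	return (uint16_t)(offered & supported);
}

/* Returns 1 if the given feature bit is set in the mask, 0 otherwise. */
static int smc_feat_test(uint16_t mask, unsigned int bit)
{
	assert(bit < SMC_FEAT_MAX);
	return (int)((mask >> bit) & 1u);
}
```

Because an implementation that does not know the feature area is expected to send the reserved octets as zero, the intersection is empty for such a peer and both sides can keep the existing behaviour.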

But no matter what scheme is finalized, the workflow should be similar to:

Allow Single-link:

client							    server
	proposal with Single-link feature bit or extension
		-------->

	accept with Single-link feature bit or extension
		<--------
		
		confirm
		-------->


Deny or not recognized:

client							     server
	proposal with Single-link feature bit or extension
		-------->

		rkey confirm
		<------
		------>

	accept without Single-link feature bit or extension
		<------

		rkey confirm
		------->
		<------
		
		confirm
		------->


Look forward to your advice and comments.

Thanks.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-01  6:33     ` D. Wythe
@ 2022-06-01  9:24       ` Tony Lu
  2022-06-01 11:35         ` Alexandra Winter
  2022-06-16 13:49       ` D. Wythe
  1 sibling, 1 reply; 13+ messages in thread
From: Tony Lu @ 2022-06-01  9:24 UTC (permalink / raw)
  To: D. Wythe
  Cc: Alexandra Winter, Karsten Graul, kuba, davem, netdev, linux-s390,
	linux-rdma

On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
> 
> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> 
> > We need to carefully evaluate them and make sure everything is compatible
> > with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> > typical s390 environment RoCE LAG is probably not good enough, as the card
> > is still a single point of failure. So your ideas need to be compatible
> > with link redundancy. We also need to consider that the extension of the
> > protocol does not block other desirable extensions.
> > 
> > Your prototype is very helpful for the understanding. Before submitting any
> > code patches to net-next, we should agree on the details of the protocol
> > extension. Maybe you could formulate your proposal in plain text, so we can
> > discuss it here?
> > 
> > We also need to inform you that several public holidays are upcoming in the
> > next weeks and several of our team will be out for summer vacation, so please
> > allow for longer response times.
> > 
> > Kind regards
> > Alexandra Winter
> > 
> 
> Hi all,
> 
> In order to achieve single-link compatibility, we must
> complete at least one negotiation. We also wish to provide
> better extensibility while supporting this feature. There are
> a few ways to achieve this.
> 
> 1. Use the available reserved bits. According to
> the SMC v2 protocol, there are at least 28 reserved octets
> in the PROPOSAL MESSAGE and at least 10 reserved octets in
> the ACCEPT MESSAGE. We can define a feature area there that
> works like a bitmap. Considering future scalability, we MAY
> use at least 2 reserved octets, which can support
> negotiation of at least 16 features.
> 
> 2. Unify all the areas named extension in the current
> SMC v2 protocol spec, without reinterpreting any existing field
> or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
> the ability to grow dynamically as needs expand. This scheme will use
> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
> octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
> use reserved fields, and the current reserved fields are sufficient. We can
> then easily add a new extension named SINGLE LINK. Due to limited space,
> the details will be elaborated after the scheme is finalized.

After reading this and the latest version of the protocol, I agree with the
idea of providing more flexible extension facilities. It is also a good
chance for us to discuss the protocol extension here.

There are some potential scenarios that need flexible extensions in my
mind:
- support for other protocols, such as iWARP / IB, or a new protocol version,
- dozens of feature flags in the future, like this proposal. As new
  features are added, they could overflow the bitmap.

Actually, these extension facilities are very similar to TCP options.

So what is your opinion on this solution? If there are
existing approaches for future extensions, maybe this can be
included in them. Or we can start a discussion about this as this mail
mentioned.

Also, I am wondering if there is a plan to update RFC 7609 to add the
latest v2 support?

Thanks,
Tony Lu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-01  9:24       ` Tony Lu
@ 2022-06-01 11:35         ` Alexandra Winter
  2022-06-02  3:26           ` D. Wythe
  0 siblings, 1 reply; 13+ messages in thread
From: Alexandra Winter @ 2022-06-01 11:35 UTC (permalink / raw)
  To: Tony Lu, D. Wythe
  Cc: Karsten Graul, kuba, davem, netdev, linux-s390, linux-rdma



On 01.06.22 11:24, Tony Lu wrote:
> On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
>>
>> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
>>
>>> We need to carefully evaluate them and make sure everything is compatible
>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>>> typical s390 environment ROCE LAG is probably not good enough, as the card
>>> is still a single point of failure. So your ideas need to be compatible
>>> with link redundancy. We also need to consider that the extension of the
>>> protocol does not block other desirable extensions.
>>>
>>> Your prototype is very helpful for the understanding. Before submitting any
>>> code patches to net-next, we should agree on the details of the protocol
>>> extension. Maybe you could formulate your proposal in plain text, so we can
>>> discuss it here?
>>>
>>> We also need to inform you that several public holidays are upcoming in the
>>> next weeks and several of our team will be out for summer vacation, so please
>>> allow for longer response times.
>>>
>>> Kind regards
>>> Alexandra Winter
>>>
>>
>> Hi all,
>>
>> In order to achieve single-link compatibility, we must
>> complete at least one negotiation. We wish to provide
>> higher scalability while meeting this feature. There are
>> few ways to reach this.
>>
>> 1. Use the available reserved bits. According to
>> the SMC v2 protocol, there are at least 28 reserved octets
>> in PROPOSAL MESSAGE and at least 10 reserved octets in
>> ACCEPT MESSAGE are available. We can define an area in which
>> as a feature area, works like bitmap. Considering the subsequent
>> scalability, we MAY use at least 2 reserved octets, which can support
>> negotiation of at least 16 features.
>>
>> 2. Unify all the areas named extension in current
>> SMC v2 protocol spec without reinterpreting any existing field
>> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc. This provides
>> the ability to grow dynamically as needs expand. This scheme will use
>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
>> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
>> use reserved fields, and the current reserved fields are sufficient. And
>> then we can easily add a new extension named SINGLE LINK. Limited by space,
>> the details will be elaborated after the scheme is finalized.
> 
> After reading this and latest version of protocol, I agree with that the
> idea to provide a more flexible extension facilities. And, it's a good
> chance for us to set here talking about the protocol extension.
> 
> There are some potential scenarios that need flexible extensions in my
> mind:
> - other protocols support, such as iWARP / IB or new version protocol,
> - dozens of feature flags in the future, like this proposal. With the
>   growth of new feature, it could overflow bitmap.
> 
> Actually, this extension facilities are very similar to TCP options.
> 
> So what about your opinions about the solution of this? If there are
> some existed approaches for the future extensions, maybe this can get
> involved in it. Or we can start a discuss about this as this mail
> mentioned.
> 
> Also, I am wondering if there is plan to update the RFC7609, add the
> latest v2 support?
> 
> Thanks,
> Tony Lu

We have asked the SMC protocol owners for their opinion on using the
reserved fields for new options in particular, and on where and how to
discuss this in general (including where to document the versions).
Please allow some time for us to come back to you.

Kind regards
Alexandra


* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-01 11:35         ` Alexandra Winter
@ 2022-06-02  3:26           ` D. Wythe
  0 siblings, 0 replies; 13+ messages in thread
From: D. Wythe @ 2022-06-02  3:26 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: Tony Lu, Karsten Graul, kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Jun 01, 2022 at 01:35:52PM +0200, Alexandra Winter wrote:
> 
> 
> On 01.06.22 11:24, Tony Lu wrote:
> > On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
> >>
> >> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> >>
> >>> We need to carefully evaluate them and make sure everything is compatible
> >>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> >>> typical s390 environment ROCE LAG is probably not good enough, as the card
> >>> is still a single point of failure. So your ideas need to be compatible
> >>> with link redundancy. We also need to consider that the extension of the
> >>> protocol does not block other desirable extensions.
> >>>
> >>> Your prototype is very helpful for the understanding. Before submitting any
> >>> code patches to net-next, we should agree on the details of the protocol
> >>> extension. Maybe you could formulate your proposal in plain text, so we can
> >>> discuss it here?
> >>>
> >>> We also need to inform you that several public holidays are upcoming in the
> >>> next weeks and several of our team will be out for summer vacation, so please
> >>> allow for longer response times.
> >>>
> >>> Kind regards
> >>> Alexandra Winter
> >>>
> >>
> >> Hi all,
> >>
> >> In order to achieve single-link compatibility, we must
> >> complete at least one negotiation. We wish to provide
> >> higher scalability while meeting this feature. There are
> >> few ways to reach this.
> >>
> >> 1. Use the available reserved bits. According to
> >> the SMC v2 protocol, there are at least 28 reserved octets
> >> in PROPOSAL MESSAGE and at least 10 reserved octets in
> >> ACCEPT MESSAGE are available. We can define an area in which
> >> as a feature area, works like bitmap. Considering the subsequent
> >> scalability, we MAY use at least 2 reserved octets, which can support
> >> negotiation of at least 16 features.
> >>
> >> 2. Unify all the areas named extension in current
> >> SMC v2 protocol spec without reinterpreting any existing field
> >> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
> >> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc. This provides
> >> the ability to grow dynamically as needs expand. This scheme will use
> >> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
> >> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
> >> use reserved fields, and the current reserved fields are sufficient. And
> >> then we can easily add a new extension named SINGLE LINK. Limited by space,
> >> the details will be elaborated after the scheme is finalized.
> > 
> > After reading this and latest version of protocol, I agree with that the
> > idea to provide a more flexible extension facilities. And, it's a good
> > chance for us to set here talking about the protocol extension.
> > 
> > There are some potential scenarios that need flexible extensions in my
> > mind:
> > - other protocols support, such as iWARP / IB or new version protocol,
> > - dozens of feature flags in the future, like this proposal. With the
> >   growth of new feature, it could overflow bitmap.
> > 
> > Actually, this extension facilities are very similar to TCP options.
> > 
> > So what about your opinions about the solution of this? If there are
> > some existed approaches for the future extensions, maybe this can get
> > involved in it. Or we can start a discuss about this as this mail
> > mentioned.
> > 
> > Also, I am wondering if there is plan to update the RFC7609, add the
> > latest v2 support?
> > 
> > Thanks,
> > Tony Lu
> 
> We have asked the SMC protocol owners about their opinion about using the
> reserved fields for new options in particular, and about where and how to
> discuss this in general. (including where to document the versions).
> Please allow some time for us to come back to you.
> 
> Kind regards
> Alexandra

Thank you for the information. Before we officially push the document update,
if you have any suggestions on the two schemes we mentioned above,
or a preference for one of them, please let us know.

Best wishes.
D. Wythe



* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-01  6:33     ` D. Wythe
  2022-06-01  9:24       ` Tony Lu
@ 2022-06-16 13:49       ` D. Wythe
  2022-06-23 11:59         ` Alexandra Winter
  1 sibling, 1 reply; 13+ messages in thread
From: D. Wythe @ 2022-06-16 13:49 UTC (permalink / raw)
  To: Alexandra Winter, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 2022/6/1 2:33 PM, D. Wythe wrote:
> 
> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> 
>> We need to carefully evaluate them and make sure everything is compatible
>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>> typical s390 environment ROCE LAG is probably not good enough, as the card
>> is still a single point of failure. So your ideas need to be compatible
>> with link redundancy. We also need to consider that the extension of the
>> protocol does not block other desirable extensions.
>>
>> Your prototype is very helpful for the understanding. Before submitting any
>> code patches to net-next, we should agree on the details of the protocol
>> extension. Maybe you could formulate your proposal in plain text, so we can
>> discuss it here?
>>
>> We also need to inform you that several public holidays are upcoming in the
>> next weeks and several of our team will be out for summer vacation, so please
>> allow for longer response times.
>>
>> Kind regards
>> Alexandra Winter
>>
> 
> Hi all,
> 
> In order to achieve single-link compatibility, we must
> complete at least one negotiation. We wish to provide
> higher scalability while meeting this feature. There are
> few ways to reach this.
> 
> 1. Use the available reserved bits. According to
> the SMC v2 protocol, there are at least 28 reserved octets
> in PROPOSAL MESSAGE and at least 10 reserved octets in
> ACCEPT MESSAGE are available. We can define an area in which
> as a feature area, works like bitmap. Considering the subsequent
> scalability, we MAY use at least 2 reserved octets, which can support
> negotiation of at least 16 features.
> 
> 2. Unify all the areas named extension in current
> SMC v2 protocol spec without reinterpreting any existing field
> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc. This provides
> the ability to grow dynamically as needs expand. This scheme will use
> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
> use reserved fields, and the current reserved fields are sufficient. And
> then we can easily add a new extension named SINGLE LINK. Limited by space,
> the details will be elaborated after the scheme is finalized.
> 
> But no matter what scheme is finalized, the workflow should be similar to:
> 
> Allow Single-link:
> 
> client                                server
>      proposal with Single-link feature bit or extension
>          -------->
> 
>      accept with Single-link feature bit or extension
>          <--------
> 
>          confirm
>          -------->
> 
> 
> Deny or not recognized:
> 
> client                                 server
>      proposal with Single-link feature bit or extension
>          -------->
> 
>          rkey confirm
>          <------
>          ------>
> 
>      accept without Single-link feature bit or extension
>          <------
> 
>          rkey confirm
>          ------->
>          <------
> 
>          confirm
>          ------->
> 
> 
> Look forward to your advice and comments.
> 
> Thanks.

Hi all,

Building on the previous proposal, if we can carry application data in the
PROPOSAL message, we can achieve SMC 0-RTT. The process should be similar to
the following:

client									server
	PROPOSAL MESSAGE
		with first contact
		with 0RTT query extension
		-------->

	ACCEPT MESSAGE
			with(or without)
			0RTT response extension
		<--------

	CONFIRM MESSAGE
		-------->

client									server
	PROPOSAL MESSAGE
		without	first contact
		with 0RTT Data
		-------->

	ACCEPT MESSAGE
		<---------

	CONFIRM MESSAGE
		-------->

If so, using reserved bits to exchange features is not enough. We have a simple
design that stays compatible with legacy extensions and supports future extensions.

This draft tries to unify all the areas named 'extension' in the current
SMC v2 protocol spec, including 'PROPOSAL V1 IP Subnet Extension',
'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION',
and 'First Contact Extension'.

This draft makes many design compromises in order to achieve compatibility.
I believe there must be better ways, so let me get the ball rolling. Please let
me know if you have any suggestions or better ideas. The draft design
is as follows:

SMC V2 CLC PROPOSAL MESSAGE:

+------+-------+------------------------------------------------------------+
|0     |50     |Not changed                                                 |
+------+-------+------------------------------------------------------------+
|50    |2      |SMC Version 2 Extension Offset(applicable when SMC V2)      |
+------+-------+------------------------------------------------------------+
|52    |19     |Reserved for growth                                         |
+------+-------+------------------------------------------------------------+
|71    |*      |Extension Area  (reserved before)                           |
+------+-------+------------------------------------------------------------+
|71    |2      |number of Extensions  (reserved before)                     |
+------+-------+------------------------------------------------------------+
|73    |7      |V1 IP Subnet Extension Header (when applicable)             |
+------+-------+------------------------------------------------------------+
|73    |7      |Padding Extension (when V1 IP Subnet Extension not present) |
+------+-------+------------------------------------------------------------+
|80    |*      |V1 IP Subnet Extension Payload (when applicable)            |
+------+-------+------------------------------------------------------------+
|      |       |V2 Extension (when applicable)                              |
+------+-------+------------------------------------------------------------+
|      |       |other available Extension (when applicable)                 |
+------+-------+------------------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end                     |
+------+-------+------------------------------------------------------------+

Notes:

     1. In the current implementation, the server reads the proposal message
with a fixed length; areas beyond that length are silently ignored, and the
server skips the eye catcher check. Therefore, it is safe to extend the message
at the tail.

     2. '(reserved before)' means that the area used to be reserved.

     3. None of the existing fields have their offsets changed
within the PROPOSAL message.


Extension Areas Format:

+------+-------+-----------------------+
|0     |*      |Extensions Area        |
+------+-------+-----------------------+
|0     |2      |Number of Extensions   |
+------+-------+-----------------------+
|2     |*      |Extensions             |
+------+-------+-----------------------+
|      |       |End of Extensions Area |
+------+-------+-----------------------+

notes:

     1. All extensions within the extension areas should be contiguous.


Extension Format:

+------+-------+----------------------------------------+
|0     |*      |Extension                               |
+------+-------+----------------------------------------+
|0     |6+     |Extension header                        |
+------+-------+----------------------------------------+
|0     |4      |reserved                                |
+------+-------+----------------------------------------+
|4     |*      |Extension Type (variable length)        |
+------+-------+----------------------------------------+
|*     |*      |Payload Length (variable length)        |
+------+-------+----------------------------------------+
|*     |*      |payload                                 |
+------+-------+----------------------------------------+

notes:

     1. This scheme was specially designed to be compatible with
'PROPOSAL V2 Extension', since it is the only extension with no
reserved octets ahead of it.

     2. Another special case is 'PROPOSAL SMC-DV2 EXTENSION'. It also
has no reserved octets ahead of it, but it can be treated as an
optional part of 'PROPOSAL V2 Extension'.

     3. To be compatible with 'PROPOSAL V2 Extension', there are only
2 reserved octets left to place the type and length fields. With one octet
per field, there can be at most 255 extension types and
a maximum length of 255. For better scalability, the type and length
fields are encoded as variable length integers.

variable length integer encoding:

+--------------+-------+---------------+--------+
|first bit     |octet  |Usable Bits    |Range   |
+--------------+-------+---------------+--------+
|0             |1      |7              |0-127   |
+--------------+-------+---------------+--------+
|1             |2      |15             |0-32767 |
+--------------+-------+---------------+--------+

notes:

     1. This design introduces some complexity, and we can drop it
entirely if we never need more than 255 extensions.
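
To make the variable length integer encoding concrete, here is a minimal
sketch of how the type and length fields could be written and read. It is not
part of the draft. It assumes that "first bit" means the most significant bit
of the first octet, that the two-octet form is big-endian, and that every
extension starts with the 4 reserved octets shown in the generic Extension
Format table. The names encode_varint, decode_varint and parse_extension are
hypothetical.

```python
from typing import Tuple

# Assumption: the 4 leading reserved octets of the generic extension header.
RESERVED_PREFIX_LEN = 4


def encode_varint(value: int) -> bytes:
    """Encode 0..32767 in one or two octets."""
    if not 0 <= value <= 0x7FFF:
        raise ValueError("value does not fit in 15 usable bits")
    if value <= 0x7F:
        # First bit 0: one octet, 7 usable bits.
        return bytes([value])
    # First bit 1: two octets, 15 usable bits (big-endian is an assumption).
    return bytes([0x80 | (value >> 8), value & 0xFF])


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, number of octets consumed)."""
    first = buf[offset]
    if first & 0x80 == 0:
        return first, 1
    if offset + 1 >= len(buf):
        raise ValueError("truncated two-octet integer")
    return ((first & 0x7F) << 8) | buf[offset + 1], 2


def parse_extension(buf: bytes, offset: int = 0) -> Tuple[int, bytes, int]:
    """Return (type, payload, offset of the next extension)."""
    pos = offset + RESERVED_PREFIX_LEN
    ext_type, used = decode_varint(buf, pos)
    pos += used
    length, used = decode_varint(buf, pos)
    pos += used
    if pos + length > len(buf):
        raise ValueError("payload runs past the end of the buffer")
    return ext_type, buf[pos:pos + length], pos + length
```

With this layout, a payload longer than 127 octets costs one extra octet in
the header, and existing extensions whose type and length fit in 7 bits keep
a 6-octet header.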

V1 IP Subnet Extension Format:

+------+-------+-------------------------------------------------+
|0     |7      |Extension Header                                 |
+------+-------+-------------------------------------------------+
|0     |4      |Reserved                                         |
+------+-------+-------------------------------------------------+
|4     |1      |Extension type(0x2)                              |
+------+-------+-------------------------------------------------+
|5     |2      |payload length                                   |
+------+-------+-------------------------------------------------+
|7     |*      |V1 IP Subnet Extension Payload                   |
+------+-------+-------------------------------------------------+
|7     |5      |Client IPv4 Subnet Mask (IPv4 only)              |
+------+-------+-------------------------------------------------+
|7     |4      |Subnet Mask                                      |
+------+-------+-------------------------------------------------+
|9     |2      |Reserved                                         |
+------+-------+-------------------------------------------------+
|11    |*      |Client IPv6 Prefix Array (zero for IPv4)         |
+------+-------+-------------------------------------------------+
|11    |1      |Number of IPv6 Prefixes in Prefix array (1 - 8)  |
+------+-------+-------------------------------------------------+
|12    |*      |Prefix Array, variable length array              |
+------+-------+-------------------------------------------------+

notes:

     1. The new V1 IP Subnet Extension borrows 7 octets from the
reserved fields immediately preceding it to form a complete extension.

     2. None of the existing fields have their offsets changed
within the PROPOSAL message.

Padding Extension Format:

+------+-------+-------------------------------+
|0     |2      |Reserved                       |
+------+-------+-------------------------------+
|2     |1      |Extension type(0x0)            |
+------+-------+-------------------------------+
|3     |*      |Payload length                 |
+------+-------+-------------------------------+
|*     |*      |Padding (fill with 0x0)        |
+------+-------+-------------------------------+

notes:

     1. The Padding Extension is used to fill reserved areas that
have not been used yet. It carries no meaning and can be replaced
in the future.

SMCv2 EXTENSION Format:

+------+-------+------------------------------------------------------------+
|0     |8      |SMCv2 Extension - Client Options Area (SMCRv2 & SMCDv2)     |
+------+-------+------------------------------------------------------------+
|0     |8      |SMCv2 Extension - Client Options Area Header                |
+------+-------+------------------------------------------------------------+
|0     |1      |EID Number                                                  |
+------+-------+------------------------------------------------------------+
|1     |1      |ISMv2 GID Number                                            |
+------+-------+------------------------------------------------------------+
|2     |1      |Flag 1 (bit 8) - Reserved                                   |
+------+-------+------------------------------------------------------------+
|3     |1      |Flag 2 (bit 8)                                              |
+------+-------+------------------------------------------------------------+
|4     |2      |Extension Header (reserved before)                          |
+------+-------+------------------------------------------------------------+
|4     |1      |Extension type(0x3)                                         |
+------+-------+------------------------------------------------------------+
|5     |1      |payload length (range 0-127)                                |
+------+-------+------------------------------------------------------------+
|6     |2      |SMCDv2 Extension Offset (if present)                        |
+------+-------+------------------------------------------------------------+
|8     |16     |RoCEv2 GID (IPv4 or IPv6 address)                           |
+------+-------+------------------------------------------------------------+
|8     |16     |RoCEv2 GID IPv6 address (when IPv6)                         |
+------+-------+------------------------------------------------------------+
|8     |12     |RoCEv2 GID IPv4 reserved (when IPv4)                        |
+------+-------+------------------------------------------------------------+
|20    |4      |RoCEv2 GID IPv4 address (right aligned)                     |
+------+-------+------------------------------------------------------------+
|24    |9      |Reserved                                                    |
+------+-------+------------------------------------------------------------+
|33    |7      |Continuation extension (reserved before)                    |
+------+-------+------------------------------------------------------------+
|33    |4      |Reserved                                                    |
+------+-------+------------------------------------------------------------+
|37    |1      |Extension type(0x1) (reserved before)                       |
+------+-------+------------------------------------------------------------+
|38    |2      |Payload length (reserved before)                            |
+------+-------+------------------------------------------------------------+
|40    |*      |EID Array Area – variable length (32 bytes * EID Number)    |
+------+-------+------------------------------------------------------------+
|*     |*      |SMCDv2 optional area (formerly called SMCDv2 extension)     |
+------+-------+------------------------------------------------------------+

notes:

     1. The new V2 EXTENSION uses several reserved octets to form a complete
extension. Note that none of the existing fields have their offsets changed
within the PROPOSAL message.

     2. The size of the SMCv2 EXTENSION plus the maximum size of the EID Array
Area is much larger than the largest number that one octet can represent. To be
compatible with the 'legacy V2 Extension', there are only 2 reserved octets left
to place the type and length fields. Therefore, we use a Continuation Extension
to solve this.

Continuation Extension Format:

+------+-------+-------------------------------+
|0     |4      |Reserved                       |
+------+-------+-------------------------------+
|4     |1      |Extension type(0x1)            |
+------+-------+-------------------------------+
|5     |*      |payload length                 |
+------+-------+-------------------------------+
|*     |*      |Continuation data              |
+------+-------+-------------------------------+

notes:

     1. It indicates that the content of this extension is a continuation of
the content of the previous extension.

     2. It exists for compatibility with some existing extensions, for which
the reserved bytes that can be used are not enough to represent
their maximum length.
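
As an illustration of how a receiver might handle Continuation Extensions,
here is a minimal sketch. It is not part of the draft. It assumes the
extensions have already been split into (type, payload) pairs in wire order,
and it uses type 0x1 for the Continuation Extension as in the tables above.
The function name merge_continuations is hypothetical.

```python
from typing import List, Tuple

# Type value taken from the Continuation Extension tables in this draft.
EXT_TYPE_CONTINUATION = 0x1


def merge_continuations(
    exts: List[Tuple[int, bytes]]
) -> List[Tuple[int, bytes]]:
    """Append each continuation payload to the extension that precedes it."""
    merged: List[Tuple[int, bytes]] = []
    for ext_type, payload in exts:
        if ext_type == EXT_TYPE_CONTINUATION:
            if not merged:
                # A continuation needs a previous extension to extend.
                raise ValueError("continuation extension without a predecessor")
            prev_type, prev_payload = merged[-1]
            merged[-1] = (prev_type, prev_payload + payload)
        else:
            merged.append((ext_type, payload))
    return merged
```

For example, a V2 extension (type 0x3) followed by a continuation carrying
the EID array would come out as a single type 0x3 entry whose payload is the
concatenation of both parts.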


CLC ACCEPT MESSAGE (SMC-DV2 FORMAT) / CLC CONFIRM MESSAGE (SMC-Dv2 FORMAT)

+------+-------+---------------------------------------------------+
|34    |32     |EID (Negotiated Common EID selected by the server) |
+------+-------+---------------------------------------------------+
|66    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|70    |*      |Extensions Area                                    |
+------+-------+---------------------------------------------------+
|70    |2      |number of Extensions  (reserved before)            |
+------+-------+---------------------------------------------------+
|72    |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|72    |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|72    |2      |Reserved                                           |
+------+-------+---------------------------------------------------+
|74    |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|74    |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|75    |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|76    |1      |Extension type (0x4) (reserved before)             |
+------+-------+---------------------------------------------------+
|77    |1      |Payload length (0x20) (reserved before)            |
+------+-------+---------------------------------------------------+
|78    |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|110   |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCD’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

CLC ACCEPT MESSAGE (SMC-RV2 FORMAT)

+------+-------+---------------------------------------------------+
|64    |32     |EID (Negotiated EID selected by server)            |
+------+-------+---------------------------------------------------+
|96    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|100   |*      |Extension Area                                     |
+------+-------+---------------------------------------------------+
|100   |2      |number of Extensions (reserved before)             |
+------+-------+---------------------------------------------------+
|102   |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|102   |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|102   |2      |reserved                                           |
+------+-------+---------------------------------------------------+
|104   |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|104   |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|105   |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|106   |1      |Extension type (0x4) (reserved before)             |
+------+-------+---------------------------------------------------+
|107   |1      |Payload length (0x20) (reserved before)            |
+------+-------+---------------------------------------------------+
|108   |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|140   |16     |Padding Extension (reserved before)                |
+------+-------+---------------------------------------------------+
|156   |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

notes:

     1. none of the existing fields have their offsets changed
within the message.


First Contact Extension Format:

+------+-------+----------------------------------------------------------------+
|0     |6      |First Contact Extension Header                                  |
+------+-------+----------------------------------------------------------------+
|0     |2      |Reserved                                                        |
+------+-------+----------------------------------------------------------------+
|2     |1      |FCE Header Flag 0                                               |
+------+-------+----------------------------------------------------------------+
|3     |1      |FCE Header Flag 1                                               |
+------+-------+----------------------------------------------------------------+
|4     |1      |Extension type (0x4) (reserved before)                          |
+------+-------+----------------------------------------------------------------+
|5     |1      |Payload length (0x20) (reserved before)                         |
+------+-------+----------------------------------------------------------------+
|6     |32     |FCE Peer Host Name (ASCII character - padded with ASCII blanks) |
+------+-------+----------------------------------------------------------------+

notes:

     1. The new First Contact Extension borrows 2 octets from the
reserved fields immediately preceding it to form a complete extension.
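
For reference, the 38-octet layout in the First Contact Extension Format
table can be sketched as follows. This is only an illustration of the table
and is not part of the draft. The function name and the default flag values
of zero are assumptions.

```python
import struct

FCE_EXT_TYPE = 0x4        # extension type from the table above
FCE_HOSTNAME_LEN = 0x20   # payload length from the table above


def pack_first_contact_extension(
    hostname: str, flag0: int = 0, flag1: int = 0
) -> bytes:
    """Build a First Contact Extension following the format table."""
    name = hostname.encode("ascii")
    if len(name) > FCE_HOSTNAME_LEN:
        raise ValueError("peer host name longer than 32 octets")
    # Pad the peer host name with ASCII blanks, as the table specifies.
    name = name.ljust(FCE_HOSTNAME_LEN, b" ")
    # Layout: 2 reserved octets, flag 0, flag 1, type, length, host name.
    return struct.pack(
        "!2xBBBB32s", flag0, flag1, FCE_EXT_TYPE, FCE_HOSTNAME_LEN, name
    )
```

The type and length octets land at offsets 4 and 5, which are the two octets
marked "(reserved before)", so the legacy flag octets at offsets 2 and 3 keep
their positions.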


CLC CONFIRM MESSAGE (SMC-RV2 FORMAT)

+------+-------+---------------------------------------------------+
|64    |32     |EID (Negotiated EID selected by server)            |
+------+-------+---------------------------------------------------+
|96    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|100   |*      |Extension Area                                     |
+------+-------+---------------------------------------------------+
|100   |2      |number of Extensions (reserved before)             |
+------+-------+---------------------------------------------------+
|102   |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|102   |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|102   |2      |reserved                                           |
+------+-------+---------------------------------------------------+
|104   |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|104   |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|105   |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|106   |1      |Extension type (0x4)  (reserved before)            |
+------+-------+---------------------------------------------------+
|107   |1      |Payload length (0x20)                              |
+------+-------+---------------------------------------------------+
|108   |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|140   |9      |PADDING extension (reserved before)                |
+------+-------+---------------------------------------------------+
|149   |*      |Client RoCEv2 GID Extension                        |
+------+-------+---------------------------------------------------+
|149   |7      |Client RoCEv2 GID Extension Header(reserved before)|
+------+-------+---------------------------------------------------+
|156   |*      |FCE Client RoCEv2 GID List                         |
+------+-------+---------------------------------------------------+
|*     |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

notes:

     1. Client RoCEv2 GID was once part of the First Contact Extension; it is
now a standalone extension.
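
The offsets in the CONFIRM table can be cross-checked mechanically. The
sketch below restates the table rows as constants (the enum names are
hypothetical) and lets the compiler verify that each area starts exactly
where the previous one ends.

```c
#include <assert.h>

/* Offsets and lengths copied from the CONFIRM table; names are
 * illustrative only. */
enum confirm_v2_off {
	OFF_EID         = 64,
	OFF_RESERVED    = 96,
	OFF_NUM_EXT     = 100,
	OFF_FCE         = 102,
	OFF_PADDING_EXT = 140,
	OFF_GID_EXT     = 149,
	OFF_GID_LIST    = 156,
};

enum confirm_v2_len {
	LEN_EID         = 32,
	LEN_RESERVED    = 4,
	LEN_NUM_EXT     = 2,
	LEN_FCE         = 38,
	LEN_PADDING_EXT = 9,
	LEN_GID_EXT_HDR = 7,
};

static_assert(OFF_EID + LEN_EID == OFF_RESERVED, "EID -> reserved");
static_assert(OFF_RESERVED + LEN_RESERVED == OFF_NUM_EXT,
	      "reserved -> number of extensions");
static_assert(OFF_NUM_EXT + LEN_NUM_EXT == OFF_FCE,
	      "number of extensions -> FCE");
static_assert(OFF_FCE + LEN_FCE == OFF_PADDING_EXT, "FCE -> padding");
static_assert(OFF_PADDING_EXT + LEN_PADDING_EXT == OFF_GID_EXT,
	      "padding -> GID extension");
static_assert(OFF_GID_EXT + LEN_GID_EXT_HDR == OFF_GID_LIST,
	      "GID extension header -> GID list");
```

The 9-octet PADDING extension is what lets the GID list stay at offset
156 after the 7-octet GID extension header is carved out of the
reserved area in front of it.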

Client RoCEv2 GID Extension Format:

+------+-------+---------------------------------------------------+
|0     |*      |Client RoCEv2 GID                                  |
+------+-------+---------------------------------------------------+
|0     |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|4     |1      |Extension type(0x5)                                |
+------+-------+---------------------------------------------------+
|5     |2      |Payload length                                     |
+------+-------+---------------------------------------------------+
|7     |*      |Client RoCEv2 GID List                             |
+------+-------+---------------------------------------------------+
|7     |4      |GID List Header                                    |
+------+-------+---------------------------------------------------+
|7     |1      |GID List No of Entries (1 - 8)                     |
+------+-------+---------------------------------------------------+
|8     |3      |Reserved                                           |
+------+-------+---------------------------------------------------+
|11    |*      |GID List Array Area                                |
+------+-------+---------------------------------------------------+
|11    |16     |GID List Entry - RoCEv2 IP address (IPv4 or IPv6)  |
+------+-------+---------------------------------------------------+
|*     |*      |End of Client GID List                             |
+------+-------+---------------------------------------------------+

notes:

     1. The new Client RoCEv2 GID Extension borrows 7 octets from the
reserved fields immediately preceding it to form a complete extension.

     2. No existing field changes its offset within the CONFIRM
message.
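
A receiver could validate the Client RoCEv2 GID Extension roughly as
sketched below. The function and macro names are hypothetical. Two
details are assumptions that the tables do not state: the 2-octet
payload length is taken to be in network byte order, and it is taken to
cover everything after the 7-octet header (GID list header plus
entries), by analogy with the FCE, whose payload length 0x20 covers only
the host name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GID_EXT_TYPE      0x5
#define GID_EXT_HDR_LEN   7   /* 4 reserved + 1 type + 2 payload length */
#define GID_LIST_HDR_LEN  4   /* 1 entry count + 3 reserved */
#define GID_ENTRY_LEN     16  /* RoCEv2 IP address, IPv4 or IPv6 */
#define GID_MAX_ENTRIES   8

static_assert(GID_EXT_HDR_LEN + GID_LIST_HDR_LEN == 11,
	      "GID List Array Area starts at offset 11");

/*
 * Returns the number of GID entries (1 - 8), or -1 if the buffer does
 * not hold a well-formed extension. On success *gids points at the
 * first 16-octet entry inside buf.
 */
static int gid_ext_parse(const uint8_t *buf, size_t len,
			 const uint8_t **gids)
{
	size_t payload, need;
	unsigned int n;

	if (len < GID_EXT_HDR_LEN + GID_LIST_HDR_LEN)
		return -1;
	if (buf[4] != GID_EXT_TYPE)
		return -1;

	/* Assumption: payload length is big-endian (network order). */
	payload = ((size_t)buf[5] << 8) | buf[6];

	n = buf[7]; /* GID List No of Entries */
	if (n < 1 || n > GID_MAX_ENTRIES)
		return -1;

	/* Assumption: payload = list header + all entries. */
	need = GID_LIST_HDR_LEN + (size_t)n * GID_ENTRY_LEN;
	if (payload != need || len < GID_EXT_HDR_LEN + need)
		return -1;

	*gids = buf + GID_EXT_HDR_LEN + GID_LIST_HDR_LEN;
	return (int)n;
}
```

Because type and length sit at fixed positions in the header, a peer
that does not know type 0x5 could in principle skip the extension by
its length alone.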


* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-16 13:49       ` D. Wythe
@ 2022-06-23 11:59         ` Alexandra Winter
  2022-06-23 12:50           ` D. Wythe
  0 siblings, 1 reply; 13+ messages in thread
From: Alexandra Winter @ 2022-06-23 11:59 UTC (permalink / raw)
  To: D. Wythe, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 16.06.22 15:49, D. Wythe wrote:
> 
> 
> On 2022/6/1 2:33 PM, D. Wythe wrote:
>>
>> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
>>
>>> We need to carefully evaluate them and make sure everything is compatible
>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>>> typical s390 environment ROCE LAG is probably not good enough, as the card
>>> is still a single point of failure. So your ideas need to be compatible
>>> with link redundancy. We also need to consider that the extension of the
>>> protocol does not block other desirable extensions.
>>>
>>> Your prototype is very helpful for the understanding. Before submitting any
>>> code patches to net-next, we should agree on the details of the protocol
>>> extension. Maybe you could formulate your proposal in plain text, so we can
>>> discuss it here?
>>>
>>> We also need to inform you that several public holidays are upcoming in the
>>> next weeks and several of our team will be out for summer vacation, so please
>>> allow for longer response times.
>>>
>>> Kind regards
>>> Alexandra Winter
>>>
>>
>> Hi all,
>>
>> To achieve single-link compatibility, we must
>> complete at least one negotiation. We would like the mechanism
>> to remain extensible while providing this feature. There are
>> a few ways to achieve this.
>>
>> 1. Use the available reserved bits. According to
>> the SMC v2 protocol, at least 28 reserved octets are available
>> in the PROPOSAL MESSAGE and at least 10 reserved octets in the
>> ACCEPT MESSAGE. We can define a feature area there that
>> works like a bitmap. With future extensibility in mind, we MAY
>> use at least 2 reserved octets, which can support negotiation
>> of at least 16 features.
>>
>> 2. Unify all the areas named "extension" in the current
>> SMC v2 protocol spec, without reinterpreting any existing field
>> or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
>> the ability to grow dynamically as needs expand. This scheme will use
>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4
>> reserved octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately,
>> we only need reserved fields, and the current reserved fields
>> are sufficient. We can then easily add a new extension named
>> SINGLE LINK. For reasons of space, the details will be elaborated
>> after the scheme is finalized.
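
For illustration only, option 1 could look roughly like the sketch
below: two reserved octets are treated as a 16-bit feature mask, and the
features in effect are those set by both peers. The bit assignments and
names are hypothetical; the thread names only a single-link feature and
does not fix any bit positions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments within the 2-octet feature area. */
#define FEAT_SINGLE_LINK  (1u << 0)
#define FEAT_PLACEHOLDER  (1u << 1)  /* stands in for a future feature */

static_assert(sizeof(uint16_t) * 8 == 16,
	      "two octets carry 16 feature bits");

/* The agreed feature set is the intersection of both masks. */
static uint16_t feat_negotiate(uint16_t proposed, uint16_t supported)
{
	return proposed & supported;
}

/* Non-zero if the given feature bit survived negotiation. */
static int feat_enabled(uint16_t agreed, uint16_t bit)
{
	return (agreed & bit) != 0;
}
```

A peer that leaves the reserved octets at zero advertises no features,
so the intersection is empty and both sides fall back to the existing
behaviour.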
>>
[...]
>>
>>
>> Look forward to your advice and comments.
>>
>> Thanks.
> 
> Hi all,
> 
> Building on the previous proposal, if we can carry application data in the PROPOSAL message,
> we can achieve SMC 0-RTT. The process should be similar to the following:
> 
[...]

Thank you D. Wythe for the detailed proposal, I have forwarded it to the protocol owner
and we are currently reviewing it. 
We may contact you and Tony Lu directly to discuss the details, if that is ok for you.

Kind regards
Alexandra Winter







* Re: [RFC net-next] net/smc:introduce 1RTT to SMC
  2022-06-23 11:59         ` Alexandra Winter
@ 2022-06-23 12:50           ` D. Wythe
  0 siblings, 0 replies; 13+ messages in thread
From: D. Wythe @ 2022-06-23 12:50 UTC (permalink / raw)
  To: Alexandra Winter, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 2022/6/23 7:59 PM, Alexandra Winter wrote:
> Thank you D. Wythe for the detailed proposal, I have forwarded it to the protocol owner
> and we are currently reviewing it.
> We may contact you and Tony Lu directly to discuss the details, if that is ok for you.
> 
> Kind regards
> Alexandra Winter

Thanks a lot for your support, it seems good to us. We are totally okay with that.

Best Wishes.
D. Wythe



Thread overview: 13+ messages
2022-05-24  6:52 [RFC net-next] net/smc:introduce 1RTT to SMC D. Wythe
2022-05-24  7:49 ` Tony Lu
2022-05-25 13:42   ` Alexandra Winter
2022-05-26  3:47     ` D. Wythe
2022-05-26  7:04     ` Tony Lu
2022-06-01  6:33     ` D. Wythe
2022-06-01  9:24       ` Tony Lu
2022-06-01 11:35         ` Alexandra Winter
2022-06-02  3:26           ` D. Wythe
2022-06-16 13:49       ` D. Wythe
2022-06-23 11:59         ` Alexandra Winter
2022-06-23 12:50           ` D. Wythe
2022-05-24 12:40 ` kernel test robot
