All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/4] Fixes for SMC
@ 2021-10-27  8:52 Tony Lu
  2021-10-27  8:52 ` [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent" Tony Lu
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Tony Lu @ 2021-10-27  8:52 UTC (permalink / raw)
  To: kgraul, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

From: Tony Lu <tony.ly@linux.alibaba.com>

We are using SMC to replace TCP, and find some issues when running TCP's
test cases.

Tony Lu (2):
  Revert "net/smc: don't wait for send buffer space when data was
    already sent"
  net/smc: Fix smc_link->llc_testlink_time overflow

Wen Gu (2):
  net/smc: Correct spelling mistake to TCPF_SYN_RECV
  net/smc: Fix wq mismatch issue caused by smc fallback

 net/smc/af_smc.c  | 17 ++++++++++++++++-
 net/smc/smc_llc.c |  2 +-
 net/smc/smc_tx.c  |  7 ++++---
 3 files changed, 21 insertions(+), 5 deletions(-)

-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27  8:52 [PATCH net 0/4] Fixes for SMC Tony Lu
@ 2021-10-27  8:52 ` Tony Lu
  2021-10-27 10:21   ` Karsten Graul
  2021-10-27  8:52 ` [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow Tony Lu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Tony Lu @ 2021-10-27  8:52 UTC (permalink / raw)
  To: kgraul, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

From: Tony Lu <tony.ly@linux.alibaba.com>

This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.

When using SMC to replace TCP, some userspace applications like netperf
don't check the return code of send syscall correctly, which means how
many bytes are sent. If rc of send() is smaller than expected, it should
try to send again, instead of exit directly. It is difficult to change
the uncorrect behaviors of userspace applications, so choose to revert it.

Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Reported-by: Jacob Qi <jacob.qi@linux.alibaba.com>
Signed-off-by: Tony Lu <tony.ly@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_tx.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 738a4a99c827..d401286e9058 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -178,11 +178,12 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
 
 		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
-			if (send_done)
-				return send_done;
 			rc = smc_tx_wait(smc, msg->msg_flags);
-			if (rc)
+			if (rc) {
+				if (send_done)
+					return send_done;
 				goto out_err;
+			}
 			continue;
 		}
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow
  2021-10-27  8:52 [PATCH net 0/4] Fixes for SMC Tony Lu
  2021-10-27  8:52 ` [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent" Tony Lu
@ 2021-10-27  8:52 ` Tony Lu
  2021-10-27 10:24   ` Karsten Graul
  2021-10-27  8:52 ` [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV Tony Lu
  2021-10-27  8:52 ` [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback Tony Lu
  3 siblings, 1 reply; 28+ messages in thread
From: Tony Lu @ 2021-10-27  8:52 UTC (permalink / raw)
  To: kgraul, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

From: Tony Lu <tony.ly@linux.alibaba.com>

The value of llc_testlink_time is set to the value stored in
net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of
sysctl_tcp_keepalive_time is already jiffies, so we don't need to
multiply by HZ, which would cause smc_link->llc_testlink_time overflow,
and test_link send flood.

Fixes: 00a049cfde95 ("net/smc: move llc layer related init and clear into smc_llc.c")
Cc: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Tony Lu <tony.ly@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_llc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 72f4b72eb175..f1d323439a2a 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1822,7 +1822,7 @@ void smc_llc_link_active(struct smc_link *link)
 			    link->smcibdev->ibdev->name, link->ibport);
 	link->state = SMC_LNK_ACTIVE;
 	if (link->lgr->llc_testlink_time) {
-		link->llc_testlink_time = link->lgr->llc_testlink_time * HZ;
+		link->llc_testlink_time = link->lgr->llc_testlink_time;
 		schedule_delayed_work(&link->llc_testlink_wrk,
 				      link->llc_testlink_time);
 	}
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV
  2021-10-27  8:52 [PATCH net 0/4] Fixes for SMC Tony Lu
  2021-10-27  8:52 ` [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent" Tony Lu
  2021-10-27  8:52 ` [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow Tony Lu
@ 2021-10-27  8:52 ` Tony Lu
  2021-10-27 10:23   ` Karsten Graul
  2021-10-27  8:52 ` [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback Tony Lu
  3 siblings, 1 reply; 28+ messages in thread
From: Tony Lu @ 2021-10-27  8:52 UTC (permalink / raw)
  To: kgraul, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

From: Wen Gu <guwen@linux.alibaba.com>

There should use TCPF_SYN_RECV instead of TCP_SYN_RECV.

Fixes: 50717a37db03 ("net/smc: nonblocking connect rework")
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tony.ly@linux.alibaba.com>
---
 net/smc/af_smc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c038efc23ce3..78b663dbfa1f 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1057,7 +1057,7 @@ static void smc_connect_work(struct work_struct *work)
 	if (smc->clcsock->sk->sk_err) {
 		smc->sk.sk_err = smc->clcsock->sk->sk_err;
 	} else if ((1 << smc->clcsock->sk->sk_state) &
-					(TCPF_SYN_SENT | TCP_SYN_RECV)) {
+					(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo);
 		if ((rc == -EPIPE) &&
 		    ((1 << smc->clcsock->sk->sk_state) &
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-10-27  8:52 [PATCH net 0/4] Fixes for SMC Tony Lu
                   ` (2 preceding siblings ...)
  2021-10-27  8:52 ` [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV Tony Lu
@ 2021-10-27  8:52 ` Tony Lu
  2021-10-28 14:16   ` Karsten Graul
  2021-10-29  9:40   ` Karsten Graul
  3 siblings, 2 replies; 28+ messages in thread
From: Tony Lu @ 2021-10-27  8:52 UTC (permalink / raw)
  To: kgraul, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

From: Wen Gu <guwen@linux.alibaba.com>

A socket_wq mismatch issue may occur because of fallback.

When use SMC to replace TCP, applications add an epoll entry into SMC
socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
once fallback occurs, which means the application's epoll fd dosen't
work anymore.

For example:
server: nginx -g 'daemon off;'

client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html

  Running 5s test @ http://11.200.15.93/index.html
    1 threads and 1 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     0.00us    0.00us   0.00us    -nan%
      Req/Sec     0.00      0.00     0.00      -nan%
    0 requests in 5.00s, 0.00B read
  Requests/sec:      0.00
  Transfer/sec:       0.00B

This patch fixes this issue by using clcsock's wq regardless of
whether fallback occurs.

Reported-by: Jacob Qi <jacob.qi@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/af_smc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 78b663dbfa1f..3b7ec0abff52 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -546,6 +546,10 @@ static void smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
 {
 	smc->use_fallback = true;
 	smc->fallback_rsn = reason_code;
+
+	/* clcsock's sock uses back to clcsock->wq, see also smc_create() */
+	rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->clcsock->wq);
+
 	smc_stat_fallback(smc);
 	if (smc->sk.sk_socket && smc->sk.sk_socket->file) {
 		smc->clcsock->file = smc->sk.sk_socket->file;
@@ -1972,6 +1976,10 @@ static int smc_accept(struct socket *sock, struct socket *new_sock,
 	if (rc)
 		goto out;
 
+	/* new smc sock uses clcsock's wq. see also smc_create() */
+	if (!smc_sk(nsk)->use_fallback)
+		rcu_assign_pointer(nsk->sk_wq, &smc_sk(nsk)->clcsock->wq);
+
 	if (lsmc->sockopt_defer_accept && !(flags & O_NONBLOCK)) {
 		/* wait till data arrives on the socket */
 		timeo = msecs_to_jiffies(lsmc->sockopt_defer_accept *
@@ -2108,6 +2116,9 @@ static __poll_t smc_poll(struct file *file, struct socket *sock,
 		mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
 		sk->sk_err = smc->clcsock->sk->sk_err;
 	} else {
+		/* use clcsock->wq in sock_poll_wait(), see also smc_create() */
+		sock = smc->clcsock;
+
 		if (sk->sk_state != SMC_CLOSED)
 			sock_poll_wait(file, sock, wait);
 		if (sk->sk_err)
@@ -2505,6 +2516,10 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
 	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
 	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
 
+	/* In case smc fallbacks to tcp, smc's sock will use clcsock's wq in advance */
+	rcu_assign_pointer(sk->sk_wq, &smc->clcsock->wq);
+	rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->sk.sk_socket->wq);
+
 out:
 	return rc;
 }
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27  8:52 ` [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent" Tony Lu
@ 2021-10-27 10:21   ` Karsten Graul
  2021-10-27 15:08     ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-27 10:21 UTC (permalink / raw)
  To: Tony Lu, davem, kuba
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

On 27/10/2021 10:52, Tony Lu wrote:
> From: Tony Lu <tony.ly@linux.alibaba.com>
> 
> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
> 
> When using SMC to replace TCP, some userspace applications like netperf
> don't check the return code of send syscall correctly, which means how
> many bytes are sent. If rc of send() is smaller than expected, it should
> try to send again, instead of exit directly. It is difficult to change
> the uncorrect behaviors of userspace applications, so choose to revert it.

Your change would restore the old behavior to handle all sockets like they 
are blocking sockets, trying forever to send the provided data bytes.
This is not how it should work.

We encountered the same issue with netperf, but this is the only 'broken'
application that we know of so far which does not implement the socket API
correctly.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV
  2021-10-27  8:52 ` [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV Tony Lu
@ 2021-10-27 10:23   ` Karsten Graul
  2021-10-28  6:53     ` Tony Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-27 10:23 UTC (permalink / raw)
  To: Tony Lu, davem, kuba
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

On 27/10/2021 10:52, Tony Lu wrote:
> From: Wen Gu <guwen@linux.alibaba.com>
> 
> There should use TCPF_SYN_RECV instead of TCP_SYN_RECV.

Thanks for fixing this, we will include your patch in our next submission
to the netdev tree.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow
  2021-10-27  8:52 ` [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow Tony Lu
@ 2021-10-27 10:24   ` Karsten Graul
  2021-10-28  6:52     ` Tony Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-27 10:24 UTC (permalink / raw)
  To: Tony Lu, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

On 27/10/2021 10:52, Tony Lu wrote:
> From: Tony Lu <tony.ly@linux.alibaba.com>
> 
> The value of llc_testlink_time is set to the value stored in
> net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of
> sysctl_tcp_keepalive_time is already jiffies, so we don't need to
> multiply by HZ, which would cause smc_link->llc_testlink_time overflow,
> and test_link send flood.

Thanks for fixing this, we will include your patch in our next submission
to the netdev tree.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27 10:21   ` Karsten Graul
@ 2021-10-27 15:08     ` Jakub Kicinski
  2021-10-27 15:38       ` Karsten Graul
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2021-10-27 15:08 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Tony Lu, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
> > From: Tony Lu <tony.ly@linux.alibaba.com>
> > 
> > This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
> > 
> > When using SMC to replace TCP, some userspace applications like netperf
> > don't check the return code of send syscall correctly, which means how
> > many bytes are sent. If rc of send() is smaller than expected, it should
> > try to send again, instead of exit directly. It is difficult to change
> > the uncorrect behaviors of userspace applications, so choose to revert it.  
> 
> Your change would restore the old behavior to handle all sockets like they 
> are blocking sockets, trying forever to send the provided data bytes.
> This is not how it should work.

Isn't the application supposed to make the socket non-blocking or
pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why 
the fix was needed in the first place.
 
> We encountered the same issue with netperf, but this is the only 'broken'
> application that we know of so far which does not implement the socket API
> correctly.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27 15:08     ` Jakub Kicinski
@ 2021-10-27 15:38       ` Karsten Graul
  2021-10-27 15:47         ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-27 15:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tony Lu, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li



On 27/10/2021 17:08, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 12:21:32 +0200 Karsten Graul wrote:
>> On 27/10/2021 10:52, Tony Lu wrote:
>>> From: Tony Lu <tony.ly@linux.alibaba.com>
>>>
>>> This reverts commit 6889b36da78a21a312d8b462c1fa25a03c2ff192.
>>>
>>> When using SMC to replace TCP, some userspace applications like netperf
>>> don't check the return code of send syscall correctly, which means how
>>> many bytes are sent. If rc of send() is smaller than expected, it should
>>> try to send again, instead of exit directly. It is difficult to change
>>> the uncorrect behaviors of userspace applications, so choose to revert it.  
>>
>> Your change would restore the old behavior to handle all sockets like they 
>> are blocking sockets, trying forever to send the provided data bytes.
>> This is not how it should work.
> 
> Isn't the application supposed to make the socket non-blocking or
> pass MSG_DONTWAIT if it doesn't want to sleep? It's unclear why 
> the fix was needed in the first place.
  
You are right, all of this non-blocking handling is already done in smc_tx_wait().
So this fix was for blocking sockets. The commit message explains the original
intention:

    net/smc: don't wait for send buffer space when data was already sent

    When there is no more send buffer space and at least 1 byte was already
    sent then return to user space. The wait is only done when no data was
    sent by the sendmsg() call.
    This fixes smc_tx_sendmsg() which tried to always send all user data and
    started to wait for free send buffer space when needed. During this wait
    the user space program was blocked in the sendmsg() call and hence not
    able to receive incoming data. When both sides were in such a situation
    then the connection stalled forever.

What we found out was that applications called sendmsg() with large data
buffers using blocking sockets. This led to the described situation, were the
solution was to early return to user space even if not all data were sent yet.
Userspace applications should not have a problem with the fact that sendmsg()
returns a smaller byte count than requested.

Reverting this patch would bring back the stalled connection problem.

>> We encountered the same issue with netperf, but this is the only 'broken'
>> application that we know of so far which does not implement the socket API
>> correctly.
> 

-- 
Karsten

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27 15:38       ` Karsten Graul
@ 2021-10-27 15:47         ` Jakub Kicinski
  2021-10-28  6:48           ` Tony Lu
  2021-10-28 11:57           ` Karsten Graul
  0 siblings, 2 replies; 28+ messages in thread
From: Jakub Kicinski @ 2021-10-27 15:47 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Tony Lu, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> What we found out was that applications called sendmsg() with large data
> buffers using blocking sockets. This led to the described situation, were the
> solution was to early return to user space even if not all data were sent yet.
> Userspace applications should not have a problem with the fact that sendmsg()
> returns a smaller byte count than requested.
> 
> Reverting this patch would bring back the stalled connection problem.

I'm not sure. The man page for send says:

       When the message does not fit into  the  send  buffer  of  the  socket,
       send()  normally blocks, unless the socket has been placed in nonblock‐
       ing I/O mode.  In nonblocking mode it would fail with the error  EAGAIN
       or  EWOULDBLOCK in this case.

dunno if that's required by POSIX or just a best practice.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27 15:47         ` Jakub Kicinski
@ 2021-10-28  6:48           ` Tony Lu
  2021-10-28 11:57           ` Karsten Graul
  1 sibling, 0 replies; 28+ messages in thread
From: Tony Lu @ 2021-10-28  6:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Karsten Graul, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Wed, Oct 27, 2021 at 08:47:10AM -0700, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
> > What we found out was that applications called sendmsg() with large data
> > buffers using blocking sockets. This led to the described situation, were the
> > solution was to early return to user space even if not all data were sent yet.
> > Userspace applications should not have a problem with the fact that sendmsg()
> > returns a smaller byte count than requested.
> > 
> > Reverting this patch would bring back the stalled connection problem.
> 
> I'm not sure. The man page for send says:
> 
>        When the message does not fit into  the  send  buffer  of  the  socket,
>        send()  normally blocks, unless the socket has been placed in nonblock‐
>        ing I/O mode.  In nonblocking mode it would fail with the error  EAGAIN
>        or  EWOULDBLOCK in this case.
> 
> dunno if that's required by POSIX or just a best practice.

The man page describes the common cases about the socket API behavior,
but depends on the implement.

For example, the connect(2) implement of TCP, it would never block, but
also provides EAGAIN errors for UNIX domain sockets.

       EAGAIN For nonblocking UNIX domain sockets, the socket is
              nonblocking, and the connection cannot be completed
              immediately.  For other socket families, there are
              insufficient entries in the routing cache.

In my opinion, if we are going to replace TCP with SMC, these userspace
socket API should behavior as same, and don't break the userspace
applications like netperf. It could be better to block here when sending
message without enough buffer.

In our benchmarks and E2E tests (Redis, MySQL, etc.), it is acceptable to
block here. Because the userspace applications usually block in the loop
until the data send out. If it blocks, the scheduler can handle it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow
  2021-10-27 10:24   ` Karsten Graul
@ 2021-10-28  6:52     ` Tony Lu
  0 siblings, 0 replies; 28+ messages in thread
From: Tony Lu @ 2021-10-28  6:52 UTC (permalink / raw)
  To: Karsten Graul
  Cc: davem, kuba, ubraun, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Wed, Oct 27, 2021 at 12:24:18PM +0200, Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
> > From: Tony Lu <tony.ly@linux.alibaba.com>
> > 
> > The value of llc_testlink_time is set to the value stored in
> > net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of
> > sysctl_tcp_keepalive_time is already jiffies, so we don't need to
> > multiply by HZ, which would cause smc_link->llc_testlink_time overflow,
> > and test_link send flood.
> 
> Thanks for fixing this, we will include your patch in our next submission
> to the netdev tree.

Thanks for your reply. There is a little mistake for my email address,
the wrong email address (tony.ly@linux.alibaba.com) should be corrected
to tonylu@linux.alibaba.com. I will send these two patches with the next
one in v2.

Cheers,
Tony Lu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV
  2021-10-27 10:23   ` Karsten Graul
@ 2021-10-28  6:53     ` Tony Lu
  0 siblings, 0 replies; 28+ messages in thread
From: Tony Lu @ 2021-10-28  6:53 UTC (permalink / raw)
  To: Karsten Graul
  Cc: davem, kuba, netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo,
	guwen, dust.li

On Wed, Oct 27, 2021 at 12:23:34PM +0200, Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
> > From: Wen Gu <guwen@linux.alibaba.com>
> > 
> > There should use TCPF_SYN_RECV instead of TCP_SYN_RECV.
> 
> Thanks for fixing this, we will include your patch in our next submission
> to the netdev tree.

I will resend it in v2 patch to fix my wrong email address
(tony.ly@linux.alibaba.com) to tonylu@linux.alibaba.com.

Cheers,
Tony Lu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-27 15:47         ` Jakub Kicinski
  2021-10-28  6:48           ` Tony Lu
@ 2021-10-28 11:57           ` Karsten Graul
  2021-10-28 14:38             ` Jakub Kicinski
  1 sibling, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-28 11:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tony Lu, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On 27/10/2021 17:47, Jakub Kicinski wrote:
> On Wed, 27 Oct 2021 17:38:27 +0200 Karsten Graul wrote:
>> What we found out was that applications called sendmsg() with large data
>> buffers using blocking sockets. This led to the described situation, were the
>> solution was to early return to user space even if not all data were sent yet.
>> Userspace applications should not have a problem with the fact that sendmsg()
>> returns a smaller byte count than requested.
>>
>> Reverting this patch would bring back the stalled connection problem.
> 
> I'm not sure. The man page for send says:
> 
>        When the message does not fit into  the  send  buffer  of  the  socket,
>        send()  normally blocks, unless the socket has been placed in nonblock‐
>        ing I/O mode.  In nonblocking mode it would fail with the error  EAGAIN
>        or  EWOULDBLOCK in this case.
> 
> dunno if that's required by POSIX or just a best practice.

I see your point, and I am also not sure about how it should work in reality.

The test case where the connection stalled is that both communication peers try
to send data larger than there is space in the local send buffer plus the remote 
receive buffer. They use blocking sockets, so if the send() call is meant to send
all data as requested then both sides would hang in send() forever/until a timeout.

In our case both sides run a send/recv loop, so allowing send() to return lesser 
bytes then requested resulted in a follow-on recv() call which freed up space in the
buffers, and the processing continues.

There is also some discussion about this topic in this SO thread
    https://stackoverflow.com/questions/19697218/can-send-on-a-tcp-socket-return-0-and-length
which points out that this (send returns smaller length) may happen already, e.g. when there
is an interruption.

So how to deal with all of this? Is it an accepted programming error when a user space 
program gets itself into this kind of situation? Since this problem depends on 
internal send/recv buffer sizes such a program might work on one system but not on 
other systems.

At the end the question might be if either such kind of a 'deadlock' is acceptable, 
or if it is okay to have send() return lesser bytes than requested.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-10-27  8:52 ` [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback Tony Lu
@ 2021-10-28 14:16   ` Karsten Graul
  2021-10-29  9:40   ` Karsten Graul
  1 sibling, 0 replies; 28+ messages in thread
From: Karsten Graul @ 2021-10-28 14:16 UTC (permalink / raw)
  To: Tony Lu, davem, kuba
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

On 27/10/2021 10:52, Tony Lu wrote:
> From: Wen Gu <guwen@linux.alibaba.com>
> 
> A socket_wq mismatch issue may occur because of fallback.
> 
> When use SMC to replace TCP, applications add an epoll entry into SMC
> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
> once fallback occurs, which means the application's epoll fd dosen't
> work anymore.
> 
> For example:
> server: nginx -g 'daemon off;'
> 
> client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
> 
>   Running 5s test @ http://11.200.15.93/index.html
>     1 threads and 1 connections
>     Thread Stats   Avg      Stdev     Max   +/- Stdev
>       Latency     0.00us    0.00us   0.00us    -nan%
>       Req/Sec     0.00      0.00     0.00      -nan%
>     0 requests in 5.00s, 0.00B read
>   Requests/sec:      0.00
>   Transfer/sec:       0.00B
> 
> This patch fixes this issue by using clcsock's wq regardless of
> whether fallback occurs.
> 

I need to spend some more time testing and thinking about this fix.
Thank you.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-28 11:57           ` Karsten Graul
@ 2021-10-28 14:38             ` Jakub Kicinski
  2021-11-01  7:04               ` Tony Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2021-10-28 14:38 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Tony Lu, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> So how to deal with all of this? Is it an accepted programming error
> when a user space program gets itself into this kind of situation?
> Since this problem depends on internal send/recv buffer sizes such a
> program might work on one system but not on other systems.

It's a gray area so unless someone else has a strong opinion we can
leave it as is.

> At the end the question might be if either such kind of a 'deadlock'
> is acceptable, or if it is okay to have send() return lesser bytes
> than requested.

Yeah.. the thing is we have better APIs for applications to ask not to
block than we do for applications to block. If someone really wants to
wait for all data to come out for performance reasons they will
struggle to get that behavior. 

We also have the small yet pernicious case where the buffer is
completely full at sendmsg() time, IOW we didn't send a single byte.
We won't be able to return "partial" results and deadlock. IDK if your
application can hit this, but it should really use non-blocking send if
it doesn't want blocking behavior..

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-10-27  8:52 ` [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback Tony Lu
  2021-10-28 14:16   ` Karsten Graul
@ 2021-10-29  9:40   ` Karsten Graul
  2021-11-01  6:15     ` Wen Gu
  1 sibling, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-10-29  9:40 UTC (permalink / raw)
  To: Tony Lu, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen, dust.li

On 27/10/2021 10:52, Tony Lu wrote:
> From: Wen Gu <guwen@linux.alibaba.com>
> 
> A socket_wq mismatch issue may occur because of fallback.
> 
> When use SMC to replace TCP, applications add an epoll entry into SMC
> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
> once fallback occurs, which means the application's epoll fd dosen't
> work anymore.

I am not sure if I understand this fix completely, please explain your intentions
for the changes in more detail.

What I see so far:
- smc_create() swaps the sk->sk_wq of the clcsocket and the new SMC socket
  - sets clcsocket sk->sk_wq to smcsocket->wq (why?)
  - sets smcsocket sk->sk_wq to clcsocket->wq (why?)
- smc_switch_to_fallback() resets the clcsock sk->sk_wq to clcsocket->wq
- smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT fallback
  - but this was already done before in smc_create() ??
- smc_poll() now always uses clcsocket->wq for the call to sock_poll_wait()

In smc_poll() the comment says that now clcsocket->wq is used for poll, whats
the relation between socket->wq and socket->sk->sk_wq here?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-10-29  9:40   ` Karsten Graul
@ 2021-11-01  6:15     ` Wen Gu
  2021-11-02  9:25       ` Karsten Graul
  0 siblings, 1 reply; 28+ messages in thread
From: Wen Gu @ 2021-11-01  6:15 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu, davem, kuba, ubraun
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li



On 2021/10/29 5:40, Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
>> From: Wen Gu <guwen@linux.alibaba.com>
>>
>> A socket_wq mismatch issue may occur because of fallback.
>>
>> When use SMC to replace TCP, applications add an epoll entry into SMC
>> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
>> once fallback occurs, which means the application's epoll fd dosen't
>> work anymore.
> 
> I am not sure if I understand this fix completely, please explain your intentions
> for the changes in more detail.
> 
> What I see so far:
> - smc_create() swaps the sk->sk_wq of the clcsocket and the new SMC socket
>    - sets clcsocket sk->sk_wq to smcsocket->wq (why?)
>    - sets smcsocket sk->sk_wq to clcsocket->wq (why?)
> - smc_switch_to_fallback() resets the clcsock sk->sk_wq to clcsocket->wq
> - smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT fallback
>    - but this was already done before in smc_create() ??
> - smc_poll() now always uses clcsocket->wq for the call to sock_poll_wait()
> 
> In smc_poll() the comment says that now clcsocket->wq is used for poll, whats
> the relation between socket->wq and socket->sk->sk_wq here?
> 

Thanks for your reply.

Before explaining my intentions, I thought it would be better to 
describe the issue I encountered first:

In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, 
wrk should fall back to TCP and get correct results theoretically, But 
in fact it only got all zeros.

For example:
server: nginx -g 'daemon off;'

client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html

   Running 5s test @ http://11.200.15.93/index.html
     1 threads and 1 connections
     Thread Stats   Avg      Stdev     Max   ± Stdev
         Latency     0.00us    0.00us   0.00us    -nan%
         Req/Sec     0.00      0.00     0.00      -nan%
        0 requests in 5.00s, 0.00B read
     Requests/sec:      0.00
     Transfer/sec:       0.00B

The reason for this result is that:

1) Before fallback: wrk uses epoll_insert() to associate an epoll fd 
with a smc socket, adding eppoll_entry into smc socket->wq, allowing 
itself to be notified when events such as EPOLL_OUT/EPOLL_IN occur on 
the smc socket.

2) After fallback: smc_switch_to_fallback() set fd->private_data as 
clcsocket. wrk starts to use TCP stack and kernel calls tcp_data_ready() 
when receving data, which uses clcsocket sk->sk_wq (equal to 
clcsocket->wq). As a result, the epoll fd hold by wrk in 1) can't 
receive epoll events and wrk stops transmitting data.

So the root cause of the issue is that wrk's eppoll_entry always stay in 
smcsocket->wq, but kernel turns to wake up 
clcsocket->sk->sk_wq(clcsocket->wq) after fallback, which makes wrk's 
epoll fd unusable.

[before fallback]
    application
         ^
         |
       (poll)
         |
         v
   smc socket->wq            clcsocket->wq
(has eppoll_entry)                .
         .                         .
         .                         .
         .                         .
         .                         .
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
         ^                         ^
         |                         |
         | (data)                  |(tcp handshake in rendezvous)
         v                         v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|

[after fallback]
    application--------------------|
                         (cant poll anything)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
(has eppoll_entry)                .
         .                         .
         .                         .
         .                         .
         .                         .
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
                                   ^
                                   |
                                   |(data)
                                   v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|


This patch's main idea is that since wait queue is switched from smc 
socket->wq to clcsocket->wq after fallback, making eppoll_entry in smc 
socket->wq unusable, maybe we can try to put eppoll_entry into 
clcsocket->wq at the beginning, set smc socket sk->sk_wq to 
clcsocket->wq before fallback, and reset clcsocket sk->sk_wq to 
clcsocket->wq after fallback. So that no matter whether fallback occurs, 
the kernel always wakes up clcsocket->wq to notify applications.


Therefore, the main modifications of this patch are as follows:

1) smc_poll() always uses clcsocket->wq for the call to sock_poll_wait()

After this modification, the user application's eppoll_entry will be 
added into clcsocket->wq at the beginning instead of smc socket->wq, 
allowing application's epoll fd to receive events such as EPOLL_IN even 
when a fallback occurs.

2) Swap the sk->sk_wq of the clcsocket and the new SMC socket in 
smc_create()

Sets smcsocket sk->sk_wq to clcsocket->wq. If SMC is used and NOT 
fallback, the sk_data_ready() will wake up clcsocket->wq, and user 
application's epoll fd will receive epoll events.

Sets clcsocket sk->sk_wq to smcsocket->wq. Since clcsocket->wq is 
occupied by smcsocket sk->sk_wq, clcsocket sk->sk_wq have to use another 
wait queue (smc socket->wq) during TCP handshake in rendezvous.

3) smc_switch_to_fallback() resets the clcsocket sk->sk_wq to clcsocket->wq

Once a fallback occurs, user application will start using clcsocket. At 
this point, clcsocket sk->sk_wq should be reseted to clcsocket->wq 
because eppoll_entry is in clcsocket->wq.

4) smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT 
fallback

The reason for doing this on the listening side is simmilar to 
smc_create(): using clcsocket->wq in concert with smc_poll() when it is 
NOT fallback. For your query 'this was already done before in 
smc_create() ? ', the new smc socket here in smc_accept() seems be 
different from the smc socket created in smc_create().

So after the fix:

[before fallback]
    application--------------------|
                                (poll)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
          .                (has eppoll_entry)
             `   .                  .
                     `  .  .   `
                 .    `    `   .
            `                      `
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
         ^                         ^
         |                         |
         |(data)                   |(tcp handshake)
         v                         v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|

[after fallback]
    application--------------------|
                                (poll)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
         .                 (has eppoll_entry)
             `   .                  .
                     `   .          .
                            `   .   .
                                    `
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
                                    ^
                                    |(data)
                                    v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|


Cheers,
Wen Gu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-10-28 14:38             ` Jakub Kicinski
@ 2021-11-01  7:04               ` Tony Lu
  2021-11-02  9:17                 ` Karsten Graul
  0 siblings, 1 reply; 28+ messages in thread
From: Tony Lu @ 2021-11-01  7:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Karsten Graul, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> > So how to deal with all of this? Is it an accepted programming error
> > when a user space program gets itself into this kind of situation?
> > Since this problem depends on internal send/recv buffer sizes such a
> > program might work on one system but not on other systems.
> 
> It's a gray area so unless someone else has a strong opinion we can
> leave it as is.

Things might be different. IMHO, the key point of this problem is to
implement the "standard" POSIX socket API, or TCP-socket compatible API.

> > At the end the question might be if either such kind of a 'deadlock'
> > is acceptable, or if it is okay to have send() return lesser bytes
> > than requested.
> 
> Yeah.. the thing is we have better APIs for applications to ask not to
> block than we do for applications to block. If someone really wants to
> wait for all data to come out for performance reasons they will
> struggle to get that behavior. 

IMO, it is better to do something to unify this behavior. Some
applications like netperf would be broken, and the people who want to use
SMC to run basic benchmark, would be confused about this, and its
compatibility with TCP. Maybe we could:
1) correct the behavior of netperf to check the rc as we discussed.
2) "copy" the behavior of TCP, and try to compatiable with TCP, though
it is a gray area.


Cheers,
Tony Lu

> We also have the small yet pernicious case where the buffer is
> completely full at sendmsg() time, IOW we didn't send a single byte.
> We won't be able to return "partial" results and deadlock. IDK if your
> application can hit this, but it should really use non-blocking send if
> it doesn't want blocking behavior..

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-11-01  7:04               ` Tony Lu
@ 2021-11-02  9:17                 ` Karsten Graul
  2021-11-03  3:06                   ` Tony Lu
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-11-02  9:17 UTC (permalink / raw)
  To: Tony Lu, Jakub Kicinski
  Cc: davem, netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, guwen,
	dust.li

On 01/11/2021 08:04, Tony Lu wrote:
> On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
>> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
>>> So how to deal with all of this? Is it an accepted programming error
>>> when a user space program gets itself into this kind of situation?
>>> Since this problem depends on internal send/recv buffer sizes such a
>>> program might work on one system but not on other systems.
>>
>> It's a gray area so unless someone else has a strong opinion we can
>> leave it as is.
> 
> Things might be different. IMHO, the key point of this problem is to
> implement the "standard" POSIX socket API, or TCP-socket compatible API.
> 
>>> At the end the question might be if either such kind of a 'deadlock'
>>> is acceptable, or if it is okay to have send() return lesser bytes
>>> than requested.
>>
>> Yeah.. the thing is we have better APIs for applications to ask not to
>> block than we do for applications to block. If someone really wants to
>> wait for all data to come out for performance reasons they will
>> struggle to get that behavior. 
> 
> IMO, it is better to do something to unify this behavior. Some
> applications like netperf would be broken, and the people who want to use
> SMC to run basic benchmark, would be confused about this, and its
> compatibility with TCP. Maybe we could:
> 1) correct the behavior of netperf to check the rc as we discussed.
> 2) "copy" the behavior of TCP, and try to compatiable with TCP, though
> it is a gray area.

I have a strong opinion here, so when the question is if the user either
encounters a deadlock or if send() returns lesser bytes than requested,
I prefer the latter behavior.
The second case is much easier to debug for users, they can do something
to handle the problem (loop around send()), and this case can even be detected
using strace. But the deadlock case is nearly not debuggable by users and there
is nothing to prevent it when the workload pattern runs into this situation
(except to not use blocking sends).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-11-01  6:15     ` Wen Gu
@ 2021-11-02  9:25       ` Karsten Graul
  2021-11-03  8:56         ` Wen Gu
  2021-11-04  4:39         ` Wen Gu
  0 siblings, 2 replies; 28+ messages in thread
From: Karsten Graul @ 2021-11-02  9:25 UTC (permalink / raw)
  To: Wen Gu, Tony Lu, davem, kuba
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li

On 01/11/2021 07:15, Wen Gu wrote:
> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
> 
> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.

Thank you for the very detailed description, I now understand the situation.

The fix is not obvious and not easy to understand for the reader of the code,
did you think about a fix that uses own sk_data_ready / sk_write_space
implementations on the SMC socket to forward the call to the clcsock in the 
fallback situation?

I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
which is right now only used for the listening socket case.

If this works this would be a much cleaner and more understandable way to fix this issue.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-11-02  9:17                 ` Karsten Graul
@ 2021-11-03  3:06                   ` Tony Lu
  2021-11-06 12:46                     ` Karsten Graul
  0 siblings, 1 reply; 28+ messages in thread
From: Tony Lu @ 2021-11-03  3:06 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Jakub Kicinski, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On Tue, Nov 02, 2021 at 10:17:15AM +0100, Karsten Graul wrote:
> On 01/11/2021 08:04, Tony Lu wrote:
> > On Thu, Oct 28, 2021 at 07:38:27AM -0700, Jakub Kicinski wrote:
> >> On Thu, 28 Oct 2021 13:57:55 +0200 Karsten Graul wrote:
> >>> So how to deal with all of this? Is it an accepted programming error
> >>> when a user space program gets itself into this kind of situation?
> >>> Since this problem depends on internal send/recv buffer sizes such a
> >>> program might work on one system but not on other systems.
> >>
> >> It's a gray area so unless someone else has a strong opinion we can
> >> leave it as is.
> > 
> > Things might be different. IMHO, the key point of this problem is to
> > implement the "standard" POSIX socket API, or TCP-socket compatible API.
> > 
> >>> At the end the question might be if either such kind of a 'deadlock'
> >>> is acceptable, or if it is okay to have send() return lesser bytes
> >>> than requested.
> >>
> >> Yeah.. the thing is we have better APIs for applications to ask not to
> >> block than we do for applications to block. If someone really wants to
> >> wait for all data to come out for performance reasons they will
> >> struggle to get that behavior. 
> > 
> > IMO, it is better to do something to unify this behavior. Some
> > applications like netperf would be broken, and the people who want to use
> > SMC to run basic benchmark, would be confused about this, and its
> > compatibility with TCP. Maybe we could:
> > 1) correct the behavior of netperf to check the rc as we discussed.
> > 2) "copy" the behavior of TCP, and try to compatiable with TCP, though
> > it is a gray area.
> 
> I have a strong opinion here, so when the question is if the user either
> encounters a deadlock or if send() returns lesser bytes than requested,
> I prefer the latter behavior.
> The second case is much easier to debug for users, they can do something
> to handle the problem (loop around send()), and this case can even be detected
> using strace. But the deadlock case is nearly not debuggable by users and there
> is nothing to prevent it when the workload pattern runs into this situation
> (except to not use blocking sends).

I agree with you. I am curious about this deadlock scene. If it was
convenient, could you provide a reproducible test case? We are also
setting up a SMC CI/CD system to find the compatible and performance
fallback problems. Maybe we could do something to make it better.

Cheers,
Tony Lu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-11-02  9:25       ` Karsten Graul
@ 2021-11-03  8:56         ` Wen Gu
  2021-11-04  4:39         ` Wen Gu
  1 sibling, 0 replies; 28+ messages in thread
From: Wen Gu @ 2021-11-03  8:56 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu, davem, kuba
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li, guwen



On 2021/11/2 下午5:25, Karsten Graul wrote:
> On 01/11/2021 07:15, Wen Gu wrote:
>> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
>>
>> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.
> 
> Thank you for the very detailed description, I now understand the situation.
> 
> The fix is not obvious and not easy to understand for the reader of the code,
> did you think about a fix that uses own sk_data_ready / sk_write_space
> implementations on the SMC socket to forward the call to the clcsock in the
> fallback situation?
> 
> I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
> which is right now only used for the listening socket case.
> 
> If this works this would be a much cleaner and more understandable way to fix this issue.
> 

Thanks for your suggestions, they are very helpful.

I will try these ways and send a new patch later.

cheers,
Wen Gu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-11-02  9:25       ` Karsten Graul
  2021-11-03  8:56         ` Wen Gu
@ 2021-11-04  4:39         ` Wen Gu
  2021-11-06 12:51           ` Karsten Graul
  1 sibling, 1 reply; 28+ messages in thread
From: Wen Gu @ 2021-11-04  4:39 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li,
	davem, kuba, guwen



On 2021/11/2 5:25 pm, Karsten Graul wrote:
> On 01/11/2021 07:15, Wen Gu wrote:
>> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
>>
>> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.
> 
> Thank you for the very detailed description, I now understand the situation.
> 
> The fix is not obvious and not easy to understand for the reader of the code,
> did you think about a fix that uses own sk_data_ready / sk_write_space
> implementations on the SMC socket to forward the call to the clcsock in the
> fallback situation?
> 
> I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
> which is right now only used for the listening socket case.
> 
> If this works this would be a much cleaner and more understandable way to fix this issue.
> 

Thanks for your suggestions about implementing SMC own sk_data_ready / 
sk_write_space and forwarding call to clcsock. It's a great idea. But I 
found some difficulties here in implementation process:

In my humble opinion, SMC own sk_write_space implementation should be 
called by clcsk->sk_write_space and complete the following steps:

1) Get smc_sock through clcsk->sk_user_data, like what did in 
smc_clcsock_data_ready().

2) Forward call to original clcsk->sk_write_space, it MIGHT wake up 
clcsk->sk_wq, depending on whether certain conditions are met.

3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space 
acctually wakes up clcsk->sk_wq.

In step 3), it seems a bit troublesome for SMC to know whether 
clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black 
box to SMC.

There might be a feasible way that add a wait_queue_head_t to 
clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC 
can report to application by waking up smc sk->sk_wq. But that seems to 
be complex and redundancy.

I'm looking forward to hear your opinion about it. Thank you!


cheers,
Wen Gu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent"
  2021-11-03  3:06                   ` Tony Lu
@ 2021-11-06 12:46                     ` Karsten Graul
  0 siblings, 0 replies; 28+ messages in thread
From: Karsten Graul @ 2021-11-06 12:46 UTC (permalink / raw)
  To: Tony Lu
  Cc: Jakub Kicinski, davem, netdev, linux-s390, linux-rdma, jacob.qi,
	xuanzhuo, guwen, dust.li

On 03/11/2021 04:06, Tony Lu wrote:
> 
> I agree with you. I am curious about this deadlock scene. If it was
> convenient, could you provide a reproducible test case? We are also
> setting up a SMC CI/CD system to find the compatible and performance
> fallback problems. Maybe we could do something to make it better.

Run an echo server that uses blocking sockets. First call recv() with an 1MB buffer
and when recv() returns then send all received bytes back to the client, no matter
how many bytes where received. Run this in a loop (recv / send).
On the client side also use only blocking sockets and a send / recv loop. Use
an 1MB data buffer and call send() with the whole 1MB of data.
Now run that with both scenarios (send() returns lesser bytes than requested vs. not).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-11-04  4:39         ` Wen Gu
@ 2021-11-06 12:51           ` Karsten Graul
  2021-11-10 12:50             ` Wen Gu
  0 siblings, 1 reply; 28+ messages in thread
From: Karsten Graul @ 2021-11-06 12:51 UTC (permalink / raw)
  To: Wen Gu, Tony Lu
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li, davem, kuba

On 04/11/2021 05:39, Wen Gu wrote:
> Thanks for your suggestions about implementing SMC own sk_data_ready / sk_write_space and forwarding call to clcsock. It's a great idea. But I found some difficulties here in implementation process:
> 
> In my humble opinion, SMC own sk_write_space implementation should be called by clcsk->sk_write_space and complete the following steps:
> 
> 1) Get smc_sock through clcsk->sk_user_data, like what did in smc_clcsock_data_ready().
> 
> 2) Forward call to original clcsk->sk_write_space, it MIGHT wake up clcsk->sk_wq, depending on whether certain conditions are met.
> 
> 3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space acctually wakes up clcsk->sk_wq.
> 
> In step 3), it seems a bit troublesome for SMC to know whether clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black box to SMC.
> 
> There might be a feasible way that add a wait_queue_head_t to clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC can report to application by waking up smc sk->sk_wq. But that seems to be complex and redundancy.

Hmm so when more certain conditions have to be met in (2) to 
actually wake up clcsk->sk_wq then this might not be the right
way to go...
So when there are no better ways I would vote for your initial patch.
But please add the complete description about how this is intended to 
work to this patch to allow readers to understand the idea behind it.

Thank you.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback
  2021-11-06 12:51           ` Karsten Graul
@ 2021-11-10 12:50             ` Wen Gu
  0 siblings, 0 replies; 28+ messages in thread
From: Wen Gu @ 2021-11-10 12:50 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu
  Cc: netdev, linux-s390, linux-rdma, jacob.qi, xuanzhuo, dust.li,
	davem, kuba, guwen



On 2021/11/6 8:51 pm, Karsten Graul wrote:
> On 04/11/2021 05:39, Wen Gu wrote:
>> Thanks for your suggestions about implementing SMC own sk_data_ready / sk_write_space and forwarding call to clcsock. It's a great idea. But I found some difficulties here in implementation process:
>>
>> In my humble opinion, SMC own sk_write_space implementation should be called by clcsk->sk_write_space and complete the following steps:
>>
>> 1) Get smc_sock through clcsk->sk_user_data, like what did in smc_clcsock_data_ready().
>>
>> 2) Forward call to original clcsk->sk_write_space, it MIGHT wake up clcsk->sk_wq, depending on whether certain conditions are met.
>>
>> 3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space acctually wakes up clcsk->sk_wq.
>>
>> In step 3), it seems a bit troublesome for SMC to know whether clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black box to SMC.
>>
>> There might be a feasible way that add a wait_queue_head_t to clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC can report to application by waking up smc sk->sk_wq. But that seems to be complex and redundancy.
> 
> Hmm so when more certain conditions have to be met in (2) to
> actually wake up clcsk->sk_wq then this might not be the right
> way to go...
> So when there are no better ways I would vote for your initial patch.
> But please add the complete description about how this is intended to
> work to this patch to allow readers to understand the idea behind it.
> 
> Thank you.
> 

Thanks for your suggestions.

Unfortunately, I found a defect about fasync_list in my initial patch. 
Swapping sk->sk_wq of smc socket and clcsocket seems also an imperfect 
way to fix the issue. I will describe this in the next new mail.

In addition, I will provide another RFC patch which is aimed to fix the 
same issue. It will be also included in the next new mail.

Thank you.

Cheers,
Wen Gu

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-11-10 12:50 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-27  8:52 [PATCH net 0/4] Fixes for SMC Tony Lu
2021-10-27  8:52 ` [PATCH net 1/4] Revert "net/smc: don't wait for send buffer space when data was already sent" Tony Lu
2021-10-27 10:21   ` Karsten Graul
2021-10-27 15:08     ` Jakub Kicinski
2021-10-27 15:38       ` Karsten Graul
2021-10-27 15:47         ` Jakub Kicinski
2021-10-28  6:48           ` Tony Lu
2021-10-28 11:57           ` Karsten Graul
2021-10-28 14:38             ` Jakub Kicinski
2021-11-01  7:04               ` Tony Lu
2021-11-02  9:17                 ` Karsten Graul
2021-11-03  3:06                   ` Tony Lu
2021-11-06 12:46                     ` Karsten Graul
2021-10-27  8:52 ` [PATCH net 2/4] net/smc: Fix smc_link->llc_testlink_time overflow Tony Lu
2021-10-27 10:24   ` Karsten Graul
2021-10-28  6:52     ` Tony Lu
2021-10-27  8:52 ` [PATCH net 3/4] net/smc: Correct spelling mistake to TCPF_SYN_RECV Tony Lu
2021-10-27 10:23   ` Karsten Graul
2021-10-28  6:53     ` Tony Lu
2021-10-27  8:52 ` [PATCH net 4/4] net/smc: Fix wq mismatch issue caused by smc fallback Tony Lu
2021-10-28 14:16   ` Karsten Graul
2021-10-29  9:40   ` Karsten Graul
2021-11-01  6:15     ` Wen Gu
2021-11-02  9:25       ` Karsten Graul
2021-11-03  8:56         ` Wen Gu
2021-11-04  4:39         ` Wen Gu
2021-11-06 12:51           ` Karsten Graul
2021-11-10 12:50             ` Wen Gu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.