All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net/smc: Add autocork support
@ 2022-02-16  3:49 Dust Li
  2022-02-16  3:53 ` dust.li
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dust Li @ 2022-02-16  3:49 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

This patch adds autocork support for SMC which could improve
throughput for small message by x2 ~ x4.

The main idea is borrowed from TCP autocork with some RDMA
specific modification:
1. The first message should never cork to make sure we won't
   bring extra latency
2. If we have posted any Tx WRs to the NIC that have not
   completed, cork the new messages until:
   a) Receive CQE for the last Tx WR
   b) We have corked enough message on the connection
3. Try to push the corked data out when we receive CQE of
   the last Tx WR to prevent the corked messages hang in
   the send queue.

Both SMC autocork and TCP autocork check the TX completion
to decide whether we should cork or not. The difference is
when we got a SMC Tx WR completion, the data have been confirmed
by the RNIC while TCP TX completion just tells us the data
have been sent out by the local NIC.

Add an atomic variable tx_pushing in smc_connection to make
sure only one can send to let it cork more and save CDC slot.

SMC autocork should not bring extra latency since the first
message will always been sent out immediately.

The qperf tcp_bw test shows more than x4 increase under small
message size with Mellanox connectX4-Lx, same result with other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows SMC autocork has not increase any
ping-pong latency.

BW test:
 client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
			-t 30 -vu tcp_bw
 server: smc_run taskset -c 1 qperf

MsgSize(Bytes)        TCP         SMC-NoCork           SMC-AutoCork
      1         2.57 MB/s     698 KB/s(-73.5%)     2.98 MB/s(16.0% )
      2          5.1 MB/s    1.41 MB/s(-72.4%)     5.82 MB/s(14.1% )
      4         10.2 MB/s    2.83 MB/s(-72.3%)     11.7 MB/s(14.7% )
      8         20.8 MB/s    5.62 MB/s(-73.0%)     22.9 MB/s(10.1% )
     16         42.5 MB/s    11.5 MB/s(-72.9%)     45.5 MB/s(7.1%  )
     32         80.7 MB/s    22.3 MB/s(-72.4%)     86.7 MB/s(7.4%  )
     64          155 MB/s    45.6 MB/s(-70.6%)      160 MB/s(3.2%  )
    128          295 MB/s    90.1 MB/s(-69.5%)      273 MB/s(-7.5% )
    256          539 MB/s     179 MB/s(-66.8%)      610 MB/s(13.2% )
    512          943 MB/s     360 MB/s(-61.8%)     1.02 GB/s(10.8% )
   1024         1.58 GB/s     710 MB/s(-56.1%)     1.91 GB/s(20.9% )
   2048         2.47 GB/s    1.34 GB/s(-45.7%)     2.92 GB/s(18.2% )
   4096         2.86 GB/s     2.5 GB/s(-12.6%)      2.4 GB/s(-16.1%)
   8192         3.89 GB/s    3.14 GB/s(-19.3%)     4.05 GB/s(4.1%  )
  16384         3.29 GB/s    4.67 GB/s(41.9% )     5.09 GB/s(54.7% )
  32768         2.73 GB/s    5.48 GB/s(100.7%)     5.49 GB/s(101.1%)
  65536            3 GB/s    4.85 GB/s(61.7% )     5.24 GB/s(74.7% )

Latency test:
 client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
			-t 30 -vu tcp_lat
 server: smc_run taskset -c 1 qperf

 MsgSize              SMC-NoCork           SMC-AutoCork
       1               9.7 us               9.6 us( -1.03%)
       2              9.43 us              9.39 us( -0.42%)
       4               9.6 us              9.35 us( -2.60%)
       8              9.42 us               9.2 us( -2.34%)
      16              9.13 us              9.43 us(  3.29%)
      32              9.19 us               9.5 us(  3.37%)
      64              9.38 us               9.5 us(  1.28%)
     128               9.9 us              9.29 us( -6.16%)
     256              9.42 us              9.26 us( -1.70%)
     512                10 us              9.45 us( -5.50%)
    1024              10.4 us               9.6 us( -7.69%)
    2048              10.4 us              10.2 us( -1.92%)
    4096                11 us              10.5 us( -4.55%)
    8192              11.7 us              11.8 us(  0.85%)
   16384              14.5 us              14.2 us( -2.07%)
   32768              19.4 us              19.3 us( -0.52%)
   65536              28.1 us              28.8 us(  2.49%)

With SMC autocork support, we can archive better throughput than
TCP in most message sizes without any latency tradeoff.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc.h     |   2 +
 net/smc/smc_cdc.c |  11 +++--
 net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index a096d8af21a0..bc7df235281c 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -192,6 +192,8 @@ struct smc_connection {
 						 * - dec on polled tx cqe
 						 */
 	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
+	atomic_t		tx_pushing;     /* nr_threads trying tx push */
+
 	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
 	u32			tx_off;		/* base offset in peer rmb */
 
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 9d5a97168969..2b37bec90824 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
 		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
 	}
 
-	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
-	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
-		wake_up(&conn->cdc_pend_tx_wq);
+	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
+		/* If this is the last pending WR complete, we must push to
+		 * prevent hang when autocork enabled.
+		 */
+		smc_tx_sndbuf_nonempty(conn);
+		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
+			wake_up(&conn->cdc_pend_tx_wq);
+	}
 	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
 
 	smc_tx_sndbuf_nonfull(smc);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 5df3940d4543..bc737ac79805 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -31,6 +31,7 @@
 #include "smc_tracepoint.h"
 
 #define SMC_TX_WORK_DELAY	0
+#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
 
 /***************************** sndbuf producer *******************************/
 
@@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
 static bool smc_tx_is_corked(struct smc_sock *smc)
 {
 	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
-
 	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
 }
 
+/* If we have pending CDC messages, do not send:
+ * Because CQE of this CDC message will happen shortly, it gives
+ * a chance to coalesce future sendmsg() payload in to one RDMA Write,
+ * without need for a timer, and with no latency trade off.
+ * Algorithm here:
+ *  1. First message should never cork
+ *  2. If we have pending CDC messages, wait for the first
+ *     message's completion
+ *  3. Don't cork to much data in a single RDMA Write to prevent burst,
+ *     total corked message should not exceed min(64k, sendbuf/2)
+ */
+static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
+				int size_goal)
+{
+	struct smc_connection *conn = &smc->conn;
+
+	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
+	    smc_tx_prepared_sends(conn) > min(size_goal,
+					      conn->sndbuf_desc->len >> 1))
+		return false;
+	return true;
+}
+
+static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
+{
+	struct smc_connection *conn = &smc->conn;
+
+	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
+		return true;
+
+	if ((msg->msg_flags & MSG_MORE ||
+	     smc_tx_is_corked(smc) ||
+	     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
+	    (atomic_read(&conn->sndbuf_space)))
+		/* for a corked socket defer the RDMA writes if
+		 * sndbuf_space is still available. The applications
+		 * should known how/when to uncork it.
+		 */
+		return true;
+
+	return false;
+}
+
 /* sndbuf producer: main API called by socket layer.
  * called under sock lock.
  */
@@ -177,6 +220,13 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 		if (msg->msg_flags & MSG_OOB)
 			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
 
+		/* If our send queue is full but peer have RMBE space,
+		 * we should send them out before wait
+		 */
+		if (!atomic_read(&conn->sndbuf_space) &&
+		    atomic_read(&conn->peer_rmbe_space) > 0)
+			smc_tx_sndbuf_nonempty(conn);
+
 		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
 			if (send_done)
 				return send_done;
@@ -235,15 +285,12 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 		 */
 		if ((msg->msg_flags & MSG_OOB) && !send_remaining)
 			conn->urg_tx_pend = true;
-		if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc) ||
-		     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
-		    (atomic_read(&conn->sndbuf_space)))
-			/* for a corked socket defer the RDMA writes if
-			 * sndbuf_space is still available. The applications
-			 * should known how/when to uncork it.
-			 */
-			continue;
-		smc_tx_sndbuf_nonempty(conn);
+
+		/* If we need to cork, do nothing and wait for the next
+		 * sendmsg() call or push on tx completion
+		 */
+		if (!smc_tx_should_cork(smc, msg))
+			smc_tx_sndbuf_nonempty(conn);
 
 		trace_smc_tx_sendmsg(smc, copylen);
 	} /* while (msg_data_left(msg)) */
@@ -590,13 +637,26 @@ static int smcd_tx_sndbuf_nonempty(struct smc_connection *conn)
 	return rc;
 }
 
-int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+static int __smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 {
-	int rc;
+	int rc = 0;
+	struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
+
+	/* No data in the send queue */
+	if (unlikely(smc_tx_prepared_sends(conn) <= 0))
+		goto out;
+
+	/* Peer don't have RMBE space */
+	if (unlikely(atomic_read(&conn->peer_rmbe_space) <= 0)) {
+		SMC_STAT_RMB_TX_PEER_FULL(smc, !conn->lnk);
+		goto out;
+	}
 
 	if (conn->killed ||
-	    conn->local_rx_ctrl.conn_state_flags.peer_conn_abort)
-		return -EPIPE;	/* connection being aborted */
+	    conn->local_rx_ctrl.conn_state_flags.peer_conn_abort) {
+		rc = -EPIPE;    /* connection being aborted */
+		goto out;
+	}
 	if (conn->lgr->is_smcd)
 		rc = smcd_tx_sndbuf_nonempty(conn);
 	else
@@ -608,6 +668,36 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 						    conn);
 		smc_close_wake_tx_prepared(smc);
 	}
+
+out:
+	return rc;
+}
+
+int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+{
+	int rc;
+
+	/* This make sure only one can send simultaneously to prevent wasting
+	 * of CPU and CDC slot.
+	 * Record whether someone has tried to push while we are pushing.
+	 */
+	if (atomic_inc_return(&conn->tx_pushing) > 1)
+		return 0;
+
+again:
+	atomic_set(&conn->tx_pushing, 1);
+	smp_wmb(); /* Make sure tx_pushing is 1 before real send */
+	rc = __smc_tx_sndbuf_nonempty(conn);
+
+	/* We need to check whether someone else have added some data into
+	 * the send queue and tried to push but failed after the atomic_set()
+	 * when we are pushing.
+	 * If so, we need to push again to prevent those data hang in the send
+	 * queue.
+	 */
+	if (unlikely(!atomic_dec_and_test(&conn->tx_pushing)))
+		goto again;
+
 	return rc;
 }
 
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16  3:49 [PATCH] net/smc: Add autocork support Dust Li
@ 2022-02-16  3:53 ` dust.li
  2022-02-16 10:32 ` Karsten Graul
  2022-02-16 13:58 ` Stefan Raspl
  2 siblings, 0 replies; 16+ messages in thread
From: dust.li @ 2022-02-16  3:53 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Feb 16, 2022 at 11:49:03AM +0800, Dust Li wrote:

Sorry for forgetting to change the --subject-prefix !

This patch is based on net-next, and is targeted to net-next.

Many thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16  3:49 [PATCH] net/smc: Add autocork support Dust Li
  2022-02-16  3:53 ` dust.li
@ 2022-02-16 10:32 ` Karsten Graul
  2022-02-16 10:55   ` dust.li
  2022-02-16 13:58 ` Stefan Raspl
  2 siblings, 1 reply; 16+ messages in thread
From: Karsten Graul @ 2022-02-16 10:32 UTC (permalink / raw)
  To: Dust Li, Tony Lu; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

On 16/02/2022 04:49, Dust Li wrote:
> This patch adds autocork support for SMC which could improve
> throughput for small message by x2 ~ x4.
> 
> The main idea is borrowed from TCP autocork with some RDMA
> specific modification:

Sounds like a valuable improvement, thank you!

> ---
>  net/smc/smc.h     |   2 +
>  net/smc/smc_cdc.c |  11 +++--
>  net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 114 insertions(+), 17 deletions(-)
> 
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index a096d8af21a0..bc7df235281c 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -192,6 +192,8 @@ struct smc_connection {
>  						 * - dec on polled tx cqe
>  						 */
>  	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
> +	atomic_t		tx_pushing;     /* nr_threads trying tx push */
> +

Is this extra empty line needed?

>  	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
>  	u32			tx_off;		/* base offset in peer rmb */
>  
> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
> index 9d5a97168969..2b37bec90824 100644
> --- a/net/smc/smc_cdc.c
> +++ b/net/smc/smc_cdc.c
> @@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
>  		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
>  	}
>  
> -	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
> -	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> -		wake_up(&conn->cdc_pend_tx_wq);
> +	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
> +		/* If this is the last pending WR complete, we must push to
> +		 * prevent hang when autocork enabled.
> +		 */
> +		smc_tx_sndbuf_nonempty(conn);
> +		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> +			wake_up(&conn->cdc_pend_tx_wq);
> +	}
>  	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
>  
>  	smc_tx_sndbuf_nonfull(smc);
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index 5df3940d4543..bc737ac79805 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -31,6 +31,7 @@
>  #include "smc_tracepoint.h"
>  
>  #define SMC_TX_WORK_DELAY	0
> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
>  
>  /***************************** sndbuf producer *******************************/
>  
> @@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
>  static bool smc_tx_is_corked(struct smc_sock *smc)
>  {
>  	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
> -
>  	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
>  }
>  
> +/* If we have pending CDC messages, do not send:
> + * Because CQE of this CDC message will happen shortly, it gives
> + * a chance to coalesce future sendmsg() payload in to one RDMA Write,
> + * without need for a timer, and with no latency trade off.
> + * Algorithm here:
> + *  1. First message should never cork
> + *  2. If we have pending CDC messages, wait for the first
> + *     message's completion
> + *  3. Don't cork to much data in a single RDMA Write to prevent burst,
> + *     total corked message should not exceed min(64k, sendbuf/2)
> + */
> +static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
> +				int size_goal)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
> +	    smc_tx_prepared_sends(conn) > min(size_goal,
> +					      conn->sndbuf_desc->len >> 1))
> +		return false;
> +	return true;
> +}
> +
> +static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
> +		return true;
> +
> +	if ((msg->msg_flags & MSG_MORE ||
> +	     smc_tx_is_corked(smc) ||
> +	     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
> +	    (atomic_read(&conn->sndbuf_space)))
> +		/* for a corked socket defer the RDMA writes if
> +		 * sndbuf_space is still available. The applications
> +		 * should known how/when to uncork it.
> +		 */
> +		return true;
> +
> +	return false;
> +}
> +
>  /* sndbuf producer: main API called by socket layer.
>   * called under sock lock.
>   */
> @@ -177,6 +220,13 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
>  		if (msg->msg_flags & MSG_OOB)
>  			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
>  
> +		/* If our send queue is full but peer have RMBE space,
> +		 * we should send them out before wait
> +		 */
> +		if (!atomic_read(&conn->sndbuf_space) &&
> +		    atomic_read(&conn->peer_rmbe_space) > 0)
> +			smc_tx_sndbuf_nonempty(conn);
> +
>  		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
>  			if (send_done)
>  				return send_done;
> @@ -235,15 +285,12 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
>  		 */
>  		if ((msg->msg_flags & MSG_OOB) && !send_remaining)
>  			conn->urg_tx_pend = true;
> -		if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc) ||
> -		     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
> -		    (atomic_read(&conn->sndbuf_space)))
> -			/* for a corked socket defer the RDMA writes if
> -			 * sndbuf_space is still available. The applications
> -			 * should known how/when to uncork it.
> -			 */
> -			continue;
> -		smc_tx_sndbuf_nonempty(conn);
> +
> +		/* If we need to cork, do nothing and wait for the next
> +		 * sendmsg() call or push on tx completion
> +		 */
> +		if (!smc_tx_should_cork(smc, msg))
> +			smc_tx_sndbuf_nonempty(conn);
>  
>  		trace_smc_tx_sendmsg(smc, copylen);
>  	} /* while (msg_data_left(msg)) */
> @@ -590,13 +637,26 @@ static int smcd_tx_sndbuf_nonempty(struct smc_connection *conn)
>  	return rc;
>  }
>  
> -int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
> +static int __smc_tx_sndbuf_nonempty(struct smc_connection *conn)
>  {
> -	int rc;
> +	int rc = 0;
> +	struct smc_sock *smc = container_of(conn, struct smc_sock, conn);

Reverse Christmas tree style please.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16 10:32 ` Karsten Graul
@ 2022-02-16 10:55   ` dust.li
  0 siblings, 0 replies; 16+ messages in thread
From: dust.li @ 2022-02-16 10:55 UTC (permalink / raw)
  To: Karsten Graul, Tony Lu; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Feb 16, 2022 at 11:32:52AM +0100, Karsten Graul wrote:
>On 16/02/2022 04:49, Dust Li wrote:
>> This patch adds autocork support for SMC which could improve
>> throughput for small message by x2 ~ x4.
>> 
>> The main idea is borrowed from TCP autocork with some RDMA
>> specific modification:
>
>Sounds like a valuable improvement, thank you!
>
>> ---
>>  net/smc/smc.h     |   2 +
>>  net/smc/smc_cdc.c |  11 +++--
>>  net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
>>  3 files changed, 114 insertions(+), 17 deletions(-)
>> 
>> diff --git a/net/smc/smc.h b/net/smc/smc.h
>> index a096d8af21a0..bc7df235281c 100644
>> --- a/net/smc/smc.h
>> +++ b/net/smc/smc.h
>> @@ -192,6 +192,8 @@ struct smc_connection {
>>  						 * - dec on polled tx cqe
>>  						 */
>>  	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
>> +	atomic_t		tx_pushing;     /* nr_threads trying tx push */
>> +
>
>Is this extra empty line needed?

Will remove this empty line in the next version.

>
>>  	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
>>  	u32			tx_off;		/* base offset in peer rmb */
>>  
>> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
>> index 9d5a97168969..2b37bec90824 100644
>> --- a/net/smc/smc_cdc.c
>> +++ b/net/smc/smc_cdc.c
>> @@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
>>  		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
>>  	}
>>  
>> -	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
>> -	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
>> -		wake_up(&conn->cdc_pend_tx_wq);
>> +	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
>> +		/* If this is the last pending WR complete, we must push to
>> +		 * prevent hang when autocork enabled.
>> +		 */
>> +		smc_tx_sndbuf_nonempty(conn);
>> +		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
>> +			wake_up(&conn->cdc_pend_tx_wq);
>> +	}
>>  	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
>>  
>>  	smc_tx_sndbuf_nonfull(smc);
>> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>> index 5df3940d4543..bc737ac79805 100644
>> --- a/net/smc/smc_tx.c
>> +++ b/net/smc/smc_tx.c
>> @@ -31,6 +31,7 @@
>>  #include "smc_tracepoint.h"
>>  
>>  #define SMC_TX_WORK_DELAY	0
>> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
>>  
>>  /***************************** sndbuf producer *******************************/
>>  
>> @@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
>>  static bool smc_tx_is_corked(struct smc_sock *smc)
>>  {
>>  	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
>> -
>>  	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
>>  }
>>  
>> +/* If we have pending CDC messages, do not send:
>> + * Because CQE of this CDC message will happen shortly, it gives
>> + * a chance to coalesce future sendmsg() payload in to one RDMA Write,
>> + * without need for a timer, and with no latency trade off.
>> + * Algorithm here:
>> + *  1. First message should never cork
>> + *  2. If we have pending CDC messages, wait for the first
>> + *     message's completion
>> + *  3. Don't cork to much data in a single RDMA Write to prevent burst,
>> + *     total corked message should not exceed min(64k, sendbuf/2)
>> + */
>> +static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
>> +				int size_goal)
>> +{
>> +	struct smc_connection *conn = &smc->conn;
>> +
>> +	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
>> +	    smc_tx_prepared_sends(conn) > min(size_goal,
>> +					      conn->sndbuf_desc->len >> 1))
>> +		return false;
>> +	return true;
>> +}
>> +
>> +static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
>> +{
>> +	struct smc_connection *conn = &smc->conn;
>> +
>> +	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
>> +		return true;
>> +
>> +	if ((msg->msg_flags & MSG_MORE ||
>> +	     smc_tx_is_corked(smc) ||
>> +	     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
>> +	    (atomic_read(&conn->sndbuf_space)))
>> +		/* for a corked socket defer the RDMA writes if
>> +		 * sndbuf_space is still available. The applications
>> +		 * should known how/when to uncork it.
>> +		 */
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  /* sndbuf producer: main API called by socket layer.
>>   * called under sock lock.
>>   */
>> @@ -177,6 +220,13 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
>>  		if (msg->msg_flags & MSG_OOB)
>>  			conn->local_tx_ctrl.prod_flags.urg_data_pending = 1;
>>  
>> +		/* If our send queue is full but peer have RMBE space,
>> +		 * we should send them out before wait
>> +		 */
>> +		if (!atomic_read(&conn->sndbuf_space) &&
>> +		    atomic_read(&conn->peer_rmbe_space) > 0)
>> +			smc_tx_sndbuf_nonempty(conn);
>> +
>>  		if (!atomic_read(&conn->sndbuf_space) || conn->urg_tx_pend) {
>>  			if (send_done)
>>  				return send_done;
>> @@ -235,15 +285,12 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
>>  		 */
>>  		if ((msg->msg_flags & MSG_OOB) && !send_remaining)
>>  			conn->urg_tx_pend = true;
>> -		if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc) ||
>> -		     msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
>> -		    (atomic_read(&conn->sndbuf_space)))
>> -			/* for a corked socket defer the RDMA writes if
>> -			 * sndbuf_space is still available. The applications
>> -			 * should known how/when to uncork it.
>> -			 */
>> -			continue;
>> -		smc_tx_sndbuf_nonempty(conn);
>> +
>> +		/* If we need to cork, do nothing and wait for the next
>> +		 * sendmsg() call or push on tx completion
>> +		 */
>> +		if (!smc_tx_should_cork(smc, msg))
>> +			smc_tx_sndbuf_nonempty(conn);
>>  
>>  		trace_smc_tx_sendmsg(smc, copylen);
>>  	} /* while (msg_data_left(msg)) */
>> @@ -590,13 +637,26 @@ static int smcd_tx_sndbuf_nonempty(struct smc_connection *conn)
>>  	return rc;
>>  }
>>  
>> -int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
>> +static int __smc_tx_sndbuf_nonempty(struct smc_connection *conn)
>>  {
>> -	int rc;
>> +	int rc = 0;
>> +	struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
>
>Reverse Christmas tree style please.

Sure, will do.

Thank you !


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16  3:49 [PATCH] net/smc: Add autocork support Dust Li
  2022-02-16  3:53 ` dust.li
  2022-02-16 10:32 ` Karsten Graul
@ 2022-02-16 13:58 ` Stefan Raspl
  2022-02-16 15:27   ` dust.li
  2 siblings, 1 reply; 16+ messages in thread
From: Stefan Raspl @ 2022-02-16 13:58 UTC (permalink / raw)
  To: Dust Li, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma

On 2/16/22 04:49, Dust Li wrote:
> This patch adds autocork support for SMC which could improve
> throughput for small message by x2 ~ x4.
> 
> The main idea is borrowed from TCP autocork with some RDMA
> specific modification:
> 1. The first message should never cork to make sure we won't
>     bring extra latency
> 2. If we have posted any Tx WRs to the NIC that have not
>     completed, cork the new messages until:
>     a) Receive CQE for the last Tx WR
>     b) We have corked enough message on the connection
> 3. Try to push the corked data out when we receive CQE of
>     the last Tx WR to prevent the corked messages hang in
>     the send queue.
> 
> Both SMC autocork and TCP autocork check the TX completion
> to decide whether we should cork or not. The difference is
> when we got a SMC Tx WR completion, the data have been confirmed
> by the RNIC while TCP TX completion just tells us the data
> have been sent out by the local NIC.
> 
> Add an atomic variable tx_pushing in smc_connection to make
> sure only one can send to let it cork more and save CDC slot.
> 
> SMC autocork should not bring extra latency since the first
> message will always been sent out immediately.
> 
> The qperf tcp_bw test shows more than x4 increase under small
> message size with Mellanox connectX4-Lx, same result with other
> throughput benchmarks like sockperf/netperf.
> The qperf tcp_lat test shows SMC autocork has not increase any
> ping-pong latency.
> 
> BW test:
>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
> 			-t 30 -vu tcp_bw
>   server: smc_run taskset -c 1 qperf
> 
> MsgSize(Bytes)        TCP         SMC-NoCork           SMC-AutoCork
>        1         2.57 MB/s     698 KB/s(-73.5%)     2.98 MB/s(16.0% )
>        2          5.1 MB/s    1.41 MB/s(-72.4%)     5.82 MB/s(14.1% )
>        4         10.2 MB/s    2.83 MB/s(-72.3%)     11.7 MB/s(14.7% )
>        8         20.8 MB/s    5.62 MB/s(-73.0%)     22.9 MB/s(10.1% )
>       16         42.5 MB/s    11.5 MB/s(-72.9%)     45.5 MB/s(7.1%  )
>       32         80.7 MB/s    22.3 MB/s(-72.4%)     86.7 MB/s(7.4%  )
>       64          155 MB/s    45.6 MB/s(-70.6%)      160 MB/s(3.2%  )
>      128          295 MB/s    90.1 MB/s(-69.5%)      273 MB/s(-7.5% )
>      256          539 MB/s     179 MB/s(-66.8%)      610 MB/s(13.2% )
>      512          943 MB/s     360 MB/s(-61.8%)     1.02 GB/s(10.8% )
>     1024         1.58 GB/s     710 MB/s(-56.1%)     1.91 GB/s(20.9% )
>     2048         2.47 GB/s    1.34 GB/s(-45.7%)     2.92 GB/s(18.2% )
>     4096         2.86 GB/s     2.5 GB/s(-12.6%)      2.4 GB/s(-16.1%)
>     8192         3.89 GB/s    3.14 GB/s(-19.3%)     4.05 GB/s(4.1%  )
>    16384         3.29 GB/s    4.67 GB/s(41.9% )     5.09 GB/s(54.7% )
>    32768         2.73 GB/s    5.48 GB/s(100.7%)     5.49 GB/s(101.1%)
>    65536            3 GB/s    4.85 GB/s(61.7% )     5.24 GB/s(74.7% )
> 
> Latency test:
>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
> 			-t 30 -vu tcp_lat
>   server: smc_run taskset -c 1 qperf
> 
>   MsgSize              SMC-NoCork           SMC-AutoCork
>         1               9.7 us               9.6 us( -1.03%)
>         2              9.43 us              9.39 us( -0.42%)
>         4               9.6 us              9.35 us( -2.60%)
>         8              9.42 us               9.2 us( -2.34%)
>        16              9.13 us              9.43 us(  3.29%)
>        32              9.19 us               9.5 us(  3.37%)
>        64              9.38 us               9.5 us(  1.28%)
>       128               9.9 us              9.29 us( -6.16%)
>       256              9.42 us              9.26 us( -1.70%)
>       512                10 us              9.45 us( -5.50%)
>      1024              10.4 us               9.6 us( -7.69%)
>      2048              10.4 us              10.2 us( -1.92%)
>      4096                11 us              10.5 us( -4.55%)
>      8192              11.7 us              11.8 us(  0.85%)
>     16384              14.5 us              14.2 us( -2.07%)
>     32768              19.4 us              19.3 us( -0.52%)
>     65536              28.1 us              28.8 us(  2.49%)
> 
> With SMC autocork support, we can archive better throughput than
> TCP in most message sizes without any latency tradeoff.
> 
> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   net/smc/smc.h     |   2 +
>   net/smc/smc_cdc.c |  11 +++--
>   net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
>   3 files changed, 114 insertions(+), 17 deletions(-)
> 
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index a096d8af21a0..bc7df235281c 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -192,6 +192,8 @@ struct smc_connection {
>   						 * - dec on polled tx cqe
>   						 */
>   	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
> +	atomic_t		tx_pushing;     /* nr_threads trying tx push */
> +
>   	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
>   	u32			tx_off;		/* base offset in peer rmb */
>   
> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
> index 9d5a97168969..2b37bec90824 100644
> --- a/net/smc/smc_cdc.c
> +++ b/net/smc/smc_cdc.c
> @@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
>   		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
>   	}
>   
> -	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
> -	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> -		wake_up(&conn->cdc_pend_tx_wq);
> +	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
> +		/* If this is the last pending WR complete, we must push to
> +		 * prevent hang when autocork enabled.
> +		 */
> +		smc_tx_sndbuf_nonempty(conn);
> +		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
> +			wake_up(&conn->cdc_pend_tx_wq);
> +	}
>   	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
>   
>   	smc_tx_sndbuf_nonfull(smc);
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index 5df3940d4543..bc737ac79805 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -31,6 +31,7 @@
>   #include "smc_tracepoint.h"
>   
>   #define SMC_TX_WORK_DELAY	0
> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)

Probably a matter of taste, but why not use hex here?

>   
>   /***************************** sndbuf producer *******************************/
>   
> @@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
>   static bool smc_tx_is_corked(struct smc_sock *smc)
>   {
>   	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
> -
>   	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
>   }

Can you drop this line elimination?


> +/* If we have pending CDC messages, do not send:
> + * Because CQE of this CDC message will happen shortly, it gives
> + * a chance to coalesce future sendmsg() payload in to one RDMA Write,
> + * without need for a timer, and with no latency trade off.
> + * Algorithm here:
> + *  1. First message should never cork
> + *  2. If we have pending CDC messages, wait for the first
> + *     message's completion
> + *  3. Don't cork to much data in a single RDMA Write to prevent burst,
> + *     total corked message should not exceed min(64k, sendbuf/2)

I assume the 64k is incurred from IP as used by RoCEv2?


> + */
> +static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
> +				int size_goal)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
> +	    smc_tx_prepared_sends(conn) > min(size_goal,
> +					      conn->sndbuf_desc->len >> 1))
> +		return false;
> +	return true;
> +}
> +
> +static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
> +{
> +	struct smc_connection *conn = &smc->conn;
> +
> +	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
> +		return true;

Are there any fixed plans to make SMC_DEFAULT_AUTOCORK dynamic...? 'cause 
otherwise we could simply eliminate this parameter, and use the define within 
smc_should_autocork() instead.


Ciao,
Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16 13:58 ` Stefan Raspl
@ 2022-02-16 15:27   ` dust.li
  2022-02-17  9:37     ` Stefan Raspl
  0 siblings, 1 reply; 16+ messages in thread
From: dust.li @ 2022-02-16 15:27 UTC (permalink / raw)
  To: Stefan Raspl, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>On 2/16/22 04:49, Dust Li wrote:
>> This patch adds autocork support for SMC which could improve
>> throughput for small message by x2 ~ x4.
>> 
>> The main idea is borrowed from TCP autocork with some RDMA
>> specific modification:
>> 1. The first message should never cork to make sure we won't
>>     bring extra latency
>> 2. If we have posted any Tx WRs to the NIC that have not
>>     completed, cork the new messages until:
>>     a) Receive CQE for the last Tx WR
>>     b) We have corked enough message on the connection
>> 3. Try to push the corked data out when we receive CQE of
>>     the last Tx WR to prevent the corked messages hang in
>>     the send queue.
>> 
>> Both SMC autocork and TCP autocork check the TX completion
>> to decide whether we should cork or not. The difference is
>> when we got a SMC Tx WR completion, the data have been confirmed
>> by the RNIC while TCP TX completion just tells us the data
>> have been sent out by the local NIC.
>> 
>> Add an atomic variable tx_pushing in smc_connection to make
>> sure only one can send to let it cork more and save CDC slot.
>> 
>> SMC autocork should not bring extra latency since the first
>> message will always been sent out immediately.
>> 
>> The qperf tcp_bw test shows more than x4 increase under small
>> message size with Mellanox connectX4-Lx, same result with other
>> throughput benchmarks like sockperf/netperf.
>> The qperf tcp_lat test shows SMC autocork has not increase any
>> ping-pong latency.
>> 
>> BW test:
>>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
>> 			-t 30 -vu tcp_bw
>>   server: smc_run taskset -c 1 qperf
>> 
>> MsgSize(Bytes)        TCP         SMC-NoCork           SMC-AutoCork
>>        1         2.57 MB/s     698 KB/s(-73.5%)     2.98 MB/s(16.0% )
>>        2          5.1 MB/s    1.41 MB/s(-72.4%)     5.82 MB/s(14.1% )
>>        4         10.2 MB/s    2.83 MB/s(-72.3%)     11.7 MB/s(14.7% )
>>        8         20.8 MB/s    5.62 MB/s(-73.0%)     22.9 MB/s(10.1% )
>>       16         42.5 MB/s    11.5 MB/s(-72.9%)     45.5 MB/s(7.1%  )
>>       32         80.7 MB/s    22.3 MB/s(-72.4%)     86.7 MB/s(7.4%  )
>>       64          155 MB/s    45.6 MB/s(-70.6%)      160 MB/s(3.2%  )
>>      128          295 MB/s    90.1 MB/s(-69.5%)      273 MB/s(-7.5% )
>>      256          539 MB/s     179 MB/s(-66.8%)      610 MB/s(13.2% )
>>      512          943 MB/s     360 MB/s(-61.8%)     1.02 GB/s(10.8% )
>>     1024         1.58 GB/s     710 MB/s(-56.1%)     1.91 GB/s(20.9% )
>>     2048         2.47 GB/s    1.34 GB/s(-45.7%)     2.92 GB/s(18.2% )
>>     4096         2.86 GB/s     2.5 GB/s(-12.6%)      2.4 GB/s(-16.1%)
>>     8192         3.89 GB/s    3.14 GB/s(-19.3%)     4.05 GB/s(4.1%  )
>>    16384         3.29 GB/s    4.67 GB/s(41.9% )     5.09 GB/s(54.7% )
>>    32768         2.73 GB/s    5.48 GB/s(100.7%)     5.49 GB/s(101.1%)
>>    65536            3 GB/s    4.85 GB/s(61.7% )     5.24 GB/s(74.7% )
>> 
>> Latency test:
>>   client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
>> 			-t 30 -vu tcp_lat
>>   server: smc_run taskset -c 1 qperf
>> 
>>   MsgSize              SMC-NoCork           SMC-AutoCork
>>         1               9.7 us               9.6 us( -1.03%)
>>         2              9.43 us              9.39 us( -0.42%)
>>         4               9.6 us              9.35 us( -2.60%)
>>         8              9.42 us               9.2 us( -2.34%)
>>        16              9.13 us              9.43 us(  3.29%)
>>        32              9.19 us               9.5 us(  3.37%)
>>        64              9.38 us               9.5 us(  1.28%)
>>       128               9.9 us              9.29 us( -6.16%)
>>       256              9.42 us              9.26 us( -1.70%)
>>       512                10 us              9.45 us( -5.50%)
>>      1024              10.4 us               9.6 us( -7.69%)
>>      2048              10.4 us              10.2 us( -1.92%)
>>      4096                11 us              10.5 us( -4.55%)
>>      8192              11.7 us              11.8 us(  0.85%)
>>     16384              14.5 us              14.2 us( -2.07%)
>>     32768              19.4 us              19.3 us( -0.52%)
>>     65536              28.1 us              28.8 us(  2.49%)
>> 
>> With SMC autocork support, we can archive better throughput than
>> TCP in most message sizes without any latency tradeoff.
>> 
>> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
>> ---
>>   net/smc/smc.h     |   2 +
>>   net/smc/smc_cdc.c |  11 +++--
>>   net/smc/smc_tx.c  | 118 ++++++++++++++++++++++++++++++++++++++++------
>>   3 files changed, 114 insertions(+), 17 deletions(-)
>> 
>> diff --git a/net/smc/smc.h b/net/smc/smc.h
>> index a096d8af21a0..bc7df235281c 100644
>> --- a/net/smc/smc.h
>> +++ b/net/smc/smc.h
>> @@ -192,6 +192,8 @@ struct smc_connection {
>>   						 * - dec on polled tx cqe
>>   						 */
>>   	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
>> +	atomic_t		tx_pushing;     /* nr_threads trying tx push */
>> +
>>   	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
>>   	u32			tx_off;		/* base offset in peer rmb */
>> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
>> index 9d5a97168969..2b37bec90824 100644
>> --- a/net/smc/smc_cdc.c
>> +++ b/net/smc/smc_cdc.c
>> @@ -48,9 +48,14 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
>>   		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
>>   	}
>> -	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
>> -	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
>> -		wake_up(&conn->cdc_pend_tx_wq);
>> +	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
>> +		/* If this is the last pending WR complete, we must push to
>> +		 * prevent hang when autocork enabled.
>> +		 */
>> +		smc_tx_sndbuf_nonempty(conn);
>> +		if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
>> +			wake_up(&conn->cdc_pend_tx_wq);
>> +	}
>>   	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
>>   	smc_tx_sndbuf_nonfull(smc);
>> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>> index 5df3940d4543..bc737ac79805 100644
>> --- a/net/smc/smc_tx.c
>> +++ b/net/smc/smc_tx.c
>> @@ -31,6 +31,7 @@
>>   #include "smc_tracepoint.h"
>>   #define SMC_TX_WORK_DELAY	0
>> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
>
>Probably a matter of taste, but why not use hex here?

Yeah, I have no option on this, I will change it in the next version.
But I think it should have no real difference since the compiler
should do the calculation.

>
>>   /***************************** sndbuf producer *******************************/
>> @@ -127,10 +128,52 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
>>   static bool smc_tx_is_corked(struct smc_sock *smc)
>>   {
>>   	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
>> -
>>   	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
>>   }
>
>Can you drop this line elimination?

Sure, Will do it in the next version.

>
>
>> +/* If we have pending CDC messages, do not send:
>> + * Because CQE of this CDC message will happen shortly, it gives
>> + * a chance to coalesce future sendmsg() payload in to one RDMA Write,
>> + * without need for a timer, and with no latency trade off.
>> + * Algorithm here:
>> + *  1. First message should never cork
>> + *  2. If we have pending CDC messages, wait for the first
>> + *     message's completion
>> + *  3. Don't cork to much data in a single RDMA Write to prevent burst,
>> + *     total corked message should not exceed min(64k, sendbuf/2)
>
>I assume the 64k is incurred from IP as used by RoCEv2?

Acturally, 64K is from my previous experience with RoCEv2. We found in our
RoCE environment, When post large RDMA Write request, it is very easy to
trigger ECN/PFC and cause bandwidth drop with incast. And we add some flow
control in the software not to post single RDMA Write with large data to
alleviate that. 64K is a good value we found in our environment.

>
>
>> + */
>> +static bool smc_should_autocork(struct smc_sock *smc, struct msghdr *msg,
>> +				int size_goal)
>> +{
>> +	struct smc_connection *conn = &smc->conn;
>> +
>> +	if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
>> +	    smc_tx_prepared_sends(conn) > min(size_goal,
>> +					      conn->sndbuf_desc->len >> 1))
>> +		return false;
>> +	return true;
>> +}
>> +
>> +static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
>> +{
>> +	struct smc_connection *conn = &smc->conn;
>> +
>> +	if (smc_should_autocork(smc, msg, SMC_DEFAULT_AUTOCORK_SIZE))
>> +		return true;
>
>Are there any fixed plans to make SMC_DEFAULT_AUTOCORK dynamic...? 'cause
>otherwise we could simply eliminate this parameter, and use the define within
>smc_should_autocork() instead.

Yes! Actually I'd like it to be dynamic variable too...

I didn't do it because I also want to add a control switch for the autocork
feature just like TCP. In that case I need to add 2 variables here.
But I found adding dynamic variables using netlink would introduce a lot of
redundant code and may even bring ABI compatibility issues in the future, as
I mentioned here:
https://lore.kernel.org/netdev/20220216114618.GA39286@linux.alibaba.com/T/#mecfcd3f8c816d07dbe35e4748d17008331c89523

I'm not sure that's the right way to do it. In this case, I prefer using
sysctl which I think would be easier, but I would like to listen to your advice.

Thanks

>
>
>Ciao,
>Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-16 15:27   ` dust.li
@ 2022-02-17  9:37     ` Stefan Raspl
  2022-02-17 13:22       ` dust.li
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Raspl @ 2022-02-17  9:37 UTC (permalink / raw)
  To: dust.li, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma

On 2/16/22 16:27, dust.li wrote:
> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>> On 2/16/22 04:49, Dust Li wrote:
>>> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>>> index 5df3940d4543..bc737ac79805 100644
>>> --- a/net/smc/smc_tx.c
>>> +++ b/net/smc/smc_tx.c
>>> @@ -31,6 +31,7 @@
>>>    #include "smc_tracepoint.h"
>>>    #define SMC_TX_WORK_DELAY	0
>>> +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
>>
>> Probably a matter of taste, but why not use hex here?
> 
> Yeah, I have no option on this, I will change it in the next version.
> But I think it should have no real difference since the compiler
> should do the calculation.

Agreed - this is just to make it a tiny bit easier to digest.


>> Are there any fixed plans to make SMC_DEFAULT_AUTOCORK dynamic...? 'cause
>> otherwise we could simply eliminate this parameter, and use the define within
>> smc_should_autocork() instead.
> 
> Yes! Actually I'd like it to be dynamic variable too...
> 
> I didn't do it because I also want to add a control switch for the autocork
> feature just like TCP. In that case I need to add 2 variables here.
> But I found adding dynamic variables using netlink would introduce a lot of
> redundant code and may even bring ABI compatibility issues in the future, as
> I mentioned here:
> https://lore.kernel.org/netdev/20220216114618.GA39286@linux.alibaba.com/T/#mecfcd3f8c816d07dbe35e4748d17008331c89523
> 
> I'm not sure that's the right way to do it. In this case, I prefer using
> sysctl which I think would be easier, but I would like to listen to your advice.

Extending the Netlink interface should be possible without breaking the API - 
we'd be adding further variables, not modifying or removing existing ones.
Conceptually, Netlink is the way to go for any userspace interaction with SMC, 
which includes anything config-related.
Now we understand that cloud workloads are a bit different, and the desire to be 
able to modify the environment of a container while leaving the container image 
unmodified is understandable. But then again, enabling the base image would be 
the cloud way to address this. The question to us is: How do other parts of the 
kernel address this?

Ciao,
Stefan



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-17  9:37     ` Stefan Raspl
@ 2022-02-17 13:22       ` dust.li
  2022-02-17 18:15         ` Hendrik Brueckner
  0 siblings, 1 reply; 16+ messages in thread
From: dust.li @ 2022-02-17 13:22 UTC (permalink / raw)
  To: Stefan Raspl, Karsten Graul, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Feb 17, 2022 at 10:37:28AM +0100, Stefan Raspl wrote:
>On 2/16/22 16:27, dust.li wrote:
>> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>> > On 2/16/22 04:49, Dust Li wrote:
>> > > diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>> > > index 5df3940d4543..bc737ac79805 100644
>> > > --- a/net/smc/smc_tx.c
>> > > +++ b/net/smc/smc_tx.c
>> > > @@ -31,6 +31,7 @@
>> > >    #include "smc_tracepoint.h"
>> > >    #define SMC_TX_WORK_DELAY	0
>> > > +#define SMC_DEFAULT_AUTOCORK_SIZE	(64 * 1024)
>> > 
>> > Probably a matter of taste, but why not use hex here?
>> 
>> Yeah, I have no option on this, I will change it in the next version.
>> But I think it should have no real difference since the compiler
>> should do the calculation.
>
>Agreed - this is just to make it a tiny bit easier to digest.
>
>
>> > Are there any fixed plans to make SMC_DEFAULT_AUTOCORK dynamic...? 'cause
>> > otherwise we could simply eliminate this parameter, and use the define within
>> > smc_should_autocork() instead.
>> 
>> Yes! Actually I'd like it to be dynamic variable too...
>> 
>> I didn't do it because I also want to add a control switch for the autocork
>> feature just like TCP. In that case I need to add 2 variables here.
>> But I found adding dynamic variables using netlink would introduce a lot of
>> redundant code and may even bring ABI compatibility issues in the future, as
>> I mentioned here:
>> https://lore.kernel.org/netdev/20220216114618.GA39286@linux.alibaba.com/T/#mecfcd3f8c816d07dbe35e4748d17008331c89523
>> 
>> I'm not sure that's the right way to do it. In this case, I prefer using
>> sysctl which I think would be easier, but I would like to listen to your advice.
>
>Extending the Netlink interface should be possible without breaking the API -
>we'd be adding further variables, not modifying or removing existing ones.
>Conceptually, Netlink is the way to go for any userspace interaction with
>SMC, which includes anything config-related.


>Now we understand that cloud workloads are a bit different, and the desire to
>be able to modify the environment of a container while leaving the container
>image unmodified is understandable. But then again, enabling the base image
>would be the cloud way to address this. The question to us is: How do other
>parts of the kernel address this?

I'm not familiar with K8S, but from one of my colleague who has worked
in that area tells me for resources like CPU/MEM and configurations
like sysctl, can be set using K8S configuration:
https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/

I don't know. Maybe because most of the current kernel configurations
are configured through sysfs that for those container orchestration
systems have supported it ?

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-17 13:22       ` dust.li
@ 2022-02-17 18:15         ` Hendrik Brueckner
  2022-02-18  7:33           ` dust.li
  0 siblings, 1 reply; 16+ messages in thread
From: Hendrik Brueckner @ 2022-02-17 18:15 UTC (permalink / raw)
  To: dust.li
  Cc: Stefan Raspl, Karsten Graul, Tony Lu, kuba, davem, netdev,
	linux-s390, linux-rdma, Hendrik Brueckner

On Thu, Feb 17, 2022 at 09:22:00PM +0800, dust.li wrote:
> On Thu, Feb 17, 2022 at 10:37:28AM +0100, Stefan Raspl wrote:
> >On 2/16/22 16:27, dust.li wrote:
> >> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
> >> > On 2/16/22 04:49, Dust Li wrote:
> >> >
> 
> >Now we understand that cloud workloads are a bit different, and the desire to
> >be able to modify the environment of a container while leaving the container
> >image unmodified is understandable. But then again, enabling the base image
> >would be the cloud way to address this. The question to us is: How do other
> >parts of the kernel address this?
> 
> I'm not familiar with K8S, but from one of my colleague who has worked
> in that area tells me for resources like CPU/MEM and configurations
> like sysctl, can be set using K8S configuration:
> https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/

For K8s, this involves container engines like cri-o, containerd, podman,
and others towards the runtimes like runc.  To ensure they operate together,
specifications by the Open Container Initiative (OCI) at
https://opencontainers.org/release-notices/overview/

For container/pod deployments, there is especially the Container Runtime
Interface (CRI) that defines the interface, e.g., of K8s to cri-o etc.

CRI includes support for (namespaced) sysctl's:
https://github.com/opencontainers/runtime-spec/releases/tag/v1.0.2

In essence, the CRI spec would allow users to specify/control a specific
runtime for the container in a declarative way w/o modifying the (base)
container images.


Thanks and kind regards,
  Hendrik


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-17 18:15         ` Hendrik Brueckner
@ 2022-02-18  7:33           ` dust.li
  2022-02-18 16:03             ` Karsten Graul
  0 siblings, 1 reply; 16+ messages in thread
From: dust.li @ 2022-02-18  7:33 UTC (permalink / raw)
  To: Hendrik Brueckner
  Cc: Stefan Raspl, Karsten Graul, Tony Lu, kuba, davem, netdev,
	linux-s390, linux-rdma

On Thu, Feb 17, 2022 at 07:15:54PM +0100, Hendrik Brueckner wrote:
>On Thu, Feb 17, 2022 at 09:22:00PM +0800, dust.li wrote:
>> On Thu, Feb 17, 2022 at 10:37:28AM +0100, Stefan Raspl wrote:
>> >On 2/16/22 16:27, dust.li wrote:
>> >> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>> >> > On 2/16/22 04:49, Dust Li wrote:
>> >> >
>> 
>> >Now we understand that cloud workloads are a bit different, and the desire to
>> >be able to modify the environment of a container while leaving the container
>> >image unmodified is understandable. But then again, enabling the base image
>> >would be the cloud way to address this. The question to us is: How do other
>> >parts of the kernel address this?
>> 
>> I'm not familiar with K8S, but from one of my colleague who has worked
>> in that area tells me for resources like CPU/MEM and configurations
>> like sysctl, can be set using K8S configuration:
>> https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
>
>For K8s, this involves container engines like cri-o, containerd, podman,
>and others towards the runtimes like runc.  To ensure they operate together,
>specifications by the Open Container Initiative (OCI) at
>https://opencontainers.org/release-notices/overview/
>
>For container/pod deployments, there is especially the Container Runtime
>Interface (CRI) that defines the interface, e.g., of K8s to cri-o etc.
>
>CRI includes support for (namespaced) sysctl's:
>https://github.com/opencontainers/runtime-spec/releases/tag/v1.0.2
>
>In essence, the CRI spec would allow users to specify/control a specific
>runtime for the container in a declarative way w/o modifying the (base)
>container images.

Thanks a lot for your kind explanation !

After a quick look at the OCI spec, I saw the support for file based
configuration (Including sysfs/procfs etc.). And unfortunately, no
netlink support.


Hi Karsten & Stefan:
Back to the patch itself, do you think I need to add the control switch
now ? Or just leave the switch and fix other issues first ?

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-18  7:33           ` dust.li
@ 2022-02-18 16:03             ` Karsten Graul
  2022-02-18 23:42               ` dust.li
  0 siblings, 1 reply; 16+ messages in thread
From: Karsten Graul @ 2022-02-18 16:03 UTC (permalink / raw)
  To: dust.li, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On 18/02/2022 08:33, dust.li wrote:
> On Thu, Feb 17, 2022 at 07:15:54PM +0100, Hendrik Brueckner wrote:
>> On Thu, Feb 17, 2022 at 09:22:00PM +0800, dust.li wrote:
>>> On Thu, Feb 17, 2022 at 10:37:28AM +0100, Stefan Raspl wrote:
>>>> On 2/16/22 16:27, dust.li wrote:
>>>>> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>>>>>> On 2/16/22 04:49, Dust Li wrote:
>>>>>>
>>>
>>>> Now we understand that cloud workloads are a bit different, and the desire to
>>>> be able to modify the environment of a container while leaving the container
>>>> image unmodified is understandable. But then again, enabling the base image
>>>> would be the cloud way to address this. The question to us is: How do other
>>>> parts of the kernel address this?
>>>
>>> I'm not familiar with K8S, but from one of my colleague who has worked
>>> in that area tells me for resources like CPU/MEM and configurations
>>> like sysctl, can be set using K8S configuration:
>>> https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
>>
>> For K8s, this involves container engines like cri-o, containerd, podman,
>> and others towards the runtimes like runc.  To ensure they operate together,
>> specifications by the Open Container Initiative (OCI) at
>> https://opencontainers.org/release-notices/overview/
>>
>> For container/pod deployments, there is especially the Container Runtime
>> Interface (CRI) that defines the interface, e.g., of K8s to cri-o etc.
>>
>> CRI includes support for (namespaced) sysctl's:
>> https://github.com/opencontainers/runtime-spec/releases/tag/v1.0.2
>>
>> In essence, the CRI spec would allow users to specify/control a specific
>> runtime for the container in a declarative way w/o modifying the (base)
>> container images.
> 
> Thanks a lot for your kind explanation !
> 
> After a quick look at the OCI spec, I saw the support for file based
> configuration (Including sysfs/procfs etc.). And unfortunately, no
> netlink support.
> 
> 
> Hi Karsten & Stefan:
> Back to the patch itself, do you think I need to add the control switch
> now ? Or just leave the switch and fix other issues first ?

Hi, looks like we need more time to evaluate possibilities, so if you have 
additional topics on your desk move on and delay this one.
Right now for me it looks like there is no way to use netlink for container runtime
configuration, which is a pity.
We continue our discussions about this in the team, and also here on the list.

Thank you!



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-18 16:03             ` Karsten Graul
@ 2022-02-18 23:42               ` dust.li
  2022-02-23 18:57                 ` Karsten Graul
  0 siblings, 1 reply; 16+ messages in thread
From: dust.li @ 2022-02-18 23:42 UTC (permalink / raw)
  To: Karsten Graul, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On Fri, Feb 18, 2022 at 05:03:56PM +0100, Karsten Graul wrote:
>On 18/02/2022 08:33, dust.li wrote:
>> On Thu, Feb 17, 2022 at 07:15:54PM +0100, Hendrik Brueckner wrote:
>>> On Thu, Feb 17, 2022 at 09:22:00PM +0800, dust.li wrote:
>>>> On Thu, Feb 17, 2022 at 10:37:28AM +0100, Stefan Raspl wrote:
>>>>> On 2/16/22 16:27, dust.li wrote:
>>>>>> On Wed, Feb 16, 2022 at 02:58:32PM +0100, Stefan Raspl wrote:
>>>>>>> On 2/16/22 04:49, Dust Li wrote:
>>>>>>>
>>>>
>>>>> Now we understand that cloud workloads are a bit different, and the desire to
>>>>> be able to modify the environment of a container while leaving the container
>>>>> image unmodified is understandable. But then again, enabling the base image
>>>>> would be the cloud way to address this. The question to us is: How do other
>>>>> parts of the kernel address this?
>>>>
>>>> I'm not familiar with K8S, but from one of my colleague who has worked
>>>> in that area tells me for resources like CPU/MEM and configurations
>>>> like sysctl, can be set using K8S configuration:
>>>> https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
>>>
>>> For K8s, this involves container engines like cri-o, containerd, podman,
>>> and others towards the runtimes like runc.  To ensure they operate together,
>>> specifications by the Open Container Initiative (OCI) at
>>> https://opencontainers.org/release-notices/overview/
>>>
>>> For container/pod deployments, there is especially the Container Runtime
>>> Interface (CRI) that defines the interface, e.g., of K8s to cri-o etc.
>>>
>>> CRI includes support for (namespaced) sysctl's:
>>> https://github.com/opencontainers/runtime-spec/releases/tag/v1.0.2
>>>
>>> In essence, the CRI spec would allow users to specify/control a specific
>>> runtime for the container in a declarative way w/o modifying the (base)
>>> container images.
>> 
>> Thanks a lot for your kind explanation !
>> 
>> After a quick look at the OCI spec, I saw the support for file based
>> configuration (Including sysfs/procfs etc.). And unfortunately, no
>> netlink support.
>> 
>> 
>> Hi Karsten & Stefan:
>> Back to the patch itself, do you think I need to add the control switch
>> now ? Or just leave the switch and fix other issues first ?
>
>Hi, looks like we need more time to evaluate possibilities, so if you have 
>additional topics on your desk move on and delay this one.

OK, got it.

>Right now for me it looks like there is no way to use netlink for container runtime
>configuration, which is a pity.
>We continue our discussions about this in the team, and also here on the list.

Many thanks for your time on this topic !


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-18 23:42               ` dust.li
@ 2022-02-23 18:57                 ` Karsten Graul
  2022-02-24  2:02                   ` dust.li
  0 siblings, 1 reply; 16+ messages in thread
From: Karsten Graul @ 2022-02-23 18:57 UTC (permalink / raw)
  To: dust.li, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On 19/02/2022 00:42, dust.li wrote:
> On Fri, Feb 18, 2022 at 05:03:56PM +0100, Karsten Graul wrote:
>> Right now for me it looks like there is no way to use netlink for container runtime
>> configuration, which is a pity.
>> We continue our discussions about this in the team, and also here on the list.
> 
> Many thanks for your time on this topic !

We checked more specs (like Container Network Interface (CNI) Specification) 
but all we found uses sysctl at the end. There is lot of infrastructure 
to use sysctls in a container environment.

Establishing netlink-like controls for containers is by far out of our scope, and
would take a long time until it would be available in the popular projects.

So at the moment I see no alternative to an additional sysctl interface in the 
SMC module that provides controls which are useful in container environments.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-23 18:57                 ` Karsten Graul
@ 2022-02-24  2:02                   ` dust.li
  2022-02-25 18:10                     ` Karsten Graul
  0 siblings, 1 reply; 16+ messages in thread
From: dust.li @ 2022-02-24  2:02 UTC (permalink / raw)
  To: Karsten Graul, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Feb 23, 2022 at 07:57:31PM +0100, Karsten Graul wrote:
>On 19/02/2022 00:42, dust.li wrote:
>> On Fri, Feb 18, 2022 at 05:03:56PM +0100, Karsten Graul wrote:
>>> Right now for me it looks like there is no way to use netlink for container runtime
>>> configuration, which is a pity.
>>> We continue our discussions about this in the team, and also here on the list.
>> 
>> Many thanks for your time on this topic !
>
>We checked more specs (like Container Network Interface (CNI) Specification) 
>but all we found uses sysctl at the end. There is lot of infrastructure 
>to use sysctls in a container environment.
>
>Establishing netlink-like controls for containers is by far out of our scope, and
>would take a long time until it would be available in the popular projects.
>
>So at the moment I see no alternative to an additional sysctl interface in the 
>SMC module that provides controls which are useful in container environments.

Got it, I will add sysctl interface and a switch with this function.

Thank again !

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-24  2:02                   ` dust.li
@ 2022-02-25 18:10                     ` Karsten Graul
  2022-02-25 23:42                       ` dust.li
  0 siblings, 1 reply; 16+ messages in thread
From: Karsten Graul @ 2022-02-25 18:10 UTC (permalink / raw)
  To: dust.li, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On 24/02/2022 03:02, dust.li wrote:
> On Wed, Feb 23, 2022 at 07:57:31PM +0100, Karsten Graul wrote:
>> On 19/02/2022 00:42, dust.li wrote:
>>> On Fri, Feb 18, 2022 at 05:03:56PM +0100, Karsten Graul wrote:
>>>> Right now for me it looks like there is no way to use netlink for container runtime
>>>> configuration, which is a pity.
>>>> We continue our discussions about this in the team, and also here on the list.
>>>
>>> Many thanks for your time on this topic !
>>
>> We checked more specs (like Container Network Interface (CNI) Specification) 
>> but all we found uses sysctl at the end. There is lot of infrastructure 
>> to use sysctls in a container environment.
>>
>> Establishing netlink-like controls for containers is by far out of our scope, and
>> would take a long time until it would be available in the popular projects.
>>
>> So at the moment I see no alternative to an additional sysctl interface in the 
>> SMC module that provides controls which are useful in container environments.
> 
> Got it, I will add sysctl interface and a switch with this function.
> 
> Thank again !

Can you explain again why this auto_cork needs a switch to disable it?
My understanding is that this auto_cork makes always sense and is triggered
when there are not enough resources.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] net/smc: Add autocork support
  2022-02-25 18:10                     ` Karsten Graul
@ 2022-02-25 23:42                       ` dust.li
  0 siblings, 0 replies; 16+ messages in thread
From: dust.li @ 2022-02-25 23:42 UTC (permalink / raw)
  To: Karsten Graul, Hendrik Brueckner
  Cc: Stefan Raspl, Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma

On Fri, Feb 25, 2022 at 07:10:40PM +0100, Karsten Graul wrote:
>On 24/02/2022 03:02, dust.li wrote:
>> On Wed, Feb 23, 2022 at 07:57:31PM +0100, Karsten Graul wrote:
>>> On 19/02/2022 00:42, dust.li wrote:
>>>> On Fri, Feb 18, 2022 at 05:03:56PM +0100, Karsten Graul wrote:
>>>>> Right now for me it looks like there is no way to use netlink for container runtime
>>>>> configuration, which is a pity.
>>>>> We continue our discussions about this in the team, and also here on the list.
>>>>
>>>> Many thanks for your time on this topic !
>>>
>>> We checked more specs (like Container Network Interface (CNI) Specification) 
>>> but all we found uses sysctl at the end. There is lot of infrastructure 
>>> to use sysctls in a container environment.
>>>
>>> Establishing netlink-like controls for containers is by far out of our scope, and
>>> would take a long time until it would be available in the popular projects.
>>>
>>> So at the moment I see no alternative to an additional sysctl interface in the 
>>> SMC module that provides controls which are useful in container environments.
>> 
>> Got it, I will add sysctl interface and a switch with this function.
>> 
>> Thank again !
>
>Can you explain again why this auto_cork needs a switch to disable it?
>My understanding is that this auto_cork makes always sense and is triggered
>when there are not enough resources.

My initial intention to provide a switch is to be like TCP to let user
to disable it. For user cases like debug and workaround bugs if it is
associated with auto cork, or compare performance like I did (But this
should not be a real world case in production environment).

But after Stefan suggested that we make the auto corked size turnable,
I realized that we can only need one sysctl switch: which tunes the auto
corked bytes size. Disable auto cork can be archived by setting this to 0.

Something like this bellow:
static bool smc_should_autocork(struct smc_sock *smc)
{
        struct smc_connection *conn = &smc->conn;
        int corking_size;

        corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
                           conn->sndbuf_desc->len >> 1);

        if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
            smc_tx_prepared_sends(conn) > corking_size)
                return false;
        return true;
}

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-02-25 23:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-16  3:49 [PATCH] net/smc: Add autocork support Dust Li
2022-02-16  3:53 ` dust.li
2022-02-16 10:32 ` Karsten Graul
2022-02-16 10:55   ` dust.li
2022-02-16 13:58 ` Stefan Raspl
2022-02-16 15:27   ` dust.li
2022-02-17  9:37     ` Stefan Raspl
2022-02-17 13:22       ` dust.li
2022-02-17 18:15         ` Hendrik Brueckner
2022-02-18  7:33           ` dust.li
2022-02-18 16:03             ` Karsten Graul
2022-02-18 23:42               ` dust.li
2022-02-23 18:57                 ` Karsten Graul
2022-02-24  2:02                   ` dust.li
2022-02-25 18:10                     ` Karsten Graul
2022-02-25 23:42                       ` dust.li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.