* Re: SMC-R throughput drops for specific message sizes
@ 2024-02-01 13:50 Iordache Costin (XC-AS/EAE-UK)
  2024-02-05  3:50 ` Wen Gu
  0 siblings, 1 reply; 9+ messages in thread

From: Iordache Costin (XC-AS/EAE-UK) @ 2024-02-01 13:50 UTC (permalink / raw)
To: Wen Gu, Gerd Bayer, D. Wythe, Tony Lu, Nils Hoppmann
Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li

[-- Attachment #1: Type: text/plain, Size: 2462 bytes --]

Hi,

This is Costin, Alex's colleague. We have additional updates which we thought would be helpful to share with the community.

Brief reminder, our hardware/software context is as follows:
- 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
- only one HCA port is active/connected on each side (QSFP28 cable)
- max HCA throughput: 25 Gbps ≈ 3.12 GB/s
- max/active MTU: 4096
- kernel: 6.5.0-14
- benchmarking tool: qperf 0.4.11

Our goal has been to gauge the benefits of SMC-R vs TCP/IP. We are particularly interested in maximizing throughput whilst reducing CPU utilisation and DRAM memory bandwidth for large data transfers (> 2MB).

Our main issue so far has been that SMC-R halves the throughput for some specific message sizes (as opposed to TCP/IP) - see the "SMC-R vs TCP" plot.

Since our last post the kernel was upgraded from 5.4 to 6.5.0-14 in the hope that it would alleviate the throughput drops, but it did not, so we bit the bullet and delved into the SMC-R code.

The SMC-R source code revealed that the __smc_buf_create / smc_compress_bufsize functions are in charge of computing the size of the RMB buffer and allocating either physically or virtually contiguous memory. We suspected that the throughput drops were caused by this buffer being too small.

We set out to determine whether there is a correlation between the drops and the size of the RMB buffer; to that end we set the RMB buffer size to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput over different message size ranges.
The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the RMB buffer. Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real root cause might lie elsewhere in the SMC-R code. We suspect that, for reasons unknown to us, the CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances.

cheers,
Costin.

PS. for the sake of brevity many details have been omitted on purpose, but we'd be happy to provide them if need be. E.g.: by default the RMB buffer size is capped at 512KB, so we removed the cap and recompiled the SMC module; we also benchmarked with alternative tools such as iperf and iperf3 to rule out the possibility that the drops are tool-specific; corking has been disabled; etc.

[-- Attachment #2: smc_qperf_merged-4.jpg --]
[-- Type: image/jpeg, Size: 291219 bytes --]

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: SMC-R throughput drops for specific message sizes
  2024-02-01 13:50 SMC-R throughput drops for specific message sizes Iordache Costin (XC-AS/EAE-UK)
@ 2024-02-05  3:50 ` Wen Gu
  2024-02-19  8:44 ` Wen Gu
  0 siblings, 1 reply; 9+ messages in thread

From: Wen Gu @ 2024-02-05 3:50 UTC (permalink / raw)
To: Iordache Costin (XC-AS/EAE-UK), Gerd Bayer, D. Wythe, Tony Lu, Nils Hoppmann
Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li

[-- Attachment #1: Type: text/plain, Size: 2828 bytes --]

On 2024/2/1 21:50, Iordache Costin (XC-AS/EAE-UK) wrote:
> Hi,
>
> This is Costin, Alex's colleague. We've got additional updates which we thought would be helpful to share with the community.
>
> Brief reminder, our hardware/software context is as follows:
> - 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
> - only one HCA port is active/connected on each side (QSFP28 cable)
> - max HCA throughput: 25Gbps ~ 3.12GBs.
> - max/active MTU: 4096
> - kernel: 6.5.0-14
> - benchmarking tool: qperf 0.4.11
>
> Our goal has been to gauge the SMC-R benefits vs TCP/IP. We are particularly interested in maximizing the throughput whilst reducing CPU utilisation and DRAM memory bandwidth for large data (> 2MB) transfers.
>
> Our main issue so far has been SMC-R halves the throughput for some specific message sizes (as opposed to TCP/IP) - see "SMC-R vs TCP" plot.
>
> Since our last post the kernel was upgraded from 5.4 to 6.5.0-14 hoping it would alleviate the throughput drops, but it did not, so we bit the bullet and delved into the SMC-R code.
>

Hi Costin,

FYI, I have also reproduced this in my environment (see attached), with Linux 6.8-rc1:

- 2 VMs, each with 2 passthrough ConnectX-4.
- kernel: Linux 6.8-rc1
- benchmarking tool: qperf 0.4.11

But it might take me some time to dive deeper into what exactly happened.
> The SMC-R source code revealed that __smc_buf_create / smc_compress_bufsize functions are in charge of computing the size of the RMB buffer and allocating either physical or virtual contiguous memory. We suspected that the throughput drops were caused by the size of this buffer being too small.
>
> We set out to determine whether there is a correlation between the drops and the size of the RMB buffer, and for that we set the size of the RMB buffer to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput for different message size ranges.
>
> The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the RMB buffer. Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real root of the drops might lie somewhere else in the SMC-R code. We are suspecting that, for reasons unknown to us, the CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances.
>
> cheers,
> Costin.
>
> PS. for the sake of brevity many details have been omitted on purpose but we'd be happy to provide them if need be, e.g. by default the RMB buffer size is capped to 512KB so we remove the cap and recompile the SMC module; we use alternative tools such as iperf and iperf3 for benchmarking to dismiss the possibility of the drops to be tool specific; corking has been disabled; etc.

[-- Attachment #2: RMB-64KB.png --]
[-- Type: image/png, Size: 56541 bytes --]

[-- Attachment #3: RMB-128KB.png --]
[-- Type: image/png, Size: 50514 bytes --]

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: SMC-R throughput drops for specific message sizes
  2024-02-05  3:50 ` Wen Gu
@ 2024-02-19  8:44 ` Wen Gu
  2024-02-27 11:28 ` Iordache Costin (XC-AS/EAE-UK)
  0 siblings, 1 reply; 9+ messages in thread

From: Wen Gu @ 2024-02-19 8:44 UTC (permalink / raw)
To: Iordache Costin (XC-AS/EAE-UK), Gerd Bayer, D. Wythe, Tony Lu, Nils Hoppmann
Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li, Alexandra Winter

[-- Attachment #1: Type: text/plain, Size: 15782 bytes --]

On 2024/2/5 11:50, Wen Gu wrote:
>
>
> On 2024/2/1 21:50, Iordache Costin (XC-AS/EAE-UK) wrote:
>> Hi,
>>
>> This is Costin, Alex's colleague. We've got additional updates which we thought would be helpful to share with the community.
>>
>> Brief reminder, our hardware/software context is as follows:
>> - 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
>> - only one HCA port is active/connected on each side (QSFP28 cable)
>> - max HCA throughput: 25Gbps ~ 3.12GBs.
>> - max/active MTU: 4096
>> - kernel: 6.5.0-14
>> - benchmarking tool: qperf 0.4.11
>>
>> Our goal has been to gauge the SMC-R benefits vs TCP/IP. We are particularly interested in maximizing the throughput whilst reducing CPU utilisation and DRAM memory bandwidth for large data (> 2MB) transfers.
>>
>> Our main issue so far has been SMC-R halves the throughput for some specific message sizes (as opposed to TCP/IP) - see "SMC-R vs TCP" plot.
>>
>> Since our last post the kernel was upgraded from 5.4 to 6.5.0-14 hoping it would alleviate the throughput drops, but it did not, so we bit the bullet and delved into the SMC-R code.
>>
>
> Hi Costin,
>
> FYI, I have also reproduced this in my environment (see attached), with Linux 6.8-rc1
>
> - 2 VMs, each with 2 passthrough ConnectX-4.
> - kernel: Linux 6.8-rc1
> - benchmarking tool: qperf 0.4.11
>
> But it might take me some time to dive deeper into what exactly happened.
>

Hi Costin and community,

I would like to share some findings with you.
The performance drop might be related to the SMC window size update strategy. For certain message sizes, RMB utilization may drop and transmission becomes inefficient.

The SMC window size update strategy is described on page 63 of RFC 7609 (https://datatracker.ietf.org/doc/html/rfc7609#page-63), and the Linux implementation is smc_tx_consumer_update() in net/smc/smc_tx.c:

"
   The current window size (from a sender's perspective) is less than
   half of the receive buffer space, and the consumer cursor update
   will result in a minimum increase in the window size of 10% of the
   receive buffer space.
"

void smc_tx_consumer_update(struct smc_connection *conn, bool force)
{
	<...>
	if (conn->local_rx_ctrl.prod_flags.cons_curs_upd_req || force ||
	    ((to_confirm > conn->rmbe_update_limit) &&
	     ((sender_free <= (conn->rmb_desc->len / 2)) ||
	      conn->local_rx_ctrl.prod_flags.write_blocked))) {
		if (conn->killed ||
		    conn->local_rx_ctrl.conn_state_flags.peer_conn_abort)
			return;
		if ((smc_cdc_get_slot_and_msg_send(conn) < 0) && /* <- update the window */
		    !conn->killed) {
			queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work,
					   SMC_TX_WORK_DELAY);
			return;
		}
	}
	<...>
}

# Test results

In my qperf test, one of the performance drops can be observed when msgsize changes from (90% * RMB - 1) to (90% * RMB).
Take 128KB sndbuf and RMB as an example:

# sysctl net.smc.wmem=131072    # both sides
# sysctl net.smc.rmem=131072

10% * RMB is 13107 bytes, so compare the qperf results when msgsize is 117964 bytes (128KB - 13107B - 1) and 117965 bytes (128KB - 13107B):

- msgsize 117964

# smc_run taskset -c <serv_cpu> qperf
# smc_run taskset -c <clnt_cpu> qperf <serv_ip> -m 117964 -t 10 -vu tcp_bw
tcp_bw:
    bw        =  4.12 GB/sec
    msg_size  =   118 KB
    time      =    10 sec
CPU:
    %serv_cpu : 1.8 us, 54.6 sy, 0.0 ni, 42.9 id, 0.0 wa, 0.7 hi, 0.0 si, 0.0 st
    %clnt_cpu : 2.5 us, 40.0 sy, 0.0 ni, 57.1 id, 0.0 wa, 0.4 hi, 0.0 si, 0.0 st

- msgsize 117965

# smc_run taskset -c <serv_cpu> qperf
# smc_run taskset -c <clnt_cpu> qperf <serv_ip> -m 117965 -t 10 -vu tcp_bw
tcp_bw:
    bw        =  2.86 GB/sec
    msg_size  =   118 KB
    time      =    10 sec
CPU:
    %serv_cpu : 1.7 us, 30.0 sy, 0.0 ni, 68.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
    %clnt_cpu : 1.0 us, 23.1 sy, 0.0 ni, 75.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

The bandwidth drop is obvious and the CPU utilization is relatively low.
# Analysis

I traced the copylen in smc_tx_sendmsg and smc_rx_recvmsg with bpftrace:

tracepoint:smc:smc_tx_sendmsg { @tx_len = hist(args->len); }
tracepoint:smc:smc_rx_recvmsg { @rx_len = hist(args->len); }

- msgsize 117964

@rx_len:
[4, 8)             1 |                                                    |
[8, 16)        24184 |@@                                                  |
[16, 32)          40 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     177963 |@@@@@@@@@@@@@@@@@@                                  |
[16K, 32K)    212122 |@@@@@@@@@@@@@@@@@@@@@@                              |
[32K, 64K)    490755 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)   111529 |@@@@@@@@@@@                                         |

@tx_len:
[8, 16)        30768 |@@@                                                 |
[16, 32)          61 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     138977 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)    265929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[32K, 64K)    469183 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)   111082 |@@@@@@@@@@@@                                        |

- msgsize 117965

@rx_len:
[4, 8)             1 |                                                    |
[8, 16)            0 |                                                    |
[16, 32)           2 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     237415 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         0 |                                                    |
[32K, 64K)         0 |                                                    |
[64K, 128K)   237389 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |

@tx_len:
[16, 32)           2 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)         22 |                                                    |
[16K, 32K)    237357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K)        23 |                                                    |
[64K, 128K)   237390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

According to the trace results, when msgsize is 117964 (good), the Tx and Rx data sizes are distributed between 8K and 128K, but when msgsize is 117965 (bad), the Tx and Rx data sizes are concentrated in the two intervals [16K, 32K) and [64K, 128K).
To further see what's going on, I wrote a simple reproducer (see server.c and client.c in the attachments; they transfer 10 messages of a given size and act like qperf tcp_bw) and added some debug information to the SMC kernel code (see 0001-debug-record-cursor-update-process.patch in the attachments). Then I ran the reproducer and checked dmesg:

# smc_run ./server -p <serv_port> -m <msgsize: 117964 or 117965>
# smc_run ./client -i <serv_ip> -p <serv_port> -m <msgsize: 117964 or 117965>

From the dmesg debug output (see dmesg.txt in the attachments) we can recover how data is produced and consumed in the RMB when msgsize is 117964 or 117965 (see diagram.svg in the attachments; it is drawn from the server-side dmesg output).

(Please check the above diagram and dmesg information first.)

When msgsize is 117964 (good), each time the server consumes data in the RMB, the window size is updated to the client in time (the cfed cursor in Linux), because the amount of data consumed is always larger than 10% * RMB, so new data can be produced into the RMB.

However, when msgsize is 117965 (bad), the window is only updated once the entire 117965 bytes have been consumed, since the first part of each 117965-byte message is never larger than 10% * RMB, i.e. 13107 bytes (see diagram.svg for details). This results in low utilization of the RMB.

To verify this, I made a simple modification to the SMC kernel code, forcing the receiver to always update the window size. (This is far from a solution to the periodic drop issue, though: the SMC window cannot always be updated, for silly window syndrome avoidance.)
" diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c index 9a2f3638d161..6a4d8041d6b5 100644 --- a/net/smc/smc_rx.c +++ b/net/smc/smc_rx.c @@ -61,7 +61,7 @@ static int smc_rx_update_consumer(struct smc_sock *smc, { struct smc_connection *conn = &smc->conn; struct sock *sk = &smc->sk; - bool force = false; + bool force = true; int diff, rc = 0; smc_curs_add(conn->rmb_desc->len, &cons, len); " Then run qperf test under message size of 117964 and 117965: # smc_run taskset -c <cpu> qperf <ip> -m 117964 -t 10 -vu tcp_bw tcp_bw: bw = 4.13 GB/sec msg_size = 118 KB time = 10 sec # smc_run taskset -c <cpu> qperf <ip> -m 117965 -t 10 -vu tcp_bw tcp_bw: bw = 4.18 GB/sec msg_size = 118 KB time = 10 sec The performance drop disapper, and copylen tracing results under the two msgsize are similar: - msgsize 117964 @rx_len: [4, 8) 1 | | [8, 16) 7521 | | [16, 32) 422 | | [32, 64) 0 | | [64, 128) 0 | | [128, 256) 1 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 225144 |@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 311010 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 571816 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 62061 |@@@@@ | - msgsize 117965 @rx_len: [2, 4) 54360 |@@@@ | [4, 8) 1025 | | [8, 16) 2 | | [16, 32) 2 | | [32, 64) 0 | | [64, 128) 0 | | [128, 256) 1 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 214217 |@@@@@@@@@@@@@@@@@@ | [16K, 32K) 288575 |@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 593644 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 50187 |@@@@ | Hope the above analysis can be of some help for you. Thanks, Wen Gu >> The SMC-R source code revealed that __smc_buf_create / smc_compress_bufsize functions are in charge of computing the >> size of the RMB buffer and allocating either physical or virtual contiguous memory. We suspected that the throughput >> drops were caused by the size of this buffer being too small. 
>>
>> We set out to determine whether there is a correlation between the drops and the size of the RMB buffer, and for that we set the size of the RMB buffer to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput for different message size ranges.
>>
>> The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the RMB buffer. Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real root of the drops might lie somewhere else in the SMC-R code. We are suspecting that, for reasons unknown to us, the CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances.
>>
>> cheers,
>> Costin.
>>
>> PS. for the sake of brevity many details have been omitted on purpose but we'd be happy to provide them if need be, e.g. by default the RMB buffer size is capped to 512KB so we remove the cap and recompile the SMC module; we use alternative tools such as iperf and iperf3 for benchmarking to dismiss the possibility of the drops to be tool specific; corking has been disabled; etc.
[-- Attachment #2: 0001-debug-record-cursor-update-process.patch --]
[-- Type: text/plain, Size: 2950 bytes --]

From 5a871abe32fb493b06b63f2ef75fd2686881f2f3 Mon Sep 17 00:00:00 2001
From: Wen Gu <guwen@linux.alibaba.com>
Date: Sun, 18 Feb 2024 14:51:37 +0800
Subject: [PATCH] debug: record cursor update process

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_cdc.c | 8 ++++++++
 net/smc/smc_rx.c  | 3 +++
 net/smc/smc_tx.c  | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index e938fe3bcc7c..da08c578546f 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -369,6 +369,10 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 			smc_tx_sndbuf_nonfull(smc);
 	}
 
+		pr_warn("%s: peer consumed %d, cons_cur {%d:%d}, peer_rmbe_space %d\n",
+			__func__, diff_cons, conn->local_rx_ctrl.cons.wrap,
+			conn->local_rx_ctrl.cons.count,
+			atomic_read(&conn->peer_rmbe_space));
 	}
 
 	diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
@@ -381,6 +385,10 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 		atomic_add(diff_prod, &conn->bytes_to_rcv);
 		/* guarantee 0 <= bytes_to_rcv <= rmb_desc->len */
 		smp_mb__after_atomic();
+		pr_warn("%s: peer produced %d, prod_cur {%d:%d}, bytes_to_recv %d\n",
+			__func__, diff_prod, conn->local_rx_ctrl.prod.wrap,
+			conn->local_rx_ctrl.prod.count,
+			atomic_read(&conn->bytes_to_rcv));
 		smc->sk.sk_data_ready(&smc->sk);
 	} else {
 		if (conn->local_rx_ctrl.prod_flags.write_blocked)
diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
index 9a2f3638d161..cbb42a203370 100644
--- a/net/smc/smc_rx.c
+++ b/net/smc/smc_rx.c
@@ -89,6 +89,9 @@ static int smc_rx_update_consumer(struct smc_sock *smc,
 	}
 
 	smc_curs_copy(&conn->local_tx_ctrl.cons, &cons, conn);
+	pr_warn("%s: local consumed %ld B, cons_cur {%d:%d}, bytes_to_recv %d\n",
+		__func__, len, cons.wrap, cons.count,
+		atomic_read(&conn->bytes_to_rcv));
 
 	/* send consumer cursor update if required */
 	/* similar to advertising new TCP rcv_wnd if required */
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 214ac3cbcf9a..f290777d4a43 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -538,6 +538,10 @@ static int smc_tx_rdma_writes(struct smc_connection *conn,
 	/* dst: peer RMBE */
 	smc_curs_copy(&conn->tx_curs_sent, &sent, conn);/* src: local sndbuf */
+	pr_warn("%s: local produced %ld, prod {%d:%d}, peer_rmbe_space %d\n",
+		__func__, len, conn->local_tx_ctrl.prod.wrap,
+		conn->local_tx_ctrl.prod.count,
+		atomic_read(&conn->peer_rmbe_space));
 
 	return 0;
 }
@@ -713,6 +717,8 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 		if (conn->killed ||
 		    conn->local_rx_ctrl.conn_state_flags.peer_conn_abort)
 			return;
+		pr_warn("%s: notify RMB wnd increase %d, cons_cur {%d:%d}\n",
+			__func__, to_confirm, cons.wrap, cons.count);
 		if ((smc_cdc_get_slot_and_msg_send(conn) < 0) &&
 		    !conn->killed) {
 			queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work,
-- 
2.32.0.3.g01195cf9f

[-- Attachment #3: server.c --]
[-- Type: text/plain, Size: 2379 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <stdbool.h>
#include <netinet/tcp.h>
#include <pthread.h>

#ifndef AF_SMC
#define AF_SMC 43
#endif

#define NET_PROTOCAL AF_INET
#define SERV_IP INADDR_ANY
#define SERV_PORT 10012
#define BUF_SIZE (5 * 128 * 1024)

int stream_recv(int fd, char *buf, int msgsize)
{
	int n = msgsize;

	while (n) {
		int i = read(fd, buf, n);

		if (i < 0)
			return i;
		buf += i;
		n -= i;
		if (i == 0)
			break;
		printf("Successfully recv %d B message\n", i);
	}
	return msgsize - n;
}

int net_serv(int port, int msgsize)
{
	int count, recv;

	if (!port)
		port = SERV_PORT;

	int l_sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);
	struct sockaddr_in s_addr;

	memset(&s_addr, 0, sizeof(struct sockaddr_in));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = SERV_IP;
	s_addr.sin_port = htons(port);

	// bind listen socket
	if (bind(l_sock, (struct sockaddr *)&s_addr, sizeof(s_addr))) {
		printf("bind listen socket error %d\n", errno);
		return 0;
	}
	// listen
	if (listen(l_sock, 20)) {
		printf("listen error\n");
		return 0;
	}

	struct sockaddr_in c_addr;
	socklen_t c_addr_len = sizeof(c_addr);
	int s_sock = accept(l_sock, (struct sockaddr *)&c_addr, &c_addr_len);

	if (s_sock < 0) {
		printf("accept fail\n");
		return 0;
	} else {
		char ip[16] = { 0 };

		inet_ntop(NET_PROTOCAL, &(c_addr.sin_addr), ip, INET_ADDRSTRLEN);
		printf("accept connection: ip %s port %d\n", ip, c_addr.sin_port);
	}

	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);

	while (1) {
		if (msgsize > BUF_SIZE)
			break;
		printf("Recv msgsize: %d\n", msgsize);
		recv = stream_recv(s_sock, buf, msgsize);
		if (recv <= 0) {
			if (recv)
				printf("Error recv %d\n", recv);
			break;
		}
	}
	close(s_sock);
	close(l_sock);
	return 0;
}

int main(int argc, char **argv)
{
	bool wrong_param = false;
	int msgsize = BUF_SIZE;
	int port = 0;
	int c;

	while (!wrong_param && (-1 != (c = getopt(argc, argv, "p:m:")))) {
		switch (c) {
		case 'p':
			port = atoi(optarg);
			break;
		case 'm':
			msgsize = atoi(optarg);
			break;
		case '?':
			printf("usage: ./server -p <port> -m <msgsize>\n");
			wrong_param = true;
			break;
		}
	}
	if (!wrong_param)
		net_serv(port, msgsize);
	return 0;
}

[-- Attachment #4: client.c --]
[-- Type: text/plain, Size: 2048 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <errno.h>
#include <netinet/tcp.h>

#ifndef AF_SMC
#define AF_SMC 43
#endif

#define NET_PROTOCAL AF_INET
#define SERV_IP "11.213.5.33"
#define SERV_PORT 10012
#define BUF_SIZE (5 * 128 * 1024)

int stream_send(int fd, char *buf, int msgsize)
{
	int n = msgsize;

	while (n) {
		int i = write(fd, buf, n);

		if (i < 0)
			return i;
		buf += i;
		n -= i;
		if (i == 0)
			break;
		printf("Successfully send %d B message\n", i);
	}
	return msgsize - n;
}

int net_clnt(char *ip, int port, int msgsize, int msgcnt)
{
	int count = 0, send;
	char c;

	if (!ip)
		ip = SERV_IP;
	if (!port)
		port = SERV_PORT;

	int sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);
	struct sockaddr_in s_addr;

	memset(&s_addr, 0, sizeof(s_addr));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = inet_addr(ip);
	s_addr.sin_port = htons(port);

	if (connect(sock, (struct sockaddr *)&s_addr, sizeof(s_addr))) {
		printf("connect fail\n");
		return 0;
	}

	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);

	while (--msgcnt) {
		if (msgsize > BUF_SIZE)
			break;
		printf("Send msgsize: %d\n", msgsize);
		send = stream_send(sock, buf, msgsize);
		if (send <= 0) {
			printf("Error send %d\n", send);
			break;
		}
	}
	close(sock);
}

int main(int argc, char **argv)
{
	int msgsize = BUF_SIZE, msgcnt = 10;
	char *ip = NULL;
	bool wrong_param = false;
	int port = 0;
	int c;

	while (!wrong_param && (-1 != (c = getopt(argc, argv, "i:p:m:c:")))) {
		switch (c) {
		case 'i':
			ip = optarg;
			break;
		case 'p':
			port = atoi(optarg);
			break;
		case 'm':
			msgsize = atoi(optarg);
			break;
		case 'c':
			msgcnt = atoi(optarg);
			break;
		case '?':
			printf("usage: ./client -i <ip> -p <port> -m <msgsize> -c <cnt>\n");
			wrong_param = true;
			break;
		}
	}
	if (!wrong_param)
		net_clnt(ip, port, msgsize, msgcnt);
	return 0;
}

[-- Attachment #5: diagram.svg --]
[-- Type: image/svg+xml, Size: 534635 bytes --]

[-- Attachment #6: dmesg.txt --]
[-- Type: text/plain, Size: 13445 bytes --]

====== msgsize 117964 =======
server:
[ +0.001754] smc_cdc_msg_recv_action: peer produced 117964, prod_cur {0:117964}, bytes_to_recv 117964
[ +0.000850] smc_cdc_msg_recv_action: peer produced 13108, prod_cur {1:0}, bytes_to_recv 131072
[ +0.001020] smc_rx_update_consumer: local consumed 117964 B, cons_cur {0:117964}, bytes_to_recv 13108
[ +0.000791] smc_tx_consumer_update: notify RMB wnd increase 117964, cons_cur {0:117964}
[ +0.000713] smc_rx_update_consumer: local consumed 13108 B, cons_cur {1:0}, bytes_to_recv 0
[ +0.000760] smc_tx_consumer_update: notify RMB wnd increase 13108, cons_cur {1:0}
[ +0.000827] smc_cdc_msg_recv_action: peer produced 117964, prod_cur {1:117964}, bytes_to_recv 117964
[ +0.000847] smc_rx_update_consumer: local consumed 104856 B, cons_cur {1:104856}, bytes_to_recv 13108
[ +0.000630] smc_cdc_msg_recv_action: peer produced 13108, prod_cur {2:0}, bytes_to_recv 26216
[ +0.000203] smc_tx_consumer_update: notify RMB wnd increase 104856, cons_cur {1:104856}
[ +0.001445] smc_rx_update_consumer: local consumed 26216 B, cons_cur {2:0}, bytes_to_recv 0
[ +0.000733] smc_tx_consumer_update: notify RMB wnd increase 26216, cons_cur {2:0}
[ +0.000862] smc_cdc_msg_recv_action: peer produced 104856, prod_cur {2:104856}, bytes_to_recv 104856
[ +0.000845] smc_rx_update_consumer: local consumed 91748 B, cons_cur {2:91748}, bytes_to_recv 13108
[ +0.000718] smc_cdc_msg_recv_action: peer produced 26216, prod_cur {3:0}, bytes_to_recv 39324
[ +0.000190] smc_tx_consumer_update: notify RMB wnd increase 91748, cons_cur {2:91748}
[ +0.001414] smc_rx_update_consumer: local consumed 39324 B, cons_cur {3:0}, bytes_to_recv 0
[ +0.000763] smc_tx_consumer_update: notify RMB wnd increase 39324, cons_cur {3:0}
[ +0.000806] smc_cdc_msg_recv_action: peer produced 91748, prod_cur {3:91748}, bytes_to_recv 91748
[ +0.000832] smc_rx_update_consumer: local consumed 78640 B, cons_cur {3:78640}, bytes_to_recv 13108
[ +0.000662] smc_cdc_msg_recv_action: peer produced 39324, prod_cur {4:0}, bytes_to_recv 52432
[ +0.000170] smc_tx_consumer_update: notify RMB wnd increase 78640, cons_cur {3:78640}
[ +0.001397] smc_rx_update_consumer: local consumed 52432 B, cons_cur {4:0}, bytes_to_recv 0
[ +0.000753] smc_tx_consumer_update: notify RMB wnd increase 52432, cons_cur {4:0}
[ +0.000818] smc_cdc_msg_recv_action: peer produced 78640, prod_cur {4:78640}, bytes_to_recv 78640
[ +0.000821] smc_rx_update_consumer: local consumed 65532 B, cons_cur {4:65532}, bytes_to_recv 13108
[ +0.000662] smc_cdc_msg_recv_action: peer produced 52432, prod_cur {5:0}, bytes_to_recv 65540
[ +0.000155] smc_tx_consumer_update: notify RMB wnd increase 65532, cons_cur {4:65532}
[ +0.001402] smc_rx_update_consumer: local consumed 65540 B, cons_cur {5:0}, bytes_to_recv 0
[ +0.000754] smc_tx_consumer_update: notify RMB wnd increase 65540, cons_cur {5:0}
[ +0.000802] smc_cdc_msg_recv_action: peer produced 65532, prod_cur {5:65532}, bytes_to_recv 65532
[ +0.000855] smc_rx_update_consumer: local consumed 52424 B, cons_cur {5:52424}, bytes_to_recv 13108
[ +0.000651] smc_cdc_msg_recv_action: peer produced 65540, prod_cur {6:0}, bytes_to_recv 78648
[ +0.000179] smc_tx_consumer_update: notify RMB wnd increase 52424, cons_cur {5:52424}
[ +0.001512] smc_rx_update_consumer: local consumed 78648 B, cons_cur {6:0}, bytes_to_recv 0
[ +0.000051] smc_cdc_msg_recv_action: peer produced 52424, prod_cur {6:52424}, bytes_to_recv 52424
[ +0.000296] smc_tx_consumer_update: notify RMB wnd increase 78648, cons_cur {6:0}
[ +0.002754] smc_cdc_msg_recv_action: peer produced 78648, prod_cur {7:0}, bytes_to_recv 131072
[ +0.001050] smc_rx_update_consumer: local consumed 39316 B, cons_cur {6:39316}, bytes_to_recv 91756
[ +0.001071] smc_tx_consumer_update: notify RMB wnd increase 39316, cons_cur {6:39316}
[ +0.000994] smc_rx_update_consumer: local consumed 91756 B, cons_cur {7:0}, bytes_to_recv 0
[ +0.001046] smc_tx_consumer_update: notify RMB wnd increase 91756, cons_cur {7:0}
[ +0.000507] smc_cdc_msg_recv_action: peer produced 39316, prod_cur {7:39316}, bytes_to_recv 39316
[ +0.001617] smc_rx_update_consumer: local consumed 26208 B, cons_cur {7:26208}, bytes_to_recv 13108
[ +0.000314] smc_cdc_msg_recv_action: peer produced 91756, prod_cur {8:0}, bytes_to_recv 104864
[ +0.000144] smc_tx_consumer_update: notify RMB wnd increase 26208, cons_cur {7:26208}
[ +0.004014] smc_rx_update_consumer: local consumed 104864 B, cons_cur {8:0}, bytes_to_recv 0
[ +0.001536] smc_cdc_msg_recv_action: peer produced 13100, prod_cur {8:13100}, bytes_to_recv 13100
[ +0.000158] smc_tx_consumer_update: notify RMB wnd increase 104864, cons_cur {8:0}
[ +0.003356] smc_rx_update_consumer: local consumed 13100 B, cons_cur {8:13100}, bytes_to_recv 0

client:
[ +0.000996] smc_tx_rdma_writes: local produced 117964, prod {0:117964}, peer_rmbe_space 13108
[ +0.000835] smc_tx_rdma_writes: local produced 13108, prod {1:0}, peer_rmbe_space 0
[ +0.003357] smc_cdc_msg_recv_action: peer consumed 117964, cons_cur {0:117964}, peer_rmbe_space 117964
[ +0.000857] smc_tx_rdma_writes: local produced 117964, prod {1:117964}, peer_rmbe_space 0
[ +0.000734] smc_cdc_msg_recv_action: peer consumed 13108, cons_cur {1:0}, peer_rmbe_space 13108
[ +0.000789] smc_tx_rdma_writes: local produced 13108, prod {2:0}, peer_rmbe_space 0
[ +0.002335] smc_cdc_msg_recv_action: peer consumed 104856, cons_cur {1:104856}, peer_rmbe_space 104856
[ +0.000841] smc_tx_rdma_writes: local produced 104856, prod {2:104856}, peer_rmbe_space 0
[ +0.000754] smc_cdc_msg_recv_action: peer consumed 26216, cons_cur {2:0}, peer_rmbe_space 26216
[ +0.000792] smc_tx_rdma_writes: local produced 26216, prod {3:0}, peer_rmbe_space 0
[ +0.002378] smc_cdc_msg_recv_action: peer consumed 91748, cons_cur {2:91748}, peer_rmbe_space 91748
[ +0.000840] smc_tx_rdma_writes: local produced 91748, prod {3:91748}, peer_rmbe_space 0
[ +0.000726] smc_cdc_msg_recv_action: peer consumed 39324, cons_cur {3:0}, peer_rmbe_space 39324
[ +0.000777] smc_tx_rdma_writes: local produced 39324, prod {4:0}, peer_rmbe_space 0
[ +0.002283] smc_cdc_msg_recv_action: peer consumed 78640, cons_cur {3:78640}, peer_rmbe_space 78640
[ +0.000825] smc_tx_rdma_writes: local produced 78640, prod {4:78640}, peer_rmbe_space 0
[ +0.000747] smc_cdc_msg_recv_action: peer consumed 52432, cons_cur {4:0}, peer_rmbe_space 52432
[ +0.000793] smc_tx_rdma_writes: local produced 52432, prod {5:0}, peer_rmbe_space 0
[ +0.002246] smc_cdc_msg_recv_action: peer consumed 65532, cons_cur {4:65532}, peer_rmbe_space 65532
[ +0.000814] smc_tx_rdma_writes: local produced 65532, prod {5:65532}, peer_rmbe_space 0
[ +0.000743] smc_cdc_msg_recv_action: peer consumed 65540, cons_cur {5:0}, peer_rmbe_space 65540
[ +0.000800] smc_tx_rdma_writes: local produced 65540, prod {6:0}, peer_rmbe_space 0
[ +0.000907] smc_cdc_msg_recv_action: peer consumed 52424, cons_cur {5:52424}, peer_rmbe_space 52424
[ +0.000817] smc_tx_rdma_writes: local produced 52424, prod {6:52424}, peer_rmbe_space 0
[ +0.001039] smc_cdc_msg_recv_action: peer consumed 78648, cons_cur {6:0}, peer_rmbe_space 78648
[ +0.000776] smc_tx_rdma_writes: local produced 78648, prod {7:0}, peer_rmbe_space 0
[ +0.005074] smc_cdc_msg_recv_action: peer consumed 39316, cons_cur {6:39316}, peer_rmbe_space 39316
[ +0.000825] smc_tx_rdma_writes: local produced 39316, prod {7:39316}, peer_rmbe_space 0
[ +0.001213] smc_cdc_msg_recv_action: peer consumed 91756, cons_cur {7:0}, peer_rmbe_space 91756
[ +0.000770] smc_tx_rdma_writes: local produced 91756, prod {8:0}, peer_rmbe_space 0
[ +0.004822] smc_cdc_msg_recv_action: peer consumed 26208, cons_cur {7:26208}, peer_rmbe_space 26208
[ +0.000803] smc_tx_rdma_writes: local produced 13100, prod {8:13100}, peer_rmbe_space 13108
[ +0.003763] smc_cdc_msg_recv_action: peer consumed 104864, cons_cur {8:0}, peer_rmbe_space 117972
[ +0.001634] smc_cdc_msg_recv_action: peer consumed 13100, cons_cur {8:13100}, peer_rmbe_space 131072

====== msgsize 117965 =======
server:
[ +0.002255] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {0:117965}, bytes_to_recv 117965
[ +0.001229] smc_rx_update_consumer: local consumed 117965 B, cons_cur {0:117965}, bytes_to_recv 0
[ +0.000175] smc_cdc_msg_recv_action: peer produced 13107, prod_cur {1:0}, bytes_to_recv 13107
[ +0.000155] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {0:117965}
[ +0.002855] smc_rx_update_consumer: local consumed 13107 B, cons_cur {1:0}, bytes_to_recv 0
[ +0.002232] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {1:117965}, bytes_to_recv 117965
[ +0.001161] smc_rx_update_consumer: local consumed 104858 B, cons_cur {1:104858}, bytes_to_recv 13107
[ +0.001112] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {1:104858}
[ +0.001039] smc_rx_update_consumer: local consumed 13107 B, cons_cur {1:117965}, bytes_to_recv 0
[ +0.002270] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {2:104858}, bytes_to_recv 117965
[ +0.001156] smc_rx_update_consumer: local consumed 104858 B, cons_cur {2:91751}, bytes_to_recv 13107
[ +0.001088] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {2:91751}
[ +0.001365] smc_rx_update_consumer: local consumed 13107 B, cons_cur {2:104858}, bytes_to_recv 0
[ +0.001906] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {3:91751}, bytes_to_recv 117965
[ +0.001136] smc_rx_update_consumer: local consumed 104858 B, cons_cur {3:78644}, bytes_to_recv 13107
[ +0.001109] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {3:78644}
[ +0.001388] smc_rx_update_consumer: local consumed 13107 B, cons_cur {3:91751}, bytes_to_recv 0
[ +0.001952] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {4:78644}, bytes_to_recv 117965
[ +0.001127] smc_rx_update_consumer: local consumed 104858 B, cons_cur {4:65537}, bytes_to_recv 13107
[ +0.001097] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {4:65537}
[ +0.001366] smc_rx_update_consumer: local consumed 13107 B, cons_cur {4:78644}, bytes_to_recv 0
[ +0.001876] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {5:65537}, bytes_to_recv 117965
[ +0.001135] smc_rx_update_consumer: local consumed 104858 B, cons_cur {5:52430}, bytes_to_recv 13107
[ +0.001133] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {5:52430}
[ +0.001376] smc_rx_update_consumer: local consumed 13107 B, cons_cur {5:65537}, bytes_to_recv 0
[ +0.001944] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {6:52430}, bytes_to_recv 117965
[ +0.001149] smc_rx_update_consumer: local consumed 104858 B, cons_cur {6:39323}, bytes_to_recv 13107
[ +0.001131] smc_tx_consumer_update: notify RMB wnd increase 117965,
cons_cur {6:39323} [ +0.001315] smc_rx_update_consumer: local consumed 13107 B, cons_cur {6:52430}, bytes_to_recv 0 [ +0.001906] smc_cdc_msg_recv_action: peer produced 117965, prod_cur {7:39323}, bytes_to_recv 117965 [ +0.001167] smc_rx_update_consumer: local consumed 104858 B, cons_cur {7:26216}, bytes_to_recv 13107 [ +0.001108] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {7:26216} [ +0.001015] smc_rx_update_consumer: local consumed 13107 B, cons_cur {7:39323}, bytes_to_recv 0 [ +0.002255] smc_cdc_msg_recv_action: peer produced 104858, prod_cur {8:13109}, bytes_to_recv 104858 [ +0.001140] smc_rx_update_consumer: local consumed 104858 B, cons_cur {8:13109}, bytes_to_recv 0 [ +0.001079] smc_tx_consumer_update: notify RMB wnd increase 117965, cons_cur {8:13109} client: [ +0.001136] smc_tx_rdma_writes: local produced 117965, prod {0:117965}, peer_rmbe_space 13107 [ +0.001481] smc_tx_rdma_writes: local produced 13107, prod {1:0}, peer_rmbe_space 0 [ +0.004065] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {0:117965}, peer_rmbe_space 117965 [ +0.001183] smc_tx_rdma_writes: local produced 117965, prod {1:117965}, peer_rmbe_space 0 [ +0.004355] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {1:104858}, peer_rmbe_space 117965 [ +0.001205] smc_tx_rdma_writes: local produced 117965, prod {2:104858}, peer_rmbe_space 0 [ +0.004334] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {2:91751}, peer_rmbe_space 117965 [ +0.001227] smc_tx_rdma_writes: local produced 117965, prod {3:91751}, peer_rmbe_space 0 [ +0.004303] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {3:78644}, peer_rmbe_space 117965 [ +0.001175] smc_tx_rdma_writes: local produced 117965, prod {4:78644}, peer_rmbe_space 0 [ +0.004368] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {4:65537}, peer_rmbe_space 117965 [ +0.001196] smc_tx_rdma_writes: local produced 117965, prod {5:65537}, peer_rmbe_space 0 [ +0.004322] smc_cdc_msg_recv_action: peer consumed 
117965, cons_cur {5:52430}, peer_rmbe_space 117965 [ +0.001165] smc_tx_rdma_writes: local produced 117965, prod {6:52430}, peer_rmbe_space 0 [ +0.004455] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {6:39323}, peer_rmbe_space 117965 [ +0.001159] smc_tx_rdma_writes: local produced 117965, prod {7:39323}, peer_rmbe_space 0 [ +0.004317] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {7:26216}, peer_rmbe_space 117965 [ +0.001195] smc_tx_rdma_writes: local produced 104858, prod {8:13109}, peer_rmbe_space 13107 [ +0.004305] smc_cdc_msg_recv_action: peer consumed 117965, cons_cur {8:13109}, peer_rmbe_space 131072 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: SMC-R throughput drops for specific message sizes 2024-02-19 8:44 ` Wen Gu @ 2024-02-27 11:28 ` Iordache Costin (XC-AS/EAE-UK) 0 siblings, 0 replies; 9+ messages in thread From: Iordache Costin (XC-AS/EAE-UK) @ 2024-02-27 11:28 UTC (permalink / raw) To: Wen Gu, Gerd Bayer, D . Wythe, Tony Lu, Nils Hoppmann Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li, Alexandra Winter [-- Attachment #1: Type: text/plain, Size: 17745 bytes --]

Hi Wen Gu,

Cheers for looking into this. We conducted additional experiments to try to get to the bottom of it. For the sake of brevity, I'll refer to the threshold detailed in section 4.5.1 of https://datatracker.ietf.org/doc/html/rfc7609#section-4.5.1 as the "window threshold", or WndTh for short.

Q1: Does force=1 have the same effect as setting the window threshold to 0?
Yes, it does. Figs. 1 and 7 show that this is indeed the case. Furthermore, Fig. 7 suggests that corking has no effect.

Q2: Is the width of the throughput drops proportional to RMB * WndTh%?
Yes, it is. Figs. 1 through 6 show the evolution of the width of the drops as the window threshold increases, with the 50% and 75% cases revealing a contiguous drop with respect to the max throughput (3.2 GB/s). This behaviour demonstrates that the effect of the window threshold is as expected for all cases except 0%, where drops are still observed.

Q3: Assuming the answer to Q2 is affirmative, setting the window threshold to 0 should make the drops disappear.
Fig. 1 demonstrates that this is not the case, which makes us think that there might actually be two (possibly unrelated?) bugs causing the drops.

Q4: Could corking be responsible for the drops?
No. Figs. 7 and 8 still display drops in spite of corking being off.

cheers,
Costin

PS. Apologies for the chunkiness of the plots; we used a coarse sampling step to speed up the experiment.

On 2024/2/5 11:50, Wen Gu wrote:
>
>
> On 2024/2/1 21:50, Iordache Costin (XC-AS/EAE-UK) wrote:
>> Hi,
>>
>> This is Costin, Alex's colleague.
>> We've got additional updates which we thought would be helpful to share with the community.
>>
>> Brief reminder, our hardware/software context is as follows:
>> - 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
>> - only one HCA port is active/connected on each side (QSFP28 cable)
>> - max HCA throughput: 25 Gbps ~ 3.12 GB/s
>> - max/active MTU: 4096
>> - kernel: 6.5.0-14
>> - benchmarking tool: qperf 0.4.11
>>
>> Our goal has been to gauge the SMC-R benefits vs TCP/IP. We are particularly interested in maximizing the throughput
>> whilst reducing CPU utilisation and DRAM memory bandwidth for large data (> 2MB) transfers.
>>
>> Our main issue so far has been that SMC-R halves the throughput for some specific message sizes (as opposed to TCP/IP) -
>> see the "SMC-R vs TCP" plot.
>>
>> Since our last post the kernel was upgraded from 5.4 to 6.5.0-14 hoping it would alleviate the throughput drops, but
>> it did not, so we bit the bullet and delved into the SMC-R code.
>>

> Hi Costin,
>
> FYI, I have also reproduced this in my environment (see attached), with Linux 6.8-rc1
>
> - 2 VMs, each with 2 passthrough ConnectX-4.
> - kernel: Linux 6.8-rc1
> - benchmarking tool: qperf 0.4.11
>
> But it might take me some time to dive deeper into what exactly happened.
>

Hi Costin and community,

I would like to share some findings with you. The performance drop might be related to the SMC window size update strategy: under certain message sizes, RMB utilization may drop and transmission becomes inefficient.

The SMC window size update strategy is described on page 63 of RFC 7609: https://datatracker.ietf.org/doc/html/rfc7609#page-63, and the Linux implementation is smc_tx_consumer_update() in net/smc/smc_tx.c. A consumer cursor update is sent when:

"
The current window size (from a sender's perspective) is less than half of the receive buffer space, and the consumer cursor update will result in a minimum increase in the window size of 10% of the receive buffer space.
" void smc_tx_consumer_update(struct smc_connection *conn, bool force) { <...> if (conn->local_rx_ctrl.prod_flags.cons_curs_upd_req || force || ((to_confirm > conn->rmbe_update_limit) && ((sender_free <= (conn->rmb_desc->len / 2)) || conn->local_rx_ctrl.prod_flags.write_blocked))) { if (conn->killed || conn->local_rx_ctrl.conn_state_flags.peer_conn_abort) return; if ((smc_cdc_get_slot_and_msg_send(conn) < 0) && <- update the window !conn->killed) { queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work, SMC_TX_WORK_DELAY); return; } } <...> } # Test results In my qperf test, one of performance drops can be observed when msgsize changes from (90% * RMB - 1) to (90% * RMB). Take 128KB sndbuf and RMB as an example, #sysctl net.smc.wmem=131072 # both sides #sysctl net.smc.rmem=131072 10% * RMB is 13107 bytes, so compare the qperf results when msgsize is 117964 bytes (128KB - 13107B - 1) and 117965 bytes (128KB - 13107B): - msgsize 117964 # smc_run taskset -c <serv_cpu> qperf # smc_run taskset -c <clnt_cpu> qperf <serv_ip> -m 117964 -t 10 -vu tcp_bw tcp_bw: bw = 4.12 GB/sec msg_size = 118 KB time = 10 sec CPU: %serv_cpu : 1.8 us, 54.6 sy, 0.0 ni, 42.9 id, 0.0 wa, 0.7 hi, 0.0 si, 0.0 st %clnt_cpu : 2.5 us, 40.0 sy, 0.0 ni, 57.1 id, 0.0 wa, 0.4 hi, 0.0 si, 0.0 st - msgsize 117965 # smc_run taskset -c <serv_cpu> qperf # smc_run taskset -c <clnt_cpu> qperf <serv_ip> -m 117965 -t 10 -vu tcp_bw tcp_bw: bw = 2.86 GB/sec msg_size = 118 KB time = 10 sec CPU: %serv_cpu : 1.7 us, 30.0 sy, 0.0 ni, 68.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st %clnt_cpu : 1.0 us, 23.1 sy, 0.0 ni, 75.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st The bandwidth drop is obvious and the CPU utilization is relatively low. 
# Analysis

I traced the copylen in smc_tx_sendmsg and smc_rx_recvmsg:

tracepoint:smc:smc_tx_sendmsg { @tx_len = hist(args->len); }
tracepoint:smc:smc_rx_recvmsg { @rx_len = hist(args->len); }

- msgsize 117964

@rx_len:
[4, 8)             1 |                                                    |
[8, 16)        24184 |@@                                                  |
[16, 32)          40 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     177963 |@@@@@@@@@@@@@@@@@@                                  |
[16K, 32K)    212122 |@@@@@@@@@@@@@@@@@@@@@@                              |
[32K, 64K)    490755 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)   111529 |@@@@@@@@@@@                                         |

@tx_len:
[8, 16)        30768 |@@@                                                 |
[16, 32)          61 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     138977 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)    265929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32K, 64K)    469183 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)   111082 |@@@@@@@@@@@@                                        |

- msgsize 117965

@rx_len:
[4, 8)             1 |                                                    |
[8, 16)            0 |                                                    |
[16, 32)           2 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)     237415 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         0 |                                                    |
[32K, 64K)         0 |                                                    |
[64K, 128K)   237389 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |

@tx_len:
[16, 32)           2 |                                                    |
[32, 64)           0 |                                                    |
[64, 128)          0 |                                                    |
[128, 256)         1 |                                                    |
[256, 512)         0 |                                                    |
[512, 1K)          0 |                                                    |
[1K, 2K)           0 |                                                    |
[2K, 4K)           0 |                                                    |
[4K, 8K)           0 |                                                    |
[8K, 16K)         22 |                                                    |
[16K, 32K)    237357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K)        23 |                                                    |
[64K, 128K)   237390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

According to the trace results, when msgsize is 117964 (good), the Tx and Rx data sizes are distributed between 8K and 128K, but when msgsize is 117965 (bad), the Tx and Rx data sizes are concentrated in the two intervals [16K, 32K) and [64K, 128K).
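The 117964/117965 boundary in these traces follows directly from the 10% figure; a quick sanity check of the arithmetic (our own, not from the kernel source):

```python
RMB = 128 * 1024        # 131072-byte receive buffer
LIMIT = RMB // 10       # 10% update limit: 13107 bytes

# After one full message, the space left at the end of the buffer is
# RMB - msgsize; that is also the size of the wrapped tail piece of
# the next message.
good = RMB - LIMIT - 1  # tail piece is 13108 > LIMIT -> update sent
bad = RMB - LIMIT       # tail piece is 13107 <= LIMIT -> update withheld

print(good, RMB - good)  # 117964 13108
print(bad, RMB - bad)    # 117965 13107
```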
To further see what's going on, I wrote a simple reproducer program (see server.c and client.c in the attachment; they transfer 10 messages of a given size and act like qperf tcp_bw) and added some debug information to the SMC kernel code (see 0001-debug-record-cursor-update-process.patch in the attachment). Then run the reproducer and check the dmesg:

# smc_run ./server -p <serv_port> -m <msgsize: 117964 or 117965>
# smc_run ./client -i <serv_ip> -p <serv_port> -m <msgsize: 117964 or 117965>

From the dmesg debug information (see dmesg.txt in the attachment), we can recover how data is produced and consumed in the RMB when msgsize is 117964 or 117965 (see diagram.svg in the attachment, built from the server-side dmesg information).

(Please check the above diagram and dmesg information first)

When msgsize is 117964 (good), each time the server consumes data in the RMB, the window size is reported back to the client in time (the cfed cursor in Linux), because the size of the data consumed is always larger than 10% * RMB, and then new data can be produced in the RMB.

However, when msgsize is 117965 (bad), the window is only updated once the entire 117965 bytes have been consumed, since the first part of each 117965-byte message is never larger than 10% * RMB, which is 13107 (see diagram.svg for details). This results in low utilization of the RMB.

To verify this, I made a simple modification to the SMC kernel code, forcing the receiver to always update the window size. (However, this modification is far from a solution for the periodic drop issue: the SMC window cannot always be updated, for silly window syndrome avoidance.)
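For context, the receiver-side silly window syndrome avoidance rule that makes an unconditional force=true undesirable is the classic one from RFC 1122 (Section 4.2.3.3, Clark's algorithm): only advertise newly freed space once it amounts to at least a full segment or half the buffer. A minimal TCP-style sketch (SMC applies the analogous idea to the RMB):

```python
def may_advertise_window_increase(freed_bytes, mss, bufsize):
    """Receiver-side SWS avoidance (RFC 1122, 4.2.3.3): suppress window
    updates until the window can move right by at least min(MSS, buf/2),
    so the peer is never coaxed into sending tiny segments."""
    return freed_bytes >= min(mss, bufsize // 2)

print(may_advertise_window_increase(500, 1460, 65536))   # False
print(may_advertise_window_increase(1460, 1460, 65536))  # True
```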
" diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c index 9a2f3638d161..6a4d8041d6b5 100644 --- a/net/smc/smc_rx.c +++ b/net/smc/smc_rx.c @@ -61,7 +61,7 @@ static int smc_rx_update_consumer(struct smc_sock *smc, { struct smc_connection *conn = &smc->conn; struct sock *sk = &smc->sk; - bool force = false; + bool force = true; int diff, rc = 0; smc_curs_add(conn->rmb_desc->len, &cons, len); " Then run qperf test under message size of 117964 and 117965: # smc_run taskset -c <cpu> qperf <ip> -m 117964 -t 10 -vu tcp_bw tcp_bw: bw = 4.13 GB/sec msg_size = 118 KB time = 10 sec # smc_run taskset -c <cpu> qperf <ip> -m 117965 -t 10 -vu tcp_bw tcp_bw: bw = 4.18 GB/sec msg_size = 118 KB time = 10 sec The performance drop disapper, and copylen tracing results under the two msgsize are similar: - msgsize 117964 @rx_len: [4, 8) 1 | | [8, 16) 7521 | | [16, 32) 422 | | [32, 64) 0 | | [64, 128) 0 | | [128, 256) 1 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 225144 |@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 311010 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 571816 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 62061 |@@@@@ | - msgsize 117965 @rx_len: [2, 4) 54360 |@@@@ | [4, 8) 1025 | | [8, 16) 2 | | [16, 32) 2 | | [32, 64) 0 | | [64, 128) 0 | | [128, 256) 1 | | [256, 512) 0 | | [512, 1K) 0 | | [1K, 2K) 0 | | [2K, 4K) 0 | | [4K, 8K) 0 | | [8K, 16K) 214217 |@@@@@@@@@@@@@@@@@@ | [16K, 32K) 288575 |@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 593644 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 50187 |@@@@ | Hope the above analysis can be of some help for you. Thanks, Wen Gu >> The SMC-R source code revealed that __smc_buf_create / smc_compress_bufsize functions are in charge of computing the >> size of the RMB buffer and allocating either physical or virtual contiguous memory. We suspected that the throughput >> drops were caused by the size of this buffer being too small. 
>> >> We set out to determine whether there is a correlation between the drops and the size of the RMB buffer, and for that >> we set the size of the RMB buffer to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput for >> different message size ranges. >> >> The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the >> RMB buffer. Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real >> root of the drops might lie somewhere else in the SMC-R code. We are suspecting that, for reasons unknown to us, the >> CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances. >> >> cheers, >> Costin. >> >> PS. for the sake of brevity many details have been omitted on purpose but we'd be happy to provide them if need be, >> e.g. by default the RMB buffer size is capped to 512KB so we remove the cap and recompile the SMC module; we use >> alternative tools such as iperf and iperf 3 for benchmarking to dismiss the possibility of the drops to be tool >> specific; corking has been disabled; etc. [-- Attachment #2: threshold_cork.png --] [-- Type: image/png, Size: 205783 bytes --] ^ permalink raw reply related [flat|nested] 9+ messages in thread
[parent not found: <GV2PR10MB8037B30A9D2CE67F267D5E61BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>]
[parent not found: <GV2PR10MB80376BEB9EE8E03F98CC86A1BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM>]
* Re: SMC-R throughput drops for specific message sizes [not found] ` <GV2PR10MB80376BEB9EE8E03F98CC86A1BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM> @ 2024-03-28 12:18 ` Goerlitz Andreas (SO/PAF1-Mb) 0 siblings, 0 replies; 9+ messages in thread From: Goerlitz Andreas (SO/PAF1-Mb) @ 2024-03-28 12:18 UTC (permalink / raw) To: Wen Gu; +Cc: linux-s390, netdev [-- Attachment #1: Type: text/plain, Size: 4967 bytes --]

Hello Wen Gu and community,

our group has performed further experiments with SMC-R. The experiments discussed below were run on two Mellanox-powered (mlx5, ConnectX-5) PCs with the following configuration:

Kernel 6.5.0-25-generic
MTU 9000
net.smc.wmem = $((256*1024))
net.smc.rmem = $((256*1024))
net.smc.autocorking_size = 65536
net.smc.smcr_buf_type = 1
Bandwidth ~ 3.2 GB/s (25.0 Gbit/s)

We modified your server.c (consumer) and client.c (producer) to estimate the throughput and observed that the "msgsize" of the consumer seems to be mainly responsible for the throughput drops, as shown below.
Good cases (server/consumer msgsize <= RMBE/2):
-----------------------------------------------
server: smc_run ./server -p 12345 -m $((128*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
        Sent 261881856 bytes in 82224.819000 us [3.184939 GB/s]

server: smc_run ./server -p 12345 -m $((128*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
        Sent 261881856 bytes in 82097.127000 us [3.189892 GB/s]

Bad cases (server/consumer msgsize > RMBE/2):
-----------------------------------------------
server: smc_run ./server -p 12345 -m $((256*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
        Sent 261881856 bytes in 130970.306000 us [1.999545 GB/s]

server: smc_run ./server -p 12345 -m $((256*1024))
client: smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
        Sent 130940928 bytes in 88172.887000 us [1.485037 GB/s]

Our explanation is that in the "bad cases" producer and consumer act synchronously in the following sense: The producer is sending messages (e.g., msgsize = RMBE on producer side), and at some point, it must wait until the consumer processes some of its RMBE, and answers with a CDC message. During this time, the producer is blocked (since the RMBE of the consumer is full). In case the consumer processes the entire RMBE (i.e., msgsize = RMBE on consumer side), it is then also blocked, as there is nothing left to be processed anymore - i.e. it must wait for the producer. We believe/suspect that this (unintended) synchronization leads to the throughput drops.
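A crude back-of-envelope model of this synchronization effect (our own simplification, not derived from the SMC code): as long as the consumer drains no more than RMBE/2 between CDC updates, producing and consuming can overlap; once the drained chunk exceeds RMBE/2, the two phases increasingly serialize:

```python
def relative_throughput(consume_chunk, rmbe):
    """Toy model: fraction of peak bandwidth when the consumer reads
    `consume_chunk` bytes between CDC updates on an `rmbe`-byte buffer.
    Up to RMBE/2 the producer can refill one half while the other is
    drained; beyond that, producer and consumer take turns waiting."""
    if consume_chunk <= rmbe // 2:
        return 1.0
    return rmbe / (rmbe + consume_chunk)

RMBE = 256 * 1024
print(relative_throughput(RMBE // 2, RMBE))  # 1.0
print(relative_throughput(RMBE, RMBE))       # 0.5
```

The fully synchronous case predicts half the peak rate, which is in the same ballpark as the measured 1.5-2.0 GB/s against the 3.2 GB/s maximum.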
To force the consumer to process smaller messages, reply to the producer (CDC) faster, and still be able to process some remaining data (i.e., to avoid being blocked), we cap the value of len to RMBE/2 in smc_rx_recvmsg:

--- a/net/smc/smc_rx.c	2024-03-25 12:31:32.264614422 +0100
+++ b/net/smc/smc_rx.c	2024-03-25 12:22:31.989913322 +0100
@@ -344,7 +344,7 @@
 int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
 		   struct pipe_inode_info *pipe, size_t len, int flags)
 {
-	size_t copylen, read_done = 0, read_remaining = len;
+	size_t copylen, read_remaining, read_done = 0;
 	size_t chunk_len, chunk_off, chunk_len_sum;
 	struct smc_connection *conn = &smc->conn;
 	int (*func)(struct smc_connection *conn);
@@ -363,6 +363,10 @@
 	sk = &smc->sk;
 	if (sk->sk_state == SMC_LISTEN)
 		return -ENOTCONN;
+
+	len = min_t(size_t, len, conn->rmb_desc->len / 2);
+	read_remaining = len;
+
 	if (flags & MSG_OOB)
 		return smc_rx_recv_urg(smc, msg, len, flags);
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);

We ran qperf experiments (as before) on the standard SMC-R module [std] (kernel 6.5.0-25-generic), Wen Gu's proposal [wengu] (i.e. setting force = true), and our proposal [our] (i.e. capping len to RMBE/2). The measured throughput is shown in subplots (a) of the appended figures. Additionally, we traced

tracepoint:smc:smc_tx_sendmsg { @tx_ret = lhist(args->len,0,262144,16384); }
tracepoint:smc:smc_rx_recvmsg { @rx_ret = lhist(args->len,0,262144,16384); }

and calculated the percentage of rx_ret and tx_ret values greater than RMBE/2 - shown in subplots (b) and (c) respectively. As can be observed, there seems to be a correlation between a drop in throughput and rx_ret being greater than RMBE/2. This is avoided in our proposal, and full throughput is achieved.

We hope that our analysis and interpretation can help to solve the issue with the throughput drops in SMC-R.
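The effect of the proposed cap can be sketched as follows: the kernel-side min_t() turns one large application-level read into several bounded kernel-level reads, each of which leaves at least half the RMBE available (a simplified model; the real recvmsg() also stops at whatever data is currently available):

```python
def capped_reads(msgsize, rmbe=256 * 1024):
    """Split one application-level read of `msgsize` bytes into the
    sequence of kernel-level reads produced by capping copylen at
    RMBE/2. The application's own read loop (as in stream_recv() in
    server.c) re-issues the call until msgsize bytes have arrived."""
    cap = rmbe // 2
    reads, remaining = [], msgsize
    while remaining > 0:
        n = min(remaining, cap)
        reads.append(n)
        remaining -= n
    return reads

print(capped_reads(256 * 1024))  # [131072, 131072]
```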
p.s., I would like to acknowledge all individuals from our team who contributed to the analysis of SMC-R (sorted by last name):

Soumyadeep Debnath
Andreas Görlitz
Costin Iordache
Alexandros Nikolaou
Maik Riestock
Ievgen Tatolov

Mit freundlichen Grüßen / Best regards

Andreas Goerlitz (SO/PAF1-Mb)

Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 | 39104 Magdeburg | GERMANY | [www.boschservicesolutions.com]www.boschservicesolutions.com Andreas.Goerlitz@de.bosch.com

Sitz: Magdeburg, Registergericht: Amtsgericht Stendal, HRB 24039
Geschäftsführung: Robert Mulatz, Georg Wessels

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: client.c --]
[-- Type: text/x-csrc; name="client.c", Size: 2612 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <errno.h>
#include <netinet/tcp.h>
#include <time.h>

#ifndef AF_SMC
#define AF_SMC 43
#endif

#define NET_PROTOCAL AF_INET
#define SERV_IP "11.213.5.33"
#define SERV_PORT 10012
#define BUF_SIZE (5 * 128 * 1024)

int stream_send(int fd, char *buf, int msgsize)
{
	int n = msgsize;

	while (n) {
		int i = write(fd, buf, n);

		if (i < 0)
			return i;
		buf += i;
		n -= i;
		if (i == 0)
			break;
	}
	return msgsize - n;
}

int net_clnt(char *ip, int port, int msgsize, int msgcnt)
{
	struct timespec start, end;
	double elapsed_us = 0.0;
	double elapsed_us_sum = 0.0;
	int sent;
	long num_bytes = 0;
	double gb_s;

	if (!ip)
		ip = SERV_IP;
	if (!port)
		port = SERV_PORT;

	int sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);
	struct sockaddr_in s_addr;

	memset(&s_addr, 0, sizeof(s_addr));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = inet_addr(ip);
	s_addr.sin_port = htons(port);

	if (connect(sock, (struct sockaddr *)&s_addr, sizeof(s_addr))) {
		printf("connect fail\n");
		return 0;
	}

	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);

	while (--msgcnt) {
		if (msgsize > BUF_SIZE)
			break;
printf("Send msgsize: %d\n", msgsize); clock_gettime(CLOCK_MONOTONIC, &start); sent = stream_send(sock, buf, msgsize); clock_gettime(CLOCK_MONOTONIC, &end); if (send <= 0) { printf("Error send %d\n", sent); break; } elapsed_us = (end.tv_sec - start.tv_sec)*1000000.0; elapsed_us += (end.tv_nsec - start.tv_nsec) / 1000.0; elapsed_us_sum += elapsed_us; num_bytes += sent; } close(sock); gb_s = (num_bytes/1000) / elapsed_us_sum; printf("Sent %ld bytes in %f us [%f GB/s]\n", num_bytes, elapsed_us_sum, gb_s); return 0; } int main(int argc, char **argv){ int msgsize = BUF_SIZE, msgcnt = 10; char *ip = NULL; bool wrong_param = false; int port = 0; int c; while(!wrong_param && (-1 != (c = getopt(argc, argv, "i:p:m:c:")))) { switch (c) { case 'i': ip = optarg; break; case 'p': port = atoi(optarg); break; case 'm': msgsize = atoi(optarg); break; case 'c': msgcnt = atoi(optarg); break; case '?': printf("usage: ./client -i <ip> -p <port> -m <msgsize> -c <cnt>\n"); wrong_param = true; break; } } if (!wrong_param) net_clnt(ip, port, msgsize, msgcnt); return 0; } [-- Attachment #3: results_our.png --] [-- Type: image/png, Size: 147628 bytes --] [-- Attachment #4: results_std.png --] [-- Type: image/png, Size: 165763 bytes --] [-- Attachment #5: results_wengu.png --] [-- Type: image/png, Size: 171372 bytes --] [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #6: server.c --] [-- Type: text/x-csrc; name="server.c", Size: 2403 bytes --] #include <stdio.h> #include <string.h> #include <stdlib.h> #include <unistd.h> #include <arpa/inet.h> #include <sys/socket.h> #include <netinet/in.h> #include <errno.h> #include <stdbool.h> #include <netinet/tcp.h> #include <pthread.h> #ifndef AF_SMC #define AF_SMC 43 #endif #define NET_PROTOCAL AF_INET #define SERV_IP INADDR_ANY #define SERV_PORT 10012 #define BUF_SIZE (5 * 128 * 1024) int stream_recv(int fd, char *buf, int msgsize) { int n = msgsize; while (n) { int i = read(fd, buf, n); if (i < 0) return i; buf += i; 
		n -= i;
		if (i == 0)
			break;
		//printf("Successfully recv %d B message\n", i);
	}
	return msgsize - n;
}

int net_serv(int port, int msgsize)
{
	int recv;

	if (!port)
		port = SERV_PORT;

	int l_sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);
	struct sockaddr_in s_addr;

	memset(&s_addr, 0, sizeof(struct sockaddr_in));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = SERV_IP;
	s_addr.sin_port = htons(port);

	// bind listen socket
	if (bind(l_sock, (struct sockaddr *)&s_addr, sizeof(s_addr))) {
		printf("bind listen socket error %d\n", errno);
		return 0;
	}

	// listen
	if (listen(l_sock, 20)) {
		printf("listen error\n");
		return 0;
	}

	struct sockaddr_in c_addr;
	socklen_t c_addr_len = sizeof(c_addr);
	int s_sock = accept(l_sock, (struct sockaddr *)&c_addr, &c_addr_len);

	if (s_sock < 0) {
		printf("accept fail\n");
		return 0;
	} else {
		char ip[16] = { 0 };

		inet_ntop(NET_PROTOCAL, &(c_addr.sin_addr), ip, INET_ADDRSTRLEN);
		printf("accept connection: ip %s port %d\n", ip, c_addr.sin_port);
	}

	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);

	while (1) {
		if (msgsize > BUF_SIZE)
			break;
		//printf("Recv msgsize: %d\n", msgsize);
		recv = stream_recv(s_sock, buf, msgsize);
		if (recv <= 0) {
			if (recv)
				printf("Error recv %d\n", recv);
			break;
		}
	}
	printf("done\n");

	close(s_sock);
	close(l_sock);
	return 0;
}

int main(int argc, char **argv)
{
	bool wrong_param = false;
	int msgsize = BUF_SIZE;
	int port = 0;
	int c;

	while (!wrong_param && (-1 != (c = getopt(argc, argv, "p:m:")))) {
		switch (c) {
		case 'p':
			port = atoi(optarg);
			break;
		case 'm':
			msgsize = atoi(optarg);
			break;
		case '?':
			printf("usage: ./server -p <port> -m <msgsize>\n");
			wrong_param = true;
			break;
		}
	}

	if (!wrong_param)
		net_serv(port, msgsize);
	return 0;
}

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: SMC-R throughput drops for specific message sizes @ 2024-02-01 10:30 Iordache Costin (XC-AS/EAE-UK) 0 siblings, 0 replies; 9+ messages in thread From: Iordache Costin (XC-AS/EAE-UK) @ 2024-02-01 10:30 UTC (permalink / raw) To: netdev [-- Attachment #1.1: Type: text/plain, Size: 2456 bytes --]

Hi,

This is Costin, Alex's colleague. We've got additional updates which we thought would be helpful to share with the community.

Brief reminder, our hardware/software context is as follows:
* 2 PCs, each equipped with one Mellanox ConnectX-5 HCA (MT27800), dual port
* only one HCA port is active/connected on each side (QSFP28 cable)
* max HCA throughput: 25 Gbps ~ 3.12 GB/s
* max/active MTU: 4096
* kernel: 6.5.0-14
* benchmarking tool: qperf 0.4.11

Our goal has been to gauge the SMC-R benefits vs TCP/IP. We are particularly interested in maximizing the throughput whilst reducing CPU utilisation and DRAM memory bandwidth for large data (> 2MB) transfers.

Our main issue so far has been that SMC-R halves the throughput for some specific message sizes (as opposed to TCP/IP) - see the "SMC-R vs TCP" plot.

Since our last post the kernel was upgraded from 5.4 to 6.5.0-14 hoping it would alleviate the throughput drops, but it did not, so we bit the bullet and delved into the SMC-R code.

The SMC-R source code revealed that the __smc_buf_create / smc_compress_bufsize functions are in charge of computing the size of the RMB buffer and allocating either physically or virtually contiguous memory. We suspected that the throughput drops were caused by the size of this buffer being too small.

We set out to determine whether there is a correlation between the drops and the size of the RMB buffer, and for that we set the size of the RMB buffer to 128KB, 256KB, 512KB, 1MB, 2MB, 4MB and 8MB and benchmarked the throughput for different message size ranges. The attached plot collates the benchmark results and shows that the period of the drops coincides with the size of the RMB buffer.
Whilst increasing the size of the buffer seems to attenuate the throughput drops, we believe that the real root of the drops might lie somewhere else in the SMC-R code. We suspect that, for reasons unknown to us, the CDC messages that are sent after the RDMA WRITE operation are delayed in some circumstances.

cheers,
Costin.

PS. For the sake of brevity many details have been omitted on purpose, but we'd be happy to provide them if need be. E.g. by default the RMB buffer size is capped to 512KB, so we removed the cap and recompiled the SMC module; we used alternative tools such as iperf and iperf3 for benchmarking to rule out the possibility of the drops being tool-specific; corking was disabled; etc.

[-- Attachment #1.2: Type: text/html, Size: 9916 bytes --]
[-- Attachment #2: smc_qperf_merged-4.jpg --]
[-- Type: image/jpeg, Size: 291219 bytes --]

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: SMC-R throughput drops for specific message sizes @ 2023-12-13 15:52 Nikolaou Alexandros (SO/PAF1-Mb) 0 siblings, 0 replies; 9+ messages in thread From: Nikolaou Alexandros (SO/PAF1-Mb) @ 2023-12-13 15:52 UTC (permalink / raw) To: Wen Gu, Gerd Bayer, D . Wythe, Tony Lu, Nils Hoppmann Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li Dear Gerd, Thank you for directing this matter to the most relevant group and individuals. Your support is greatly appreciated. We're actively delving deeper into the issue to gain further insights as well. I'll ensure to keep this thread updated with any new findings as we progress. Should you need any additional information, please feel free to reach out. Mit freundlichen Grüßen / Best regards Alexandros Nikolaou Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 | 39104 Magdeburg | GERMANY | [www.boschservicesolutions.com]www.boschservicesolutions.com Alexandros.Nikolaou@de.bosch.com Sitz: Magdeburg, Registergericht: Amtsgericht Stendal, HRB 24039 Geschäftsführung: Robert Mulatz, Daniel Meyer From: Wen Gu <guwen@linux.alibaba.com> Sent: Wednesday, December 13, 2023 14:38 To: Gerd Bayer <gbayer@linux.ibm.com>; Nikolaou Alexandros (SO/PAF1-Mb) <Alexandros.Nikolaou@de.bosch.com>; D . Wythe <alibuda@linux.alibaba.com>; Tony Lu <tonylu@linux.alibaba.com>; Nils Hoppmann <niho@linux.ibm.com> Cc: linux-s390@vger.kernel.org <linux-s390@vger.kernel.org>; netdev <netdev@vger.kernel.org>; Wenjia Zhang <wenjia@linux.ibm.com>; Jan Karcher <jaka@linux.ibm.com>; Dust Li <dust.li@linux.alibaba.com> Subject: Re: SMC-R throughput drops for specific message sizes On 2023/12/13 20:17, Gerd Bayer wrote: > Hi Nikolaou, > > thank you for providing more details about your setup. > > On Wed, 2023-12-06 at 15:28 +0000, Nikolaou Alexandros (SO/PAF1-Mb) > wrote: >> Dear Wenjia, > > while Wenjia is out, I'm writing primarily to getting some more folks' > attention to this topic. 
Furthermore, I'm moving the discussion to the > netdev mailing list where SMC discussions usually take place. > >> Thanks for getting back to me. Some further details on the >> experiments are: >> >> - The tests had been conducted on a one-to-one connection between two >> Mellanox-powered (mlx5, ConnectX-5) PCs. >> - Attached you may find the client log of the qperf output. You may >> notice that for the majority of message size values, the bandwidth is >> around 3.2GB/s which matches the maximum throughput of the >> mellanox NICs. >> According to a periodic regular pattern though, with the first >> occurring at a message size of 473616 – 522192 (with a step of >> 12144kB), the 3.2GB/s throughput drops substantially. The >> corresponding commands for these drops are >> server: smc_run qperf >> client: smc_run qperf -v -uu -H worker1 -m 473616 tcp_bw >> - Our smc version (3E92E1460DA96BE2B2DDC2F, smc-tools-1.2.2) does not >> provide us with the smcr info, smc_rnics -a and smcr -d >> stats commands. As an alternative, you may also find attached the >> output of ibv_devinfo -v. >> - Buffer size: >> sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 6291456" >> sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 6291456" >> - MTU size: 9000 >> >> Should you require further information, please let me know. > > Wenjia and I belong to a group of Linux on Z developers that maintains > the SMC protocol on s390 mainframe systems. Nils Hoppmann is our expert > for performance and might be able to shed some light on his experiences > with throughput drops for particular SMC message sizes. Our experience > is heavily biased towards IBM Z systems, though - with their distinct > cache and PCI root-complex hardware designs. > > Over the last few years there's a group around D. Wythe, Wen Gu and > Tony Lu who adopted and extended the SMC protocol for use-cases on x86 > architectures. I address them here explicitly, soliciting feedback on > their experiences. Certainly. 
Our team will take a closer look into this matter as well. We intend to review the thread thoroughly and conduct an analysis within our environment. Updates and feedback will be provided in this thread. > > All in all there are several moving parts involved here, that could > play a role: > - firmware level of your Mellanox/NVidia NICs, > - platform specific hardware designs re. cache and root-complexes, > interrupt distribution, ... > - exact code level of the device drivers and the SMC protocol > > This is just a heads-up, that there may be requests to try things with > newer code levels ;) > > Thank you, > Gerd > > -- > Gerd Bayer > Linux on IBM Z Development - IBM Germany R&D ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <PAWPR10MB72701758A24DD8DF8063BEE6C081A@PAWPR10MB7270.EURPRD10.PROD.OUTLOOK.COM>]
[parent not found: <ccc03f00-02ee-4af6-8e57-b6de3bc019be@linux.ibm.com>]
[parent not found: <PAWPR10MB7270731C91544AEF25E0A33CC084A@PAWPR10MB7270.EURPRD10.PROD.OUTLOOK.COM>]
* Re: SMC-R throughput drops for specific message sizes [not found] ` <PAWPR10MB7270731C91544AEF25E0A33CC084A@PAWPR10MB7270.EURPRD10.PROD.OUTLOOK.COM> @ 2023-12-13 12:17 ` Gerd Bayer 2023-12-13 13:38 ` Wen Gu 0 siblings, 1 reply; 9+ messages in thread From: Gerd Bayer @ 2023-12-13 12:17 UTC (permalink / raw) To: Nikolaou Alexandros (SO/PAF1-Mb), D . Wythe, Wen Gu, Tony Lu, Nils Hoppmann Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher Hi Nikolaou, thank you for providing more details about your setup. On Wed, 2023-12-06 at 15:28 +0000, Nikolaou Alexandros (SO/PAF1-Mb) wrote: > Dear Wenjia, while Wenjia is out, I'm writing primarily to getting some more folks' attention to this topic. Furthermore, I'm moving the discussion to the netdev mailing list where SMC discussions usually take place. > Thanks for getting back to me. Some further details on the > experiments are: > > - The tests had been conducted on a one-to-one connection between two > Mellanox-powered (mlx5, ConnectX-5) PCs. > - Attached you may find the client log of the qperf output. You may > notice that for the majority of message size values, the bandwidth is > around 3.2GB/s which matches the maximum throughput of the > mellanox NICs. > According to a periodic regular pattern though, with the first > occurring at a message size of 473616 – 522192 (with a step of > 12144kB), the 3.2GB/s throughput drops substantially. The > corresponding commands for these drops are > server: smc_run qperf > client: smc_run qperf -v -uu -H worker1 -m 473616 tcp_bw > - Our smc version (3E92E1460DA96BE2B2DDC2F, smc-tools-1.2.2) does not > provide us with the smcr info, smc_rnics -a and smcr -d > stats commands. As an alternative, you may also find attached the > output of ibv_devinfo -v. > - Buffer size: > sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 6291456" > sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 6291456" > - MTU size: 9000 > > Should you require further information, please let me know. 
Wenjia and I belong to a group of Linux on Z developers that maintains the SMC protocol on s390 mainframe systems. Nils Hoppmann is our expert for performance and might be able to shed some light on his experiences with throughput drops for particular SMC message sizes. Our experience is heavily biased towards IBM Z systems, though - with their distinct cache and PCI root-complex hardware designs. Over the last few years there's a group around D. Wythe, Wen Gu and Tony Lu who adopted and extended the SMC protocol for use-cases on x86 architectures. I address them here explicitly, soliciting feedback on their experiences. All in all there are several moving parts involved here, that could play a role: - firmware level of your Mellanox/NVidia NICs, - platform specific hardware designs re. cache and root-complexes, interrupt distribution, ... - exact code level of the device drivers and the SMC protocol This is just a heads-up, that there may be requests to try things with newer code levels ;) Thank you, Gerd -- Gerd Bayer Linux on IBM Z Development - IBM Germany R&D ^ permalink raw reply [flat|nested] 9+ messages in thread
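For anyone wanting to try this on newer code levels: the reproduction commands quoted above can be turned into a sweep. A minimal sketch follows; the `smc_run qperf ... tcp_bw` invocation and the hostname `worker1` are taken from the quoted setup, while the 448KB-576KB range and 16KB step are our own choice to bracket the first reported drop window (473616-522192 bytes).

```python
# Sketch: emit the client-side qperf commands for a message-size sweep
# bracketing the first reported drop window (473616..522192 bytes).
# Assumes the server already runs `smc_run qperf` and that "worker1"
# (hostname from the quoted commands) resolves to it.

def sweep_commands(start=448 * 1024, stop=576 * 1024, step=16 * 1024):
    """One tcp_bw measurement per message size, as in the quoted setup."""
    return [
        f"smc_run qperf -v -uu -H worker1 -m {m} tcp_bw"
        for m in range(start, stop + 1, step)
    ]


for cmd in sweep_commands():
    print(cmd)
```

Running the printed commands on the client, and the same sweep without the `smc_run` wrapper for the TCP/IP baseline, reproduces a bandwidth-vs-message-size curve that can be compared against the attached plots.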
* Re: SMC-R throughput drops for specific message sizes 2023-12-13 12:17 ` Gerd Bayer @ 2023-12-13 13:38 ` Wen Gu 0 siblings, 0 replies; 9+ messages in thread From: Wen Gu @ 2023-12-13 13:38 UTC (permalink / raw) To: Gerd Bayer, Nikolaou Alexandros (SO/PAF1-Mb), D . Wythe, Tony Lu, Nils Hoppmann Cc: linux-s390, netdev, Wenjia Zhang, Jan Karcher, Dust Li On 2023/12/13 20:17, Gerd Bayer wrote: > Hi Nikolaou, > > thank you for providing more details about your setup. > > On Wed, 2023-12-06 at 15:28 +0000, Nikolaou Alexandros (SO/PAF1-Mb) > wrote: >> Dear Wenjia, > > while Wenjia is out, I'm writing primarily to getting some more folks' > attention to this topic. Furthermore, I'm moving the discussion to the > netdev mailing list where SMC discussions usually take place. > >> Thanks for getting back to me. Some further details on the >> experiments are: >> >> - The tests had been conducted on a one-to-one connection between two >> Mellanox-powered (mlx5, ConnectX-5) PCs. >> - Attached you may find the client log of the qperf output. You may >> notice that for the majority of message size values, the bandwidth is >> around 3.2GB/s which matches the maximum throughput of the >> mellanox NICs. >> According to a periodic regular pattern though, with the first >> occurring at a message size of 473616 – 522192 (with a step of >> 12144kB), the 3.2GB/s throughput drops substantially. The >> corresponding commands for these drops are >> server: smc_run qperf >> client: smc_run qperf -v -uu -H worker1 -m 473616 tcp_bw >> - Our smc version (3E92E1460DA96BE2B2DDC2F, smc-tools-1.2.2) does not >> provide us with the smcr info, smc_rnics -a and smcr -d >> stats commands. As an alternative, you may also find attached the >> output of ibv_devinfo -v. >> - Buffer size: >> sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 6291456" >> sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 6291456" >> - MTU size: 9000 >> >> Should you require further information, please let me know. 
> > Wenjia and I belong to a group of Linux on Z developers that maintains > the SMC protocol on s390 mainframe systems. Nils Hoppmann is our expert > for performance and might be able to shed some light on his experiences > with throughput drops for particular SMC message sizes. Our experience > is heavily biased towards IBM Z systems, though - with their distinct > cache and PCI root-complex hardware designs. > > Over the last few years there's a group around D. Wythe, Wen Gu and > Tony Lu who adopted and extended the SMC protocol for use-cases on x86 > architectures. I address them here explicitly, soliciting feedback on > their experiences. Certainly. Our team will take a closer look into this matter as well. We intend to review the thread thoroughly and conduct an analysis within our environment. Updates and feedback will be provided in this thread. > > All in all there are several moving parts involved here, that could > play a role: > - firmware level of your Mellanox/NVidia NICs, > - platform specific hardware designs re. cache and root-complexes, > interrupt distribution, ... > - exact code level of the device drivers and the SMC protocol > > This is just a heads-up, that there may be requests to try things with > newer code levels ;) > > Thank you, > Gerd > > -- > Gerd Bayer > Linux on IBM Z Development - IBM Germany R&D ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-03-28 12:18 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-02-01 13:50 SMC-R throughput drops for specific message sizes Iordache Costin (XC-AS/EAE-UK) 2024-02-05 3:50 ` Wen Gu 2024-02-19 8:44 ` Wen Gu 2024-02-27 11:28 ` Iordache Costin (XC-AS/EAE-UK) [not found] <GV2PR10MB8037B30A9D2CE67F267D5E61BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM> [not found] ` <GV2PR10MB80376BEB9EE8E03F98CC86A1BB3B2@GV2PR10MB8037.EURPRD10.PROD.OUTLOOK.COM> 2024-03-28 12:18 ` Goerlitz Andreas (SO/PAF1-Mb) -- strict thread matches above, loose matches on Subject: below -- 2024-02-01 10:30 Iordache Costin (XC-AS/EAE-UK) 2023-12-13 15:52 Nikolaou Alexandros (SO/PAF1-Mb) [not found] <PAWPR10MB72701758A24DD8DF8063BEE6C081A@PAWPR10MB7270.EURPRD10.PROD.OUTLOOK.COM> [not found] ` <ccc03f00-02ee-4af6-8e57-b6de3bc019be@linux.ibm.com> [not found] ` <PAWPR10MB7270731C91544AEF25E0A33CC084A@PAWPR10MB7270.EURPRD10.PROD.OUTLOOK.COM> 2023-12-13 12:17 ` Gerd Bayer 2023-12-13 13:38 ` Wen Gu