* SMC-R problem under multithread
@ 2022-05-30  3:16 liuyacan
  2022-05-30  5:10 ` Tony Lu
  0 siblings, 1 reply; 5+ messages in thread
From: liuyacan @ 2022-05-30  3:16 UTC (permalink / raw)
  To: kgraul, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel, ubraun, tonylu

Hi experts,

  I recently used memcached to compare the performance of SMC-R with
  TCP, but the results confuse me. With multiple threads on the server
  side, SMC-R does not perform as well as TCP.

  Specifically, I tested four server-thread counts: 1, 2, 4, and 8. The
  client always uses 8 threads.
  
  server: (smc_run) memcached -t 1 -m 16384 -p [SERVER-PORT] -U 0 -F -c 10240 -o modern
  client: (smc_run) memtier_benchmark -s [SERVER-IP] -p [SERVER-PORT] -P memcache_text --random-data --data-size=100 --data-size-pattern=S --key-minimum=30 --key-maximum=100  -n 5000000 -t 8
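
  (For context: smc_run from smc-tools LD_PRELOADs a library that turns
  the unmodified binary's TCP stream sockets into AF_SMC sockets. A
  minimal sketch of what that amounts to; the AF_SMC and SMCPROTO_SMC
  values are taken from the kernel's SMC headers, and the function name
  is just an illustration:)

  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  #ifndef AF_SMC
  #define AF_SMC 43         /* address family, include/linux/socket.h */
  #endif
  #define SMCPROTO_SMC 0    /* SMC with IPv4 peer addresses */

  int smc_connect_example(const char *ip, unsigned short port)
  {
          struct sockaddr_in addr;
          int fd = socket(AF_SMC, SOCK_STREAM, SMCPROTO_SMC);

          if (fd < 0)
                  return -1;
          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;  /* SMC sockets take INET addresses */
          addr.sin_port = htons(port);
          inet_pton(AF_INET, ip, &addr.sin_addr);
          /* the kernel falls back to plain TCP if the peer lacks SMC */
          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                  return -1;
          return fd;                  /* use like any TCP socket */
  }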
  
  The result is as follows:
  
  SMC-R:
  
  server-thread    ops/sec    client-cpu    server-cpu
      1             242k         220%           97%
      2             362k         241%          128%
      4             378k         242%          160%
      8             395k         242%          210%
      
  TCP:
  server-thread    ops/sec    client-cpu    server-cpu
      1             185k         224%          100%
      2             435k         479%          200%
      4             780k         731%          400%
      8             938k         800%          659%
   
  As the number of threads increases, SMC-R throughput scales much more
  slowly than TCP.

  Am I doing something wrong? Or does SMC-R only show a significant
  advantage when CPU resources are tight?
  
  Any suggestions are welcome.


Thanks & Regards,
Yacan.



* Re: SMC-R problem under multithread
  2022-05-30  3:16 SMC-R problem under multithread liuyacan
@ 2022-05-30  5:10 ` Tony Lu
  2022-05-30  6:40   ` liuyacan
  0 siblings, 1 reply; 5+ messages in thread
From: Tony Lu @ 2022-05-30  5:10 UTC (permalink / raw)
  To: liuyacan
  Cc: kgraul, davem, edumazet, kuba, pabeni, linux-s390, netdev,
	linux-kernel, ubraun

On Mon, May 30, 2022 at 11:16:04AM +0800, liuyacan@corp.netease.com wrote:
> Hi experts,
> 
>   I recently used memcached to compare the performance of SMC-R with
>   TCP, but the results confuse me. With multiple threads on the server
>   side, SMC-R does not perform as well as TCP.
> 
> [... benchmark setup and result tables snipped; see the message above ...]
> 
>   Am I doing something wrong? Or does SMC-R only show a significant
>   advantage when CPU resources are tight?
> 
>   Any suggestions are welcome.

Hi Yacan,

This result matches some of our scenarios to some extent. Let's talk
about the result first.

Based on your benchmark, the biggest factor affecting performance seems
to be that CPU resources are limited. As the number of threads
increases, neither CPU usage nor throughput improves, and client CPU
usage stays at about 200-250%. To clarify this, could you please provide
more detailed per-CPU metrics (usr / sys / hi / si) and the memcached
process usage?
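
Any per-CPU view is fine (top's per-CPU mode, mpstat, or a throwaway
sampler). For illustration, a minimal sketch of such a sampler over
/proc/stat; this is just my example, not tooling from our tree:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAXCPU 256

struct cpustat {
        unsigned long long usr, nice, sys, idle, iowait, hi, si;
};

static int read_stats(struct cpustat *s)
{
        char line[256];
        int n = 0;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f) && n < MAXCPU) {
                /* keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line */
                if (strncmp(line, "cpu", 3) || line[3] < '0' || line[3] > '9')
                        continue;
                if (sscanf(line + 3, "%*d %llu %llu %llu %llu %llu %llu %llu",
                           &s[n].usr, &s[n].nice, &s[n].sys, &s[n].idle,
                           &s[n].iowait, &s[n].hi, &s[n].si) == 7)
                        n++;
        }
        fclose(f);
        return n;
}

int main(void)
{
        struct cpustat a[MAXCPU], b[MAXCPU];
        int i, n = read_stats(a);

        sleep(1);
        if (n < 0 || read_stats(b) != n)
                return 1;
        /* print per-CPU tick deltas over the one-second window */
        for (i = 0; i < n; i++)
                printf("cpu%-3d usr=%-6llu sys=%-6llu hi=%-6llu si=%llu\n",
                       i, b[i].usr - a[i].usr, b[i].sys - a[i].sys,
                       b[i].hi - a[i].hi, b[i].si - a[i].si);
        return 0;
}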

Secondly, it seems there are a lot of connections in this test. If it
takes too long to establish the connections, or the number of
established connections never reaches the specified value, the result
will be greatly affected. Could you please share the connection counts
during the benchmark?

We have noticed that SMC has some limitations with many threads and many
connections, and this benchmark is basically in line with that scenario.
In brief, there are two aspects:
1. the control path (connection setup and teardown) is not as fast as TCP;
2. the data path (lock contention, CQ spreading, etc.) needs further
   improvement.

Regarding the CPU limitation: SMC uses one CQ, and therefore one core,
to handle data transmission, so the workload cannot be spread over
multiple cores. There is an early temporary solution [1], which still
needs work (a new CQ API, a WR refactor). With this early solution, we
see a several-fold performance improvement.
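
To make the idea concrete, a rough sketch at the verbs level
(illustrative only, not the code in [1]: upstream smc_ib_setup_per_ibdev()
creates a single send/recv CQ pair with comp_vector 0, and the
roce_cqs[] array and function name below are hypothetical):

#include <rdma/ib_verbs.h>

static int smc_ib_alloc_spread_cqs(struct smc_ib_device *smcibdev)
{
        /* one CQ per completion vector, so the interrupts (and the
         * tasklets they schedule) land on different cores */
        int i, nvec = min_t(int, num_online_cpus(),
                            smcibdev->ibdev->num_comp_vectors);

        for (i = 0; i < nvec; i++) {
                struct ib_cq_init_attr cqattr = {
                        .cqe = SMC_MAX_CQE,
                        .comp_vector = i,
                };

                smcibdev->roce_cqs[i] = ib_create_cq(smcibdev->ibdev,
                                                     smc_wr_rx_cq_handler,
                                                     NULL, smcibdev,
                                                     &cqattr);
                if (IS_ERR(smcibdev->roce_cqs[i]))
                        return PTR_ERR(smcibdev->roce_cqs[i]);
        }
        return 0;
}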

Regarding the improvement of connection setup, see [2] for more details.
It is still a proposal that we are working on, and it shows a
considerable performance boost.

[1] https://lore.kernel.org/all/20220126130140.66316-1-tonylu@linux.alibaba.com/
[2] https://lore.kernel.org/all/1653375127-130233-1-git-send-email-alibuda@linux.alibaba.com/

Thanks,
Tony LU


* Re: SMC-R problem under multithread
  2022-05-30  5:10 ` Tony Lu
@ 2022-05-30  6:40   ` liuyacan
  2022-05-30  8:24     ` Tony Lu
  0 siblings, 1 reply; 5+ messages in thread
From: liuyacan @ 2022-05-30  6:40 UTC (permalink / raw)
  To: tonylu
  Cc: davem, edumazet, kgraul, kuba, linux-kernel, linux-s390,
	liuyacan, netdev, pabeni, ubraun


Hi, Tony.

Inline.
 
> Hi Yacan,
> 
> This result matches some of our scenarios to some extent. Let's talk
> about the result first.
> 
> Based on your benchmark, the biggest factor affecting performance
> seems to be that CPU resources are limited. As the number of threads
> increases, neither CPU usage nor throughput improves, and client CPU
> usage stays at about 200-250%. To clarify this, could you please
> provide more detailed per-CPU metrics (usr / sys / hi / si) and the
> memcached process usage?

I now use taskset to pin memcached to CPUs 21-28. The result is as follows:

TCP    1 thread 
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu25 : 14.3 us, 76.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  9.3 si,  0.0 st
%Cpu26 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  1.0 us,  0.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu28 :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
  
SMC-R  1 thread
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.0 us,  2.8 sy,  0.0 ni, 17.2 id,  0.0 wa,  0.0 hi, 79.9 si,  0.0 st
%Cpu25 : 18.9 us, 74.2 sy,  0.0 ni,  7.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  2.9 us,  0.3 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

TCP    2 thread
%Cpu21 : 12.0 us, 81.7 sy,  0.0 ni,  6.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 : 11.0 us, 80.0 sy,  0.0 ni,  9.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  3.0 us, 12.6 sy,  0.0 ni, 84.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu25 :  0.0 us,  0.0 sy,  0.0 ni, 96.5 id,  0.0 wa,  0.0 hi,  3.5 si,  0.0 st
%Cpu26 :  0.0 us,  0.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu27 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu28 :  2.0 us,  0.3 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  4.7 si,  0.0 st
  
SMC-R  2 thread
%Cpu21 :  4.3 us, 18.1 sy,  0.0 ni, 77.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  2.7 us, 20.6 sy,  0.0 ni, 76.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  4.7 us, 28.7 sy,  0.0 ni, 66.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.7 us,  2.3 sy,  0.0 ni, 17.3 id,  0.0 wa,  0.0 hi, 79.7 si,  0.0 st
%Cpu25 :  7.7 us, 23.6 sy,  0.0 ni, 68.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  3.7 us,  8.8 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  1.3 us,  8.6 sy,  0.0 ni, 90.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

TCP    4  thread
%Cpu21 : 10.0 us, 55.3 sy,  0.0 ni, 34.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  8.7 us, 50.5 sy,  0.0 ni, 40.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 11.7 us, 63.7 sy,  0.0 ni, 24.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  3.1 us, 13.9 sy,  0.0 ni, 75.6 id,  0.0 wa,  0.0 hi,  7.5 si,  0.0 st
%Cpu25 :  9.3 us, 30.9 sy,  0.0 ni, 49.8 id,  0.0 wa,  0.0 hi, 10.0 si,  0.0 st
%Cpu26 :  8.5 us, 28.3 sy,  0.0 ni, 56.3 id,  0.0 wa,  0.0 hi,  6.8 si,  0.0 st
%Cpu27 :  4.3 us, 21.4 sy,  0.0 ni, 64.9 id,  0.0 wa,  0.0 hi,  9.4 si,  0.0 st
%Cpu28 : 12.4 us, 48.3 sy,  0.0 ni, 30.5 id,  0.0 wa,  0.0 hi,  8.7 si,  0.0 st

SMC-R  4  thread
%Cpu21 :  6.1 us, 21.4 sy,  0.0 ni, 72.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  5.9 us, 21.8 sy,  0.0 ni, 72.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  6.5 us, 28.1 sy,  0.0 ni, 65.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  4.1 us,  9.3 sy,  0.0 ni,  5.5 id,  0.0 wa,  0.0 hi, 81.0 si,  0.0 st
%Cpu25 :  3.7 us,  8.4 sy,  0.0 ni, 87.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  3.3 us, 10.9 sy,  0.0 ni, 85.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  4.7 us, 11.3 sy,  0.0 ni, 84.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  1.0 us,  4.3 sy,  0.0 ni, 94.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

TCP    8  thread
%Cpu21 : 14.7 us, 63.2 sy,  0.0 ni, 22.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 : 14.6 us, 61.1 sy,  0.0 ni, 24.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 12.9 us, 66.9 sy,  0.0 ni, 20.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 : 15.4 us, 52.1 sy,  0.0 ni, 20.3 id,  0.0 wa,  0.0 hi, 12.2 si,  0.0 st
%Cpu25 : 11.2 us, 52.7 sy,  0.0 ni, 19.7 id,  0.0 wa,  0.0 hi, 16.3 si,  0.0 st
%Cpu26 : 14.3 us, 54.3 sy,  0.0 ni, 20.8 id,  0.0 wa,  0.0 hi, 10.6 si,  0.0 st
%Cpu27 : 12.1 us, 52.8 sy,  0.0 ni, 21.4 id,  0.0 wa,  0.0 hi, 13.8 si,  0.0 st
%Cpu28 : 14.7 us, 49.1 sy,  0.0 ni, 21.2 id,  0.0 wa,  0.0 hi, 15.0 si,  0.0 st

SMC-R  8  thread 
%Cpu21 :  6.3 us, 20.4 sy,  0.0 ni, 73.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  8.3 us, 18.3 sy,  0.0 ni, 73.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  5.1 us, 23.3 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  1.3 us,  3.4 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi, 94.3 si,  0.0 st
%Cpu25 :  6.3 us, 15.6 sy,  0.0 ni, 78.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  6.5 us, 12.7 sy,  0.0 ni, 80.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  7.4 us, 13.5 sy,  0.0 ni, 79.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  5.8 us, 13.3 sy,  0.0 ni, 80.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st


It looks like SMC-R only uses one core for softirq work; I presume this
is the rx/tx tasklet, right?
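
From my reading of net/smc/smc_wr.c, the recv CQ's completion handler
just schedules the tasklet (simplified sketch below), and a tasklet runs
on the CPU that scheduled it. So with a single CQ whose interrupt lands
on one core, all the rx softirq work piles up there (Cpu24 above):

static void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
{
        struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;

        /* tasklet_schedule() queues on the current CPU, i.e. the one
         * that took the CQ interrupt */
        tasklet_schedule(&dev->recv_tasklet);
}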

> Secondly, it seems there are a lot of connections in this test. If it
> takes too long to establish the connections, or the number of
> established connections never reaches the specified value, the result
> will be greatly affected. Could you please share the connection counts
> during the benchmark?

In our environment, the client always uses 50*8 = 400 connections.

> We have noticed that SMC has some limitations with many threads and
> many connections, and this benchmark is basically in line with that
> scenario. In brief, there are two aspects:
> 1. the control path (connection setup and teardown) is not as fast as TCP;
> 2. the data path (lock contention, CQ spreading, etc.) needs further
>    improvement.

SMC-R connection setup being slower than TCP is reasonable and tolerable
for us.

> Regarding the CPU limitation: SMC uses one CQ, and therefore one core,
> to handle data transmission, so the workload cannot be spread over
> multiple cores. There is an early temporary solution [1], which still
> needs work (a new CQ API, a WR refactor). With this early solution, we
> see a several-fold performance improvement.
> 
> Regarding the improvement of connection setup, see [2] for more
> details. It is still a proposal that we are working on, and it shows a
> considerable performance boost.
> 
> [1] https://lore.kernel.org/all/20220126130140.66316-1-tonylu@linux.alibaba.com/
> [2] https://lore.kernel.org/all/1653375127-130233-1-git-send-email-alibuda@linux.alibaba.com/
> 
> Thanks,
> Tony LU
> 
> 

We just noticed the single CQ per device as well. We actually tried
creating more CQs and multiple rx tasklets, but nothing seemed to work;
maybe we got something wrong. We now plan to try [1] first.

Thank you very much for your reply!

[1] https://lore.kernel.org/all/20220126130140.66316-1-tonylu@linux.alibaba.com/

Regards,
Yacan





* Re: SMC-R problem under multithread
  2022-05-30  6:40   ` liuyacan
@ 2022-05-30  8:24     ` Tony Lu
  2022-05-31  7:02       ` liuyacan
  0 siblings, 1 reply; 5+ messages in thread
From: Tony Lu @ 2022-05-30  8:24 UTC (permalink / raw)
  To: liuyacan
  Cc: davem, edumazet, kgraul, kuba, linux-kernel, linux-s390, netdev,
	pabeni, ubraun

On Mon, May 30, 2022 at 02:40:49PM +0800, liuyacan@corp.netease.com wrote:
> [... earlier quotes and per-CPU usage tables snipped; see the messages
>      above ...]
> 
> It looks like SMC-R only uses one core for softirq work; I presume
> this is the rx/tx tasklet, right?

Yes, it only uses one CQ (one CPU core) to handle data via the tasklet,
which is what [1] addresses.
 
> > [... connection-count question snipped ...]
> 
> In our environment, the client always uses 50*8 = 400 connections.

400 connections is not too many. We found some regressions when the
number of connections reaches the thousands.

> 
> > [... SMC limitation list snipped ...]
> 
> SMC-R connection setup being slower than TCP is reasonable and
> tolerable for us.

Connection setup is one of the hardest parts to solve. If that is
acceptable, I think SMC should be suitable for your scenario.

> 
> > [... CQ limitation and connection-setup proposal snipped ...]
> 
> We just noticed the single CQ per device as well. We actually tried
> creating more CQs and multiple rx tasklets, but nothing seemed to
> work; maybe we got something wrong. We now plan to try [1] first.

The key point of patch [1] is to spread the CQ vectors over different
cores. It solves the single-core tasklet issue (the high si on one CPU
core).

Looking forward to your feedback, thanks.
 

[1] https://lore.kernel.org/all/20220126130140.66316-1-tonylu@linux.alibaba.com/

Cheers,
Tony Lu


* Re: SMC-R problem under multithread
  2022-05-30  8:24     ` Tony Lu
@ 2022-05-31  7:02       ` liuyacan
  0 siblings, 0 replies; 5+ messages in thread
From: liuyacan @ 2022-05-31  7:02 UTC (permalink / raw)
  To: tonylu
  Cc: davem, edumazet, kgraul, kuba, linux-kernel, linux-s390,
	liuyacan, netdev, pabeni, ubraun

> [... earlier discussion snipped; see the messages above ...]
> 
> The key point of patch [1] is to spread the CQ vectors over different
> cores. It solves the single-core tasklet issue (the high si on one CPU
> core).
> 
> Looking forward to your feedback, thanks.
> 
> [1] https://lore.kernel.org/all/20220126130140.66316-1-tonylu@linux.alibaba.com/
> 
> Cheers,
> Tony Lu

Hi Tony,

   This patch proved absolutely useful in our environment!
   
    ops/sec      1 thread    2 threads    4 threads    8 threads
    w/o patch      236k         365k         382k         394k
    w/  patch      289k         726k        1061k        1243k
    Ratio          +22%         +98%        +177%        +215%

   Thank you very much.

Regards,
Yacan


