From: Tony Lu <tonylu@linux.alibaba.com>
To: kgraul@linux.ibm.com
Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org,
	linux-s390@vger.kernel.org
Subject: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
Date: Fri, 14 Jan 2022 13:48:46 +0800
Message-ID: <20220114054852.38058-1-tonylu@linux.alibaba.com>

Currently, SMC creates one CQ per IB device and shares this CQ among
all the QPs of its links. This CQ is always bound to the first
completion vector, and the IRQ affinity of that vector ties all
completion handling to a single CPU core.

┌────────┐    ┌──────────────┐   ┌──────────────┐
│ SMC IB │    ├────┐         │   │              │
│ DEVICE │ ┌─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│   ┌────┤ │  ├────┘         │   │              │
│   │ CQ ├─┘  └──────────────┘   └──────────────┘
│   │    ├─┐  ┌──────────────┐   ┌──────────────┐
│   └────┤ │  ├────┐         │   │              │
│        │ └─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │    ├────┘         │   │              │
└────────┘    └──────────────┘   └──────────────┘
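
For reference, this is roughly what the current code does (simplified
from smc_ib_setup_per_ibdev() in net/smc/smc_ib.c): both per-device CQs
are created on completion vector 0, so all completion handling lands on
whichever core that vector's IRQ is bound to.

/* simplified: both per-device CQs share completion vector 0 */
struct ib_cq_init_attr cqattr = {
	.cqe = SMC_MAX_CQE,
	.comp_vector = 0,
};

smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
				      smc_wr_tx_cq_handler, NULL,
				      smcibdev, &cqattr);
smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
				      smc_wr_rx_cq_handler, NULL,
				      smcibdev, &cqattr);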

In this model, when the number of connections exceeds
SMC_RMBS_PER_LGR_MAX, SMC creates additional link groups and the
corresponding QPs. All connections share a limited number of QPs and a
single CQ (for both the send and recv sides). Since one completion
vector is bound to a fixed CPU core, performance is limited by that
single core, especially in large-scale scenarios with many threads and
many connections.

Running an nginx + wrk test with 8 threads and 800 connections on an
8-core host, the softirq load on CPU 0 limits scalability:

04:18:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:18:55 PM  all    5.81    0.00   19.42    0.00    2.94   10.21    0.00    0.00    0.00   61.63
04:18:55 PM    0    0.00    0.00    0.00    0.00   16.80   82.78    0.00    0.00    0.00    0.41
<snip>

Nowadays, RDMA devices provide more than one completion vector: for
example, mlx5 exposes 8 and eRDMA exposes 4 by default. This removes
the single-vector, single-CPU-core limitation.

To enhance scalability and take advantage of multi-core resources, we
can spread CQs across different CPU cores and introduce a more flexible
mapping. This leads to a new model whose main difference is that
multiple CQs are created per IB device, with the maximum number of CQs
limited by the device's capability (num_comp_vectors). With multiple
link groups, each link group's QP can bind to the least used CQ, and
the CQs are bound to different completion vectors and CPU cores. This
spreads the softirq handlers (the WR tx/rx tasklets) across cores.

                        ┌──────────────┐   ┌──────────────┐
┌────────┐  ┌───────┐   ├────┐         │   │              │
│        ├─▶│ CQ 0  ├──▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │  └───────┘   ├────┘         │   │              │
│ SMC IB │  ┌───────┐   └──────────────┘   └──────────────┘
│ DEVICE ├─▶│ CQ 1  │─┐                                    
│        │  └───────┘ │ ┌──────────────┐   ┌──────────────┐
│        │  ┌───────┐ │ ├────┐         │   │              │
│        ├─▶│ CQ n  │ └▶│ QP │ SMC LINK├──▶│SMC Link Group│
└────────┘  └───────┘   ├────┘         │   │              │
                        └──────────────┘   └──────────────┘
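
A minimal sketch of this model follows. It is illustrative only, not
the patch code: the smc_ib_cq structure is reduced to the
load-balancing part, and the cqs array, num_cqs field and
smc_ib_least_used_cq() helper are hypothetical names.

/* one CQ per completion vector, capped by ibdev->num_comp_vectors;
 * a new link's QP is tied to the least used CQ (illustrative only)
 */
struct smc_ib_cq {
	struct ib_cq	*ib_cq;		/* bound to one completion vector */
	int		load;		/* number of QPs using this CQ */
};

static struct smc_ib_cq *smc_ib_least_used_cq(struct smc_ib_device *smcibdev)
{
	struct smc_ib_cq *least = &smcibdev->cqs[0];
	int i;

	for (i = 1; i < smcibdev->num_cqs; i++)
		if (smcibdev->cqs[i].load < least->load)
			least = &smcibdev->cqs[i];
	least->load++;		/* the new link's QP will use this CQ */
	return least;
}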

After spreading the work of the single CQ (4 link groups) over four CPU
cores, the softirq load is distributed across them:

04:26:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:26 PM  all   10.70    0.00   35.80    0.00    7.64   26.62    0.00    0.00    0.00   19.24
04:26:26 PM    0    0.00    0.00    0.00    0.00   16.33   50.00    0.00    0.00    0.00   33.67
04:26:26 PM    1    0.00    0.00    0.00    0.00   15.46   69.07    0.00    0.00    0.00   15.46
04:26:26 PM    2    0.00    0.00    0.00    0.00   13.13   39.39    0.00    0.00    0.00   47.47
04:26:26 PM    3    0.00    0.00    0.00    0.00   13.27   55.10    0.00    0.00    0.00   31.63
<snip>

Here is a benchmark with this patch set (a prototype of the new model):

Test environment:
- CPU: 8-core Intel Xeon Platinum, memory: 32 GiB, NIC: Mellanox CX4.
- nginx + wrk HTTP benchmark.
- nginx: access_log disabled, keepalive_timeout and keepalive_requests
  increased, long-lived connections.
- wrk: 8 threads with 100, 200 and 400 connections.

Benchmark result:

Conns/QPS         100        200        400
w/o patch   338502.49  359216.66  398167.16
w/  patch   677247.40  694193.70  812502.69
Ratio        +100.07%    +93.25%   +104.06%

These prototype patches nearly double the QPS (about a 100% increase).

The benchmarks with 100, 200 and 400 connections use 1, 1 and 2 link
groups respectively. With a single link group, the send/recv work is
spread over two cores; with more than one link group, it is spread over
more cores.

If an application's connection count does not exceed the link group
limit (SMC_RMBS_PER_LGR_MAX, 255), all connections end up in one link
group, so they are confined to few CQs (cores) and compete for
link-level resources such as CDC slots. This patch set therefore
introduces a tunable to lower the hard limit on connections per link
group. Whether that is worthwhile depends on the application, so the
knob is exposed to userspace: users can choose to share link groups to
save resources, or to create more link groups and avoid those
limitations.

Patches 1-4 introduce multiple CQs support.
- Patch 1 spreads the send and recv CQs to two different cores; this
  helps when there are few connections (see the sketch after this
  list).
- Patches 2, 3 and 4 introduce support for multiple CQs, including a
  new medium to tie a link to its ibdev and load balancing across
  completion vectors and CQs.
- The policy for spreading CQs is still being worked out and tested for
  best performance, e.g. splitting recv/send CQs, joining them, or
  binding recv/recv (send/send) CQs to the same vector. I would be glad
  to hear your advice.
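
Here is a sketch of the patch 1 idea, assuming the device exposes at
least two completion vectors; it is not the exact patch code:

/* put send and recv completions on different vectors when possible,
 * so their tasklets can run on different cores (sketch only)
 */
struct ib_cq_init_attr sattr = { .cqe = SMC_MAX_CQE, .comp_vector = 0 };
struct ib_cq_init_attr rattr = { .cqe = SMC_MAX_CQE, .comp_vector = 0 };

if (smcibdev->ibdev->num_comp_vectors > 1)
	rattr.comp_vector = 1;	/* recv CQ on a second vector */

smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
				      smc_wr_tx_cq_handler, NULL,
				      smcibdev, &sattr);
smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
				      smc_wr_rx_cq_handler, NULL,
				      smcibdev, &rattr);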

Patch 5 provides the medium for a userspace control knob.
- It mainly provides two knobs to adjust the buffer sizes of SMC
  sockets. We found that buffers that are too small make SMC wait a
  long time for a free buffer, which limits performance.
- It introduces a sysctl framework just for tuning; netlink would also
  work, but sysctl is easy to compose as a patch and needs no userspace
  example (a minimal sketch follows this list). I would be glad to hear
  your advice on the userspace control interface.
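
As an illustration of the sysctl approach (not the patch itself), a
minimal per-namespace registration could look like the sketch below;
the knob names wmem/rmem and the netns_smc fields are assumptions, and
the per-netns .data fixup is omitted for brevity.

static struct ctl_table smc_table[] = {
	{
		.procname	= "wmem",	/* send buffer size, assumed name */
		.data		= &init_net.smc.sysctl_wmem,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{
		.procname	= "rmem",	/* recv buffer size, assumed name */
		.data		= &init_net.smc.sysctl_rmem,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

int __net_init smc_sysctl_net_init(struct net *net)
{
	/* per-netns copies of the table and .data adjustment left out */
	net->smc.sysctl_hdr = register_net_sysctl(net, "net/smc", smc_table);
	return net->smc.sysctl_hdr ? 0 : -ENOMEM;
}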

Patch 6 introduces a tunable knob to decrease the per-link-group
connection limit, which increases parallelism as mentioned above.
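
Conceptually, the check could look like this sketch; the
sysctl_max_lgr_conns field is a hypothetical name:

/* treat a link group as full once it reaches the (possibly lowered)
 * per-netns limit, so further connections get a new link group
 */
static bool smc_lgr_is_full(struct smc_link_group *lgr, struct net *net)
{
	int limit = net->smc.sysctl_max_lgr_conns ?: SMC_RMBS_PER_LGR_MAX;

	return lgr->conns_num >= min(limit, SMC_RMBS_PER_LGR_MAX);
}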

These patches are still being improved; I would be very glad to hear
your advice.

Thank you.

Tony Lu (6):
  net/smc: Spread CQs to different completion vectors
  net/smc: Prepare for multiple CQs per IB devices
  net/smc: Introduce smc_ib_cq to bind link and cq
  net/smc: Multiple CQs per IB devices
  net/smc: Unbind buffer size from clcsock and make it tunable
  net/smc: Introduce tunable linkgroup max connections

 Documentation/networking/smc-sysctl.rst |  20 +++
 include/net/netns/smc.h                 |   6 +
 net/smc/Makefile                        |   2 +-
 net/smc/af_smc.c                        |  18 ++-
 net/smc/smc_core.c                      |   2 +-
 net/smc/smc_core.h                      |   2 +
 net/smc/smc_ib.c                        | 178 ++++++++++++++++++++----
 net/smc/smc_ib.h                        |  17 ++-
 net/smc/smc_sysctl.c                    |  92 ++++++++++++
 net/smc/smc_sysctl.h                    |  22 +++
 net/smc/smc_wr.c                        |  42 +++---
 11 files changed, 350 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/networking/smc-sysctl.rst
 create mode 100644 net/smc/smc_sysctl.c
 create mode 100644 net/smc/smc_sysctl.h

-- 
2.32.0.3.g01195cf9f

