linux-s390.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
@ 2023-03-27  3:28 Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 1/9] net/smc: Decouple ism_dev from SMC-D device dump Wen Gu
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

Hi, all

# Background

The background and previous discussion can be referred from [1],[6].

We found SMC-D can be used to accelerate OS internal communication, such as
loopback or between two containers within the same OS instance. So this patch
set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
to emulate an ISM device, so that SMC-D can also be used on architectures
other than s390. The SMC-D loopback device are designed as a system global
device, visible to all containers.

This version is implemented based on the generalized interface provided by [2].
And there is an open issue, which will be mentioned later.

# Design

This patch set basically follows the design of the previous version.

Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
generalized code and extract some helpers to make SMC-D protocol compatible
with devices other than s390 ISM device.

Patch #4/9 introduces a kind of loopback device, which is defined as SMC-D v2
device and designed to provide communication between SMC sockets in the same OS
instance.

 +-------------------------------------------+
 |  +--------------+       +--------------+  |
 |  | SMC socket A |       | SMC socket B |  |
 |  +--------------+       +--------------+  |
 |       ^                         ^         |
 |       |    +----------------+   |         |
 |       |    |   SMC stack    |   |         |
 |       +--->| +------------+ |<--|         |
 |            | |   dummy    | |             |
 |            | |   device   | |             |
 |            +-+------------+-+             |
 |                   OS                      |
 +-------------------------------------------+

Patch #5/9 ~ #8/9 expand SMC-D protocol interface (smcd_ops) for scenarios where
SMC-D is used to communicate within VM (loopback here) or between VMs on the same
host (based on virtio-ism device, see [3]). What these scenarios have in common
is that the local sndbuf and peer RMB can be mapped to same physical memory region,
so the data copy between the local sndbuf and peer RMB can be omitted. Performance
improvement brought by this extension can be found in # Benchmark Test.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               ^
       |         +---------+           |
  regard as      |         | ----------|
  local sndbuf   |  B's    |     regard as
       |         |  RMB    |     local RMB
       |-------> |         |
                 +---------+

Patch #9/9 realizes the support of loopback device for the above-mentioned expanded
SMC-D protocol interface.

# Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/RMB size 1MB.

 * Test object:
      - TCP lo: run on TCP loopback.
      - domain: run on UNIX domain.
      - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
      - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.

1. ipc-benchmark (see [4])

 - ./<foo> -c 1000000 -s 100

                    TCP-lo              domain              SMC-lo          SMC-lo-nocpy
Message
rate (msg/s)         79025      115736(+46.45%)    146760(+85.71%)       149800(+89.56%)

2. sockperf

 - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
 - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                    TCP-lo                  SMC-lo             SMC-lo-nocpy
Bandwidth(MBps)   4822.388        4940.918(+2.56%)         8086.67(+67.69%)
Latency(us)          6.298          3.352(-46.78%)            3.35(-46.81%)

3. iperf3

 - serv: <smc_run> taskset -c <cpu> iperf3 -s
 - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15

                    TCP-lo                  SMC-lo             SMC-lo-nocpy
Bitrate(Gb/s)         40.7            40.5(-0.49%)            72.4(+77.89%)

4. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80

                    TCP-lo                  SMC-lo             SMC-lo-nocpy
Requests/s       155994.57      214544.79(+37.53%)       215538.55(+38.17%)


# Open issue

The open issue is about how to detect that the source and target of CLC proposal
are within the same OS instance and can communicate through the SMC-D loopback device.
Similar issue also exists when using virtio-ism devices (the background and details
of virtio-ism device can be referred from [3]). In previous discussions, multiple
options were proposed (see [5]). Thanks again for the help of the community. :)

But as we discussed, these solutions have some imperfection. So this version of RFC
continues to use previous workaround, that is, a 64-bit random GID is generated for
SMC-D loopback device. If the GIDs of the devices found by two peers are the same,
then they are considered to be in the same OS instance and can communicate with each
other by the loopback device.

This approach needs that the loopback device GID is globally unique. But theoretically
there is a possibility of a collision. Assume the following situations:

(1) Assume that the SMC-D loopback devices of the two different OS instances happen
    to generate the same 64-bit GID.

    For the convenience of description, we refer to the sockets on these two
    different OS instance as server A and client B.

    A will misjudge that the two are on the same OS instance because the same GID
    in CLC proposal message. Then A creates its RMB and sends 64-bit token-A to B
    in CLC accept message.

    B receives the CLC accept message. And according to patch #7/9, B tries to
    attach its sndbuf to A's RMB by token-A.

(2) And assume that the OS instance where B is located happens to have an unattached
    RMB whose 64-bit token is same as token-A.

    Then B successfully attaches its sndbuf to the wrong RMB, and creates its RMB,
    sends token-B to A in CLC confirm message.

    Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
    token-B.

(3) Similar to (2), assume that the OS instance where A is located happens to have
    an unattached RMB whose 64-bit token is same as token-B.

    Then A successfully attach its sndbuf to the wrong RMB. Both sides mistakenly
    believe that an SMC-D connection based on the loopback device is established
    between them.

If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
3 times, then an unreachable SMC-D connection will be established, which is nasty.
But if one of above is not satisfied, it will safely fallback to TCP.

Since the chances of these happening are very small, I wonder if this risk of 1/2^(64*3)
probability is acceptable? Can we just use 64-bits random generated number as GID in
loopback device?

Some other ways that may be able to make loopback GID unique are
 1) Using a 128-bit UUID to identify SMC-D loopback device or virtio-ism device, because
    the probability of a 128-bit UUID collision is considered negligible. But it needs
    to extend the CLC message to carry a longer GID.
 2) Using MAC address of netdev in the OS as part of SMC-D loopback device GID, provided
    that the MAC addresses are unique. But the MAC address could theoretically also be
    incorrectly set to be the same.

Hope to hear opinions from the community. Any ideas are welcome.

Thanks!
Wen Gu

v4->v3
 1. Rebase to the latest net-next;
 2. Introduce SEID helper. SMC-D loopback will return SMCD_DEFAULT_V2_SEID. And if it
    coexist with ISM device, the SEID of ISM device will overwrite SMCD_DEFAULT_V2_SEID
    as smc_ism_v2_system_eid.
 3. Won't remove dmb_node from hashtable until no sndbuf attaching to it.

 Something postponed in this version
 1. Hierarchy perference of SMC-D devices when loopback and ISM devices coexist, which
    will be determinated after comparing the performance of loopback and ISM.

v3->v2
 1. Adapt new generalized interface provided by [2];
 2. Select loopback device through SMC-D v2 protocol;
 3. Split the loopback-related implementation and generic implementation into different
    patches more reasonably.

v1->v2
 1. Fix some build WARNINGs complained by kernel test rebot
    Reported-by: kernel test robot <lkp@intel.com>
 2. Add iperf3 test data.

[1] https://lore.kernel.org/netdev/1671506505-104676-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20230123181752.1068-1-jaka@linux.ibm.com/
[3] https://lists.oasis-open.org/archives/virtio-comment/202302/msg00148.html
[4] https://github.com/goldsborough/ipc-bench
[5] https://lore.kernel.org/netdev/b9867c7d-bb2b-16fc-feda-b79579aa833d@linux.ibm.com/
[6] https://lore.kernel.org/netdev/1676477905-88043-1-git-send-email-guwen@linux.alibaba.com/


Wen Gu (9):
  net/smc: Decouple ism_dev from SMC-D device dump
  net/smc: Decouple ism_dev from SMC-D DMB registration
  net/smc: Extract v2 check helper from SMC-D device registration
  net/smc: Introduce SMC-D loopback device
  net/smc: Introduce an interface for getting DMB attribute
  net/smc: Introudce interfaces for DMB attach and detach
  net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D
  net/smc: Modify cursor update logic when using mappable DMB
  net/smc: Add interface implementation of loopback device

 drivers/s390/net/ism_drv.c |   5 +-
 include/net/smc.h          |  18 +-
 net/smc/Makefile           |   2 +-
 net/smc/af_smc.c           |  26 ++-
 net/smc/smc_cdc.c          |  59 ++++--
 net/smc/smc_cdc.h          |   1 +
 net/smc/smc_core.c         |  70 ++++++-
 net/smc/smc_core.h         |   1 +
 net/smc/smc_ism.c          |  99 ++++++++--
 net/smc/smc_ism.h          |   5 +
 net/smc/smc_loopback.c     | 445 +++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h     |  56 ++++++
 12 files changed, 750 insertions(+), 37 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 1/9] net/smc: Decouple ism_dev from SMC-D device dump
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 2/9] net/smc: Decouple ism_dev from SMC-D DMB registration Wen Gu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch helps to decouple SMC-D device and ISM device, allowing
different underlying device forms, such as non-PCI devices.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h | 1 +
 net/smc/smc_ism.c | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/smc.h b/include/net/smc.h
index a002552..963ce9c 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -77,6 +77,7 @@ struct smcd_ops {
 struct smcd_dev {
 	const struct smcd_ops *ops;
 	void *priv;
+	struct device *parent_pci_dev;
 	struct list_head list;
 	spinlock_t lock;
 	struct smc_connection **conn;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index fbee249..4249fb9 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -231,11 +231,9 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
 	struct smc_pci_dev smc_pci_dev;
 	struct nlattr *port_attrs;
 	struct nlattr *attrs;
-	struct ism_dev *ism;
 	int use_cnt = 0;
 	void *nlh;
 
-	ism = smcd->priv;
 	nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
 			  &smc_gen_nl_family, NLM_F_MULTI,
 			  SMC_NETLINK_GET_DEV_SMCD);
@@ -250,7 +248,8 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
 	if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
 		goto errattr;
 	memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
-	smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
+	if (smcd->parent_pci_dev)
+		smc_set_pci_values(to_pci_dev(smcd->parent_pci_dev), &smc_pci_dev);
 	if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
 		goto errattr;
 	if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))
@@ -420,6 +419,7 @@ static void smcd_register_dev(struct ism_dev *ism)
 	if (!smcd)
 		return;
 	smcd->priv = ism;
+	smcd->parent_pci_dev = ism->dev.parent;
 	ism_set_priv(ism, &smc_ism_client, smcd);
 	if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
 		smc_pnetid_by_table_smcd(smcd);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 2/9] net/smc: Decouple ism_dev from SMC-D DMB registration
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 1/9] net/smc: Decouple ism_dev from SMC-D device dump Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 3/9] net/smc: Extract v2 check helper from SMC-D device registration Wen Gu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch tries to decouple ISM device from SMC-D DMB registration,
So that the register_dmb option is not restricted to be used by ISM
device.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 drivers/s390/net/ism_drv.c | 5 +++--
 include/net/smc.h          | 4 ++--
 net/smc/smc_ism.c          | 7 ++-----
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index 8acb9eb..5eeb54d 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -796,9 +796,10 @@ static int smcd_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
 }
 
 static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
-			     struct ism_client *client)
+			     void *client_priv)
 {
-	return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client);
+	return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb,
+				(struct ism_client *)client_priv);
 }
 
 static int smcd_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
diff --git a/include/net/smc.h b/include/net/smc.h
index 963ce9c..26206d2 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -50,13 +50,12 @@ struct smcd_dmb {
 #define ISM_ERROR	0xFFFF
 
 struct smcd_dev;
-struct ism_client;
 
 struct smcd_ops {
 	int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
 				u32 vid);
 	int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
-			    struct ism_client *client);
+			    void *client_priv);
 	int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
 	int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
 	int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
@@ -77,6 +76,7 @@ struct smcd_ops {
 struct smcd_dev {
 	const struct smcd_ops *ops;
 	void *priv;
+	void *client_priv;
 	struct device *parent_pci_dev;
 	struct list_head list;
 	spinlock_t lock;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 4249fb9..d4709ca 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -200,7 +200,6 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
 int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 			 struct smc_buf_desc *dmb_desc)
 {
-#if IS_ENABLED(CONFIG_ISM)
 	struct smcd_dmb dmb;
 	int rc;
 
@@ -209,7 +208,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 	dmb.sba_idx = dmb_desc->sba_idx;
 	dmb.vlan_id = lgr->vlan_id;
 	dmb.rgid = lgr->peer_gid;
-	rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client);
+	rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client_priv);
 	if (!rc) {
 		dmb_desc->sba_idx = dmb.sba_idx;
 		dmb_desc->token = dmb.dmb_tok;
@@ -218,9 +217,6 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 		dmb_desc->len = dmb.dmb_len;
 	}
 	return rc;
-#else
-	return 0;
-#endif
 }
 
 static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
@@ -419,6 +415,7 @@ static void smcd_register_dev(struct ism_dev *ism)
 	if (!smcd)
 		return;
 	smcd->priv = ism;
+	smcd->client_priv = &smc_ism_client;
 	smcd->parent_pci_dev = ism->dev.parent;
 	ism_set_priv(ism, &smc_ism_client, smcd);
 	if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 3/9] net/smc: Extract v2 check helper from SMC-D device registration
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 1/9] net/smc: Decouple ism_dev from SMC-D device dump Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 2/9] net/smc: Decouple ism_dev from SMC-D DMB registration Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 4/9] net/smc: Introduce SMC-D loopback device Wen Gu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch extracts v2-capable logic from the process of registering the
ISM device as an SMC-D device, so that the registration process of other
underlying devices can reuse it.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_ism.c | 27 +++++++++++++++++----------
 net/smc/smc_ism.h |  1 +
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index d4709ca..8ad4c71 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -69,6 +69,22 @@ bool smc_ism_is_v2_capable(void)
 	return smc_ism_v2_capable;
 }
 
+/* must be called under smcd_dev_list.mutex lock */
+void smc_ism_check_v2_capable(struct smcd_dev *smcd)
+{
+	u8 *system_eid = NULL;
+
+	if (!list_empty(&smcd_dev_list.list))
+		return;
+
+	system_eid = smcd->ops->get_system_eid();
+	if (smcd->ops->supports_v2()) {
+		smc_ism_v2_capable = true;
+		memcpy(smc_ism_v2_system_eid, system_eid,
+		       SMC_MAX_EID_LEN);
+	}
+}
+
 /* Set a connection using this DMBE. */
 void smc_ism_set_conn(struct smc_connection *conn)
 {
@@ -422,16 +438,7 @@ static void smcd_register_dev(struct ism_dev *ism)
 		smc_pnetid_by_table_smcd(smcd);
 
 	mutex_lock(&smcd_dev_list.mutex);
-	if (list_empty(&smcd_dev_list.list)) {
-		u8 *system_eid = NULL;
-
-		system_eid = smcd->ops->get_system_eid();
-		if (smcd->ops->supports_v2()) {
-			smc_ism_v2_capable = true;
-			memcpy(smc_ism_v2_system_eid, system_eid,
-			       SMC_MAX_EID_LEN);
-		}
-	}
+	smc_ism_check_v2_capable(smcd);
 	/* sort list: devices without pnetid before devices with pnetid */
 	if (smcd->pnetid[0])
 		list_add_tail(&smcd->list, &smcd_dev_list.list);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 832b2f4..14d2e77 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -42,6 +42,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
 void smc_ism_get_system_eid(u8 **eid);
 u16 smc_ism_get_chid(struct smcd_dev *dev);
 bool smc_ism_is_v2_capable(void);
+void smc_ism_check_v2_capable(struct smcd_dev *dev);
 int smc_ism_init(void);
 void smc_ism_exit(void);
 int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 4/9] net/smc: Introduce SMC-D loopback device
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (2 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 3/9] net/smc: Extract v2 check helper from SMC-D device registration Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 5/9] net/smc: Introduce an interface for getting DMB attribute Wen Gu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch introduces a kind of loopback device for SMC-D, thus
enabling the SMC communication between two local sockets within
one OS instance.

The loopback device supports basic capabilities defined by SMC-D
options, and exposed as an SMC-D v2 device.

The GID of loopback device is random generated, CHID is 0xFFFF
and SEID is SMCD_DEFAULT_V2_SEID.

TODO:
- The hierarchy preference of coexistent SMC-D devices.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h      |   6 +
 net/smc/Makefile       |   2 +-
 net/smc/af_smc.c       |  12 +-
 net/smc/smc_cdc.c      |   9 +-
 net/smc/smc_cdc.h      |   1 +
 net/smc/smc_ism.c      |  20 +++
 net/smc/smc_ism.h      |   1 +
 net/smc/smc_loopback.c | 360 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h |  51 +++++++
 9 files changed, 459 insertions(+), 3 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

diff --git a/include/net/smc.h b/include/net/smc.h
index 26206d2..021ca42 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -41,6 +41,12 @@ struct smcd_dmb {
 	dma_addr_t dma_addr;
 };
 
+struct smcd_seid {
+	u8 seid_string[24];
+	u8 serial_number[4];
+	u8 type[4];
+};
+
 #define ISM_EVENT_DMB	0
 #define ISM_EVENT_GID	1
 #define ISM_EVENT_SWR	2
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd..a8c3711 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
 smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
 smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c6b4a62..c91600a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -53,6 +53,7 @@
 #include "smc_stats.h"
 #include "smc_tracepoint.h"
 #include "smc_sysctl.h"
+#include "smc_loopback.h"
 
 static DEFINE_MUTEX(smc_server_lgr_pending);	/* serialize link group
 						 * creation on server
@@ -3471,15 +3472,23 @@ static int __init smc_init(void)
 		goto out_sock;
 	}
 
+	rc = smc_loopback_init();
+	if (rc) {
+		pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+		goto out_ib;
+	}
+
 	rc = tcp_register_ulp(&smc_ulp_ops);
 	if (rc) {
 		pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
-		goto out_ib;
+		goto out_lo;
 	}
 
 	static_branch_enable(&tcp_have_smc);
 	return 0;
 
+out_lo:
+	smc_loopback_exit();
 out_ib:
 	smc_ib_unregister_client();
 out_sock:
@@ -3517,6 +3526,7 @@ static void __exit smc_exit(void)
 	tcp_unregister_ulp(&smc_ulp_ops);
 	sock_unregister(PF_SMC);
 	smc_core_exit();
+	smc_loopback_exit();
 	smc_ib_unregister_client();
 	smc_ism_exit();
 	destroy_workqueue(smc_close_wq);
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 89105e9..2f79bac 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -410,7 +410,14 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
  */
 static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
 {
-	struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
+	struct smc_connection *conn =
+		from_tasklet(conn, t, rx_tsklet);
+
+	smcd_cdc_rx_handler(conn);
+}
+
+void smcd_cdc_rx_handler(struct smc_connection *conn)
+{
 	struct smcd_cdc_msg *data_cdc;
 	struct smcd_cdc_msg cdc;
 	struct smc_sock *smc;
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11..11559d4 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
 				 struct smc_wr_buf *wr_buf);
 int smc_cdc_init(void) __init;
 void smcd_cdc_rx_init(struct smc_connection *conn);
+void smcd_cdc_rx_handler(struct smc_connection *conn);
 
 #endif /* SMC_CDC_H */
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 8ad4c71..f638999 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -24,6 +24,19 @@ struct smcd_dev_list smcd_dev_list = {
 	.mutex = __MUTEX_INITIALIZER(smcd_dev_list.mutex)
 };
 
+/* s390 ISMv2 device creates seid by s390 cpu_id,
+ * but extension SMCv2 device, like loopback, uses
+ * SMCD_DEFAULT_V2_SEID as seid.
+ *
+ * If ISMv2 device and extension SMCv2 device coexist,
+ * ISMv2's seid will overwrite SMCD_DEFAULT_V2_SEID.
+ */
+struct smcd_seid SMCD_DEFAULT_V2_SEID = {
+	.seid_string = "IBM-DEF-ISMSEID000000000",
+	.serial_number = "1000",
+	.type = "1000",
+};
+
 static bool smc_ism_v2_capable;
 static u8 smc_ism_v2_system_eid[SMC_MAX_EID_LEN];
 
@@ -79,6 +92,13 @@ void smc_ism_check_v2_capable(struct smcd_dev *smcd)
 
 	system_eid = smcd->ops->get_system_eid();
 	if (smcd->ops->supports_v2()) {
+		if (smc_ism_v2_capable &&
+		    memcmp(smc_ism_v2_system_eid, &SMCD_DEFAULT_V2_SEID,
+			   SMC_MAX_EID_LEN))
+			/* ISMv2 device already set the seid */
+			return;
+
+		/* set the initial seid or overwrite the default seid */
 		smc_ism_v2_capable = true;
 		memcpy(smc_ism_v2_system_eid, system_eid,
 		       SMC_MAX_EID_LEN);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 14d2e77..870ff7b 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -21,6 +21,7 @@ struct smcd_dev_list {	/* List of SMCD devices */
 };
 
 extern struct smcd_dev_list	smcd_dev_list;	/* list of smcd devices */
+extern struct smcd_seid		SMCD_DEFAULT_V2_SEID; /* default v2 SEID */
 
 struct smc_ism_vlanid {			/* VLAN id set on ISM device */
 	struct list_head list;
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 0000000..6ac5727
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,360 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  Shared Memory Communications Direct over loopback device.
+ *
+ *  Provide a SMC-D loopback dummy device.
+ *
+ *  Copyright (c) 2022, Alibaba Inc.
+ *
+ *  Author: Wen Gu <guwen@linux.alibaba.com>
+ *          Tony Lu <tonylu@linux.alibaba.com>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_cdc.h"
+#include "smc_ism.h"
+#include "smc_loopback.h"
+
+#define SMC_LO_SUPPORTS_V2 0x1 /* SMC-D loopback supports SMC-Dv2 */
+
+static struct smc_lo_dev *lo_dev;
+static const char smc_lo_dev_name[] = "smcd_loopback_dev";
+
+static inline void smc_lo_gen_id(struct smc_lo_dev *ldev)
+{
+	/* TODO: ensure local_gid is unique.
+	 */
+	get_random_bytes(&ldev->local_gid, sizeof(ldev->local_gid));
+	ldev->chid = SMC_LO_CHID;
+}
+
+static int smc_lo_query_rgid(struct smcd_dev *smcd, u64 rgid,
+			     u32 vid_valid, u32 vid)
+{
+	struct smc_lo_dev *ldev = smcd->priv;
+
+	/* rgid should equal to lgid in loopback */
+	if (!ldev || rgid != ldev->local_gid)
+		return -ENETUNREACH;
+	return 0;
+}
+
+static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
+			       void *client_priv)
+{
+	struct smc_lo_dev *ldev = smcd->priv;
+	struct smc_lo_dmb_node *dmb_node;
+	int sba_idx, rc;
+
+	/* check space for new dmb */
+	for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LODEV_MAX_DMBS) {
+		if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+			break;
+	}
+	if (sba_idx == SMC_LODEV_MAX_DMBS)
+		return -ENOSPC;
+
+	dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+	if (!dmb_node) {
+		rc = -ENOMEM;
+		goto err_bit;
+	}
+
+	dmb_node->sba_idx = sba_idx;
+	dmb_node->cpu_addr = kzalloc(dmb->dmb_len, GFP_KERNEL |
+			     __GFP_NOWARN | __GFP_NORETRY |
+			     __GFP_NOMEMALLOC);
+	if (!dmb_node->cpu_addr) {
+		rc = -ENOMEM;
+		goto err_node;
+	}
+	dmb_node->len = dmb->dmb_len;
+	dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
+
+	/* TODO: token is random but not exclusive !
+	 * suppose to find token in dmb hask table, if has this token
+	 * already, then generate another one.
+	 */
+	/* add new dmb into hash table */
+	get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+	write_lock(&ldev->dmb_ht_lock);
+	hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+	write_unlock(&ldev->dmb_ht_lock);
+
+	dmb->sba_idx = dmb_node->sba_idx;
+	dmb->dmb_tok = dmb_node->token;
+	dmb->cpu_addr = dmb_node->cpu_addr;
+	dmb->dma_addr = dmb_node->dma_addr;
+	dmb->dmb_len = dmb_node->len;
+
+	return 0;
+
+err_node:
+	kfree(dmb_node);
+err_bit:
+	clear_bit(sba_idx, ldev->sba_idx_mask);
+	return rc;
+}
+
+static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct smc_lo_dev *ldev = smcd->priv;
+
+	/* remove dmb from hash table */
+	write_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+		if (tmp_node->token == dmb->dmb_tok) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		write_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	hash_del(&dmb_node->list);
+	write_unlock(&ldev->dmb_ht_lock);
+
+	clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+	kfree(dmb_node->cpu_addr);
+	kfree(dmb_node);
+
+	return 0;
+}
+
+static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	return -EOPNOTSUPP;
+}
+
+static int smc_lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	return -EOPNOTSUPP;
+}
+
+static int smc_lo_set_vlan_required(struct smcd_dev *smcd)
+{
+	return -EOPNOTSUPP;
+}
+
+static int smc_lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+	return -EOPNOTSUPP;
+}
+
+static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
+			    bool sf, unsigned int offset, void *data,
+			    unsigned int size)
+{
+	struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
+	struct smc_lo_dev *ldev = smcd->priv;
+
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+		if (tmp_node->token == dmb_tok) {
+			rmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!rmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	read_unlock(&ldev->dmb_ht_lock);
+
+	memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+	if (sf) {
+		struct smc_connection *conn =
+			smcd->conn[rmb_node->sba_idx];
+
+		if (conn && !conn->killed)
+			smcd_cdc_rx_handler(conn);
+	}
+	return 0;
+}
+
+static int smc_lo_supports_v2(void)
+{
+	return SMC_LO_SUPPORTS_V2;
+}
+
+static u8 *smc_lo_get_system_eid(void)
+{
+	return SMCD_DEFAULT_V2_SEID.seid_string;
+}
+
+static u64 smc_lo_get_local_gid(struct smcd_dev *smcd)
+{
+	return ((struct smc_lo_dev *)smcd->priv)->local_gid;
+}
+
+static u16 smc_lo_get_chid(struct smcd_dev *smcd)
+{
+	return ((struct smc_lo_dev *)smcd->priv)->chid;
+}
+
+static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
+{
+	return &((struct smc_lo_dev *)smcd->priv)->dev;
+}
+
+static const struct smcd_ops lo_ops = {
+	.query_remote_gid = smc_lo_query_rgid,
+	.register_dmb = smc_lo_register_dmb,
+	.unregister_dmb = smc_lo_unregister_dmb,
+	.add_vlan_id = smc_lo_add_vlan_id,
+	.del_vlan_id = smc_lo_del_vlan_id,
+	.set_vlan_required = smc_lo_set_vlan_required,
+	.reset_vlan_required = smc_lo_reset_vlan_required,
+	.signal_event = NULL,
+	.move_data = smc_lo_move_data,
+	.supports_v2 = smc_lo_supports_v2,
+	.get_system_eid = smc_lo_get_system_eid,
+	.get_local_gid = smc_lo_get_local_gid,
+	.get_chid = smc_lo_get_chid,
+	.get_dev = smc_lo_get_dev,
+};
+
+static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
+					  int max_dmbs)
+{
+	struct smcd_dev *smcd;
+
+	smcd = kzalloc(sizeof(*smcd), GFP_KERNEL);
+	if (!smcd)
+		return NULL;
+
+	smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *),
+			     GFP_KERNEL);
+	if (!smcd->conn)
+		goto out_smcd;
+
+	smcd->ops = ops;
+
+	spin_lock_init(&smcd->lock);
+	spin_lock_init(&smcd->lgr_lock);
+	INIT_LIST_HEAD(&smcd->vlan);
+	INIT_LIST_HEAD(&smcd->lgr_list);
+	init_waitqueue_head(&smcd->lgrs_deleted);
+	return smcd;
+
+out_smcd:
+	kfree(smcd);
+	return NULL;
+}
+
+static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
+{
+	struct smcd_dev *smcd;
+
+	smcd = smcd_lo_alloc_dev(&lo_ops, SMC_LODEV_MAX_DMBS);
+	if (!smcd)
+		return -ENOMEM;
+
+	ldev->smcd = smcd;
+	smcd->priv = ldev;
+	smcd->parent_pci_dev = NULL;
+	mutex_lock(&smcd_dev_list.mutex);
+	smc_ism_check_v2_capable(smcd);
+	list_add(&smcd->list, &smcd_dev_list.list);
+	mutex_unlock(&smcd_dev_list.mutex);
+	pr_warn_ratelimited("smc: adding smcd device %s with pnetid %.16s%s\n",
+			    smc_lo_dev_name, smcd->pnetid,
+			    smcd->pnetid_by_user ? " (user defined)" : "");
+	return 0;
+}
+
+static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
+{
+	struct smcd_dev *smcd = ldev->smcd;
+
+	pr_warn_ratelimited("smc: removing smcd device %s\n",
+			    smc_lo_dev_name);
+	smcd->going_away = 1;
+	smc_smcd_terminate_all(smcd);
+	mutex_lock(&smcd_dev_list.mutex);
+	list_del_init(&smcd->list);
+	mutex_unlock(&smcd_dev_list.mutex);
+}
+
+static void smc_lo_dev_release(struct device *dev)
+{
+	struct smc_lo_dev *ldev =
+		container_of(dev, struct smc_lo_dev, dev);
+	struct smcd_dev *smcd = ldev->smcd;
+
+	kfree(smcd->conn);
+	kfree(smcd);
+	kfree(ldev);
+}
+
+static int smc_lo_dev_init(struct smc_lo_dev *ldev)
+{
+	smc_lo_gen_id(ldev);
+	rwlock_init(&ldev->dmb_ht_lock);
+	hash_init(ldev->dmb_ht);
+
+	return smcd_lo_register_dev(ldev);
+}
+
+static int smc_lo_dev_probe(void)
+{
+	struct smc_lo_dev *ldev;
+	int ret;
+
+	ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+	if (!ldev)
+		return -ENOMEM;
+
+	ldev->dev.parent = NULL;
+	ldev->dev.release = smc_lo_dev_release;
+	device_initialize(&ldev->dev);
+	dev_set_name(&ldev->dev, smc_lo_dev_name);
+	ret = device_add(&ldev->dev);
+	if (ret)
+		goto free_dev;
+
+	ret = smc_lo_dev_init(ldev);
+	if (ret)
+		goto put_dev;
+
+	lo_dev = ldev; /* global loopback device */
+	return 0;
+
+put_dev:
+	device_del(&ldev->dev);
+free_dev:
+	kfree(ldev);
+	return ret;
+}
+
+static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
+{
+	smcd_lo_unregister_dev(ldev);
+}
+
+static void smc_lo_dev_remove(void)
+{
+	if (!lo_dev)
+		return;
+
+	smc_lo_dev_exit(lo_dev);
+	device_del(&lo_dev->dev); /* device_add in smc_lo_dev_probe */
+	put_device(&lo_dev->dev); /* device_initialize in smc_lo_dev_probe */
+}
+
+int smc_loopback_init(void)
+{
+	return smc_lo_dev_probe();
+}
+
+void smc_loopback_exit(void)
+{
+	smc_lo_dev_remove();
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 0000000..9d34aba
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  Shared Memory Communications Direct over loopback device.
+ *
+ *  Provide a SMC-D loopback dummy device.
+ *
+ *  Copyright (c) 2022, Alibaba Inc.
+ *
+ *  Author: Wen Gu <guwen@linux.alibaba.com>
+ *          Tony Lu <tonylu@linux.alibaba.com>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/types.h>
+#include <linux/interrupt.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#include "smc_core.h"
+
+#define SMC_LO_CHID 0xFFFF
+#define SMC_LODEV_MAX_DMBS 5000
+#define SMC_LODEV_MAX_DMBS_BUCKETS 16
+
+struct smc_lo_dmb_node {
+	struct hlist_node list;
+	u64 token;
+	u32 len;
+	u32 sba_idx;
+	void *cpu_addr;
+	dma_addr_t dma_addr;
+};
+
+struct smc_lo_dev {
+	struct smcd_dev *smcd;
+	struct device dev;
+	u16 chid;
+	u64 local_gid;
+	DECLARE_BITMAP(sba_idx_mask, SMC_LODEV_MAX_DMBS);
+	rwlock_t dmb_ht_lock;
+	DECLARE_HASHTABLE(dmb_ht, SMC_LODEV_MAX_DMBS_BUCKETS);
+};
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 5/9] net/smc: Introduce an interface for getting DMB attribute
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (3 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 4/9] net/smc: Introduce SMC-D loopback device Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 6/9] net/smc: Introudce interfaces for DMB attach and detach Wen Gu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

On s390, since all OSs run on a kind of machine level hypervisor which
is a partitioning hypervisor without paging, the sndbufs and DMBs in
such case are unable to be mapped to the same physical memory.

However, in other scene, such as communication within the same OS instance
(loopback) or between guests of a paging hypervisor (see [1]), eg. KVM,
the sndbufs and DMBs can be mapped to the same physical memory to avoid
memory copy from sndbufs to DMBs.

So this patch introduces an interface to smcd_ops for users to judge
whether DMB-map is available. And for reuse, the interface is designed
to return DMB attribute, not only mappability.

[1] https://lists.oasis-open.org/archives/virtio-comment/202302/msg00148.html

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h | 5 +++++
 net/smc/smc_ism.c | 8 ++++++++
 net/smc/smc_ism.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 021ca42..e39ac41 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -55,6 +55,10 @@ struct smcd_seid {
 
 #define ISM_ERROR	0xFFFF
 
+enum {
+	ISM_DMB_MAPPABLE = 0,
+};
+
 struct smcd_dev;
 
 struct smcd_ops {
@@ -77,6 +81,7 @@ struct smcd_ops {
 	u64 (*get_local_gid)(struct smcd_dev *dev);
 	u16 (*get_chid)(struct smcd_dev *dev);
 	struct device* (*get_dev)(struct smcd_dev *dev);
+	int (*get_dev_dmb_attr)(struct smcd_dev *dev);
 };
 
 struct smcd_dev {
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index f638999..a21c867 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -233,6 +233,14 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
 	return rc;
 }
 
+bool smc_ism_dmb_mappable(struct smcd_dev *smcd)
+{
+	if (smcd->ops->get_dev_dmb_attr &&
+	    (smcd->ops->get_dev_dmb_attr(smcd) & (1 << ISM_DMB_MAPPABLE)))
+		return true;
+	return false;
+}
+
 int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 			 struct smc_buf_desc *dmb_desc)
 {
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 870ff7b..0ddaa45 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -39,6 +39,7 @@ struct smc_ism_vlanid {			/* VLAN id set on ISM device */
 int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
 			 struct smc_buf_desc *dmb_desc);
 int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+bool smc_ism_dmb_mappable(struct smcd_dev *smcd);
 int smc_ism_signal_shutdown(struct smc_link_group *lgr);
 void smc_ism_get_system_eid(u8 **eid);
 u16 smc_ism_get_chid(struct smcd_dev *dev);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 6/9] net/smc: Introudce interfaces for DMB attach and detach
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (4 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 5/9] net/smc: Introduce an interface for getting DMB attribute Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 7/9] net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D Wen Gu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch extends smcd_ops, adding two more semantic for SMC-D DMB:

- attach_dmb:
  Attach an already registered DMB to a specific buf_desc, so that we
  can refer to the DMB through this buf_desc.

- detach_dmb:
  Reverse operation of attach_dmb. detach the DMB from the buf_desc.

This interface extension is to prepare for the avoidance of memory
copy from sndbuf to RMB with SMC-D device whose DMBs has ISM_DMB_MAPPABLE
attribute.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h |  2 ++
 net/smc/smc_ism.c | 31 +++++++++++++++++++++++++++++++
 net/smc/smc_ism.h |  2 ++
 3 files changed, 35 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index e39ac41..0132522 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -67,6 +67,8 @@ struct smcd_ops {
 	int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
 			    void *client_priv);
 	int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*detach_dmb)(struct smcd_dev *dev, u64 token);
 	int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
 	int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
 	int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index a21c867..012df1d 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -263,6 +263,37 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 	return rc;
 }
 
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+		       struct smc_buf_desc *dmb_desc)
+{
+	struct smcd_dmb dmb;
+	int rc = 0;
+
+	memset(&dmb, 0, sizeof(dmb));
+	dmb.dmb_tok = token;
+
+	if (!dev->ops->attach_dmb)
+		return -EINVAL;
+
+	rc = dev->ops->attach_dmb(dev, &dmb);
+	if (!rc) {
+		dmb_desc->sba_idx = dmb.sba_idx;
+		dmb_desc->token = dmb.dmb_tok;
+		dmb_desc->cpu_addr = dmb.cpu_addr;
+		dmb_desc->dma_addr = dmb.dma_addr;
+		dmb_desc->len = dmb.dmb_len;
+	}
+	return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+	if (!dev->ops->detach_dmb)
+		return -EINVAL;
+
+	return dev->ops->detach_dmb(dev, token);
+}
+
 static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
 				  struct sk_buff *skb,
 				  struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 0ddaa45..d0ab025 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -40,6 +40,8 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
 			 struct smc_buf_desc *dmb_desc);
 int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
 bool smc_ism_dmb_mappable(struct smcd_dev *smcd);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token, struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
 int smc_ism_signal_shutdown(struct smc_link_group *lgr);
 void smc_ism_get_system_eid(u8 **eid);
 u16 smc_ism_get_chid(struct smcd_dev *dev);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 7/9] net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (5 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 6/9] net/smc: Introudce interfaces for DMB attach and detach Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 8/9] net/smc: Modify cursor update logic when using mappable DMB Wen Gu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch aims to avoid data copy from local sndbuf to peer RMB by
attaching local sndbuf to peer RMB when DMBs have ISM_DMB_MAPPABLE
attribute.

After this, local sndbuf and peer RMB share the same physical memory.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               ^
       |         +---------+           |
  regard as      |         | ----------|
  local sndbuf   |  B's    |     regard as
       |         |  RMB    |     local RMB
       |-------> |         |
                 +---------+

1. Actions on local RMB.

     a. Create or reuse RMB when connection is created;
     b. Unuse RMB when connection is freed;
     c. Free RMB when link group is freed;

2. Actions on local sndbuf.

     a. Attach local sndbuf to peer RMB by the rtoken exchanged through
        CLC message. Since then, accessing local sndbuf is equivalent to
        accessing peer RMB
     b. sndbuf_desc is exclusive to specific connection and won't be
        added to lgr buffer pool for reuse.
     c. Local sndbuf is detached from peer RMB and freed when connection
        is freed.

Therefore, the data written to local sndbuf will directly reach peer RMB.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/af_smc.c   | 14 +++++++++++
 net/smc/smc_core.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc_core.h |  1 +
 3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c91600a..212d1b1 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1378,6 +1378,12 @@ static int smc_connect_ism(struct smc_sock *smc,
 	}
 
 	smc_conn_save_peer_info(smc, aclc);
+
+	if (smc_ism_dmb_mappable(smc->conn.lgr->smcd)) {
+		rc = smcd_buf_attach(smc);
+		if (rc)
+			goto connect_abort;
+	}
 	smc_close_init(smc);
 	smc_rx_init(smc);
 	smc_tx_init(smc);
@@ -2436,6 +2442,14 @@ static void smc_listen_work(struct work_struct *work)
 		mutex_unlock(&smc_server_lgr_pending);
 	}
 	smc_conn_save_peer_info(new_smc, cclc);
+
+	if (ini->is_smcd &&
+	    smc_ism_dmb_mappable(new_smc->conn.lgr->smcd)) {
+		rc = smcd_buf_attach(new_smc);
+		if (rc)
+			goto out_decl;
+	}
+
 	smc_listen_out_connected(new_smc);
 	SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
 	goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 4543567..0fa26cc 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1130,6 +1130,20 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 	}
 }
 
+static void smcd_buf_detach(struct smc_connection *conn)
+{
+	struct smcd_dev *smcd = conn->lgr->smcd;
+	u64 peer_token = conn->peer_token;
+
+	if (!conn->sndbuf_desc)
+		return;
+
+	smc_ism_detach_dmb(smcd, peer_token);
+
+	kfree(conn->sndbuf_desc);
+	conn->sndbuf_desc = NULL;
+}
+
 static void smc_buf_unuse(struct smc_connection *conn,
 			  struct smc_link_group *lgr)
 {
@@ -1174,6 +1188,10 @@ void smc_conn_free(struct smc_connection *conn)
 		if (!list_empty(&lgr->list))
 			smc_ism_unset_conn(conn);
 		tasklet_kill(&conn->rx_tsklet);
+
+		/* detach sndbuf from peer RMB */
+		if (smc_ism_dmb_mappable(lgr->smcd))
+			smcd_buf_detach(conn);
 	} else {
 		smc_cdc_wait_pend_tx_wr(conn);
 		if (current_work() != &conn->abort_work)
@@ -2425,15 +2443,23 @@ void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn)
  */
 int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 {
+	bool sndbuf_created = false;
 	int rc;
 
+	if (is_smcd &&
+	    smc_ism_dmb_mappable(smc->conn.lgr->smcd))
+		goto create_rmb;
+
 	/* create send buffer */
 	rc = __smc_buf_create(smc, is_smcd, false);
 	if (rc)
 		return rc;
+	sndbuf_created = true;
+
+create_rmb:
 	/* create rmb */
 	rc = __smc_buf_create(smc, is_smcd, true);
-	if (rc) {
+	if (rc && sndbuf_created) {
 		down_write(&smc->conn.lgr->sndbufs_lock);
 		list_del(&smc->conn.sndbuf_desc->list);
 		up_write(&smc->conn.lgr->sndbufs_lock);
@@ -2443,6 +2469,48 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 	return rc;
 }
 
+int smcd_buf_attach(struct smc_sock *smc)
+{
+	struct smc_connection *conn = &smc->conn;
+	struct smcd_dev *smcd = conn->lgr->smcd;
+	u64 peer_token = conn->peer_token;
+	struct smc_buf_desc *buf_desc;
+	int rc;
+
+	buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+	if (!buf_desc)
+		return -ENOMEM;
+
+	/* map local sndbuf desc to peer RMB, so operations on local
+	 * sndbuf are equivalent to operations on peer RMB.
+	 */
+	rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+	if (rc) {
+		rc = SMC_CLC_DECL_MEM;
+		goto free;
+	}
+
+	smc->sk.sk_sndbuf = buf_desc->len;
+	buf_desc->cpu_addr = (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+	buf_desc->len -=  sizeof(struct smcd_cdc_msg);
+	conn->sndbuf_desc = buf_desc;
+	conn->sndbuf_desc->used = 1;
+	atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+	return 0;
+
+free:
+	if (conn->rmb_desc) {
+		/* free local RMB as well */
+		down_write(&conn->lgr->rmbs_lock);
+		list_del(&conn->rmb_desc->list);
+		up_write(&conn->lgr->rmbs_lock);
+		smc_buf_free(conn->lgr, true, conn->rmb_desc);
+		conn->rmb_desc = NULL;
+	}
+	kfree(buf_desc);
+	return rc;
+}
+
 static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
 {
 	int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 1645fba..e52cf70 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -524,6 +524,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid,
 void smc_smcd_terminate_all(struct smcd_dev *dev);
 void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
 int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_sock *smc);
 int smc_uncompress_bufsize(u8 compressed);
 int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
 			    struct smc_clc_msg_accept_confirm *clc);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 8/9] net/smc: Modify cursor update logic when using mappable DMB
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (6 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 7/9] net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-03-27  3:28 ` [RFC PATCH net-next v4 9/9] net/smc: Add interface implementation of loopback device Wen Gu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

Since local sndbuf shares the same physical memory region with peer
RMB when using mappable DMBs, the cursor update logic needs to be
adapted.

The main concern is to ensure that the data written by local to this
memory region won't overwrite the data that has not been consumed by
the peer.

So in this scene, the fin_curs and sndbuf_space that were originally
updated when sending out CDC message are not updated until the cons_curs
update from the peer is received.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_cdc.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 2f79bac..915b8e7 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -18,6 +18,7 @@
 #include "smc_tx.h"
 #include "smc_rx.h"
 #include "smc_close.h"
+#include "smc_ism.h"
 
 /********************************** send *************************************/
 
@@ -256,17 +257,24 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
 		return rc;
 	smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
 	conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
-	/* Calculate transmitted data and increment free send buffer space */
-	diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
-			     &conn->tx_curs_sent);
-	/* increased by confirmed number of bytes */
-	smp_mb__before_atomic();
-	atomic_add(diff, &conn->sndbuf_space);
-	/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
-	smp_mb__after_atomic();
-	smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+	if (!smc_ism_dmb_mappable(conn->lgr->smcd)) {
+		/* If local sndbuf has been mapped to peer RMB, then
+		 * don't update the tx_curs_fin and sndbuf_space until
+		 * peer has consumed the data in RMB.
+		 */
 
-	smc_tx_sndbuf_nonfull(smc);
+		/* Calculate transmitted data and increment free send buffer space */
+		diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+				     &conn->tx_curs_sent);
+		/* increased by confirmed number of bytes */
+		smp_mb__before_atomic();
+		atomic_add(diff, &conn->sndbuf_space);
+		/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+		smp_mb__after_atomic();
+		smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+
+		smc_tx_sndbuf_nonfull(smc);
+	}
 	return rc;
 }
 
@@ -324,7 +332,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 {
 	union smc_host_cursor cons_old, prod_old;
 	struct smc_connection *conn = &smc->conn;
-	int diff_cons, diff_prod;
+	int diff_cons, diff_prod, diff_tx;
 
 	smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
 	smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -340,6 +348,26 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 		atomic_add(diff_cons, &conn->peer_rmbe_space);
 		/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
 		smp_mb__after_atomic();
+
+		if (conn->lgr->is_smcd &&
+		    smc_ism_dmb_mappable(conn->lgr->smcd)) {
+			/* If local sndbuf has been mapped to peer RMB, then
+			 * update tx_curs_fin and sndbuf_space when peer has
+			 * consumed the data in it's RMB.
+			 */
+
+			/* calculate peer rmb consumed data */
+			diff_tx = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+						&conn->local_rx_ctrl.cons);
+			/* increase local sndbuf space and fin_curs */
+			smp_mb__before_atomic();
+			atomic_add(diff_tx, &conn->sndbuf_space);
+			/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+			smp_mb__after_atomic();
+			smc_curs_copy(&conn->tx_curs_fin, &conn->local_rx_ctrl.cons, conn);
+
+			smc_tx_sndbuf_nonfull(smc);
+		}
 	}
 
 	diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next v4 9/9] net/smc: Add interface implementation of loopback device
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (7 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 8/9] net/smc: Modify cursor update logic when using mappable DMB Wen Gu
@ 2023-03-27  3:28 ` Wen Gu
  2023-04-05 14:48 ` [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wenjia Zhang
  2023-04-05 17:04 ` Niklas Schnelle
  10 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-03-27  3:28 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

This patch completes the specific implementation of loopback device
for the newly added SMC-D DMB-related interface.

The loopback device always provides mappable DMB because the device
users are in the same OS instance.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_loopback.c | 101 +++++++++++++++++++++++++++++++++++++++++++++----
 net/smc/smc_loopback.h |   5 +++
 2 files changed, 98 insertions(+), 8 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 6ac5727..2e35cb5 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -74,6 +74,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
 	}
 	dmb_node->len = dmb->dmb_len;
 	dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
+	refcount_set(&dmb_node->refcnt, 1);
 
 	/* TODO: token is random but not exclusive !
 	 * suppose to find token in dmb hask table, if has this token
@@ -84,6 +85,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
 	write_lock(&ldev->dmb_ht_lock);
 	hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
 	write_unlock(&ldev->dmb_ht_lock);
+	atomic_inc(&ldev->dmb_cnt);
 
 	dmb->sba_idx = dmb_node->sba_idx;
 	dmb->dmb_tok = dmb_node->token;
@@ -105,11 +107,12 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
 	struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
 	struct smc_lo_dev *ldev = smcd->priv;
 
-	/* remove dmb from hash table */
+	/* find dmb from hash table */
 	write_lock(&ldev->dmb_ht_lock);
 	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
 		if (tmp_node->token == dmb->dmb_tok) {
 			dmb_node = tmp_node;
+			dmb_node->freeing = 1;
 			break;
 		}
 	}
@@ -117,16 +120,85 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
 		write_unlock(&ldev->dmb_ht_lock);
 		return -EINVAL;
 	}
+	write_unlock(&ldev->dmb_ht_lock);
+
+	/* wait for dmb refcnt to be 0 */
+	if (!refcount_dec_and_test(&dmb_node->refcnt))
+		wait_event(ldev->dmbs_release, !refcount_read(&dmb_node->refcnt));
+
+	/* remove dmb from hash table */
+	write_lock(&ldev->dmb_ht_lock);
 	hash_del(&dmb_node->list);
 	write_unlock(&ldev->dmb_ht_lock);
 
 	clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+
 	kfree(dmb_node->cpu_addr);
 	kfree(dmb_node);
 
+	if (atomic_dec_and_test(&ldev->dmb_cnt))
+		wake_up(&ldev->ldev_release);
 	return 0;
 }
 
+static int smc_lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct smc_lo_dev *ldev = smcd->priv;
+
+	/* find dmb_node according to dmb->dmb_tok */
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+		if (tmp_node->token == dmb->dmb_tok && !tmp_node->freeing) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	refcount_inc(&dmb_node->refcnt);
+	read_unlock(&ldev->dmb_ht_lock);
+
+	/* provide dmb information */
+	dmb->sba_idx = dmb_node->sba_idx;
+	dmb->dmb_tok = dmb_node->token;
+	dmb->cpu_addr = dmb_node->cpu_addr;
+	dmb->dma_addr = dmb_node->dma_addr;
+	dmb->dmb_len = dmb_node->len;
+	return 0;
+}
+
+static int smc_lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+	struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct smc_lo_dev *ldev = smcd->priv;
+
+	/* find dmb_node according to dmb->dmb_tok */
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+		if (tmp_node->token == token) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	read_unlock(&ldev->dmb_ht_lock);
+
+	if (refcount_dec_and_test(&dmb_node->refcnt))
+		wake_up_all(&ldev->dmbs_release);
+	return 0;
+}
+
+static int smc_lo_get_dev_dmb_attr(struct smcd_dev *smcd)
+{
+	return (1 << ISM_DMB_MAPPABLE);
+}
+
 static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
 {
 	return -EOPNOTSUPP;
@@ -153,7 +225,15 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx
 {
 	struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
 	struct smc_lo_dev *ldev = smcd->priv;
-
+	struct smc_connection *conn;
+
+	if (!sf) {
+		/* local sndbuf shares the same physical memory with
+		 * peer RMB, so no need to copy data from local sndbuf
+		 * to peer RMB.
+		 */
+		return 0;
+	}
 	read_lock(&ldev->dmb_ht_lock);
 	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
 		if (tmp_node->token == dmb_tok) {
@@ -169,13 +249,10 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx
 
 	memcpy((char *)rmb_node->cpu_addr + offset, data, size);
 
-	if (sf) {
-		struct smc_connection *conn =
-			smcd->conn[rmb_node->sba_idx];
+	conn = smcd->conn[rmb_node->sba_idx];
+	if (conn && !conn->killed)
+		smcd_cdc_rx_handler(conn);
 
-		if (conn && !conn->killed)
-			smcd_cdc_rx_handler(conn);
-	}
 	return 0;
 }
 
@@ -208,6 +285,8 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
 	.query_remote_gid = smc_lo_query_rgid,
 	.register_dmb = smc_lo_register_dmb,
 	.unregister_dmb = smc_lo_unregister_dmb,
+	.attach_dmb = smc_lo_attach_dmb,
+	.detach_dmb = smc_lo_detach_dmb,
 	.add_vlan_id = smc_lo_add_vlan_id,
 	.del_vlan_id = smc_lo_del_vlan_id,
 	.set_vlan_required = smc_lo_set_vlan_required,
@@ -219,6 +298,7 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
 	.get_local_gid = smc_lo_get_local_gid,
 	.get_chid = smc_lo_get_chid,
 	.get_dev = smc_lo_get_dev,
+	.get_dev_dmb_attr = smc_lo_get_dev_dmb_attr,
 };
 
 static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -299,6 +379,9 @@ static int smc_lo_dev_init(struct smc_lo_dev *ldev)
 	smc_lo_gen_id(ldev);
 	rwlock_init(&ldev->dmb_ht_lock);
 	hash_init(ldev->dmb_ht);
+	atomic_set(&ldev->dmb_cnt, 0);
+	init_waitqueue_head(&ldev->dmbs_release);
+	init_waitqueue_head(&ldev->ldev_release);
 
 	return smcd_lo_register_dev(ldev);
 }
@@ -337,6 +420,8 @@ static int smc_lo_dev_probe(void)
 static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
 {
 	smcd_lo_unregister_dev(ldev);
+	if (atomic_read(&ldev->dmb_cnt))
+		wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
 }
 
 static void smc_lo_dev_remove(void)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 9d34aba..e0bf044 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -33,6 +33,8 @@ struct smc_lo_dmb_node {
 	u32 sba_idx;
 	void *cpu_addr;
 	dma_addr_t dma_addr;
+	refcount_t refcnt;
+	u8 freeing : 1;
 };
 
 struct smc_lo_dev {
@@ -43,6 +45,9 @@ struct smc_lo_dev {
 	DECLARE_BITMAP(sba_idx_mask, SMC_LODEV_MAX_DMBS);
 	rwlock_t dmb_ht_lock;
 	DECLARE_HASHTABLE(dmb_ht, SMC_LODEV_MAX_DMBS_BUCKETS);
+	atomic_t dmb_cnt;
+	wait_queue_head_t dmbs_release;
+	wait_queue_head_t ldev_release;
 };
 
 int smc_loopback_init(void);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (8 preceding siblings ...)
  2023-03-27  3:28 ` [RFC PATCH net-next v4 9/9] net/smc: Add interface implementation of loopback device Wen Gu
@ 2023-04-05 14:48 ` Wenjia Zhang
  2023-04-10 14:31   ` Wen Gu
  2023-04-05 17:04 ` Niklas Schnelle
  10 siblings, 1 reply; 17+ messages in thread
From: Wenjia Zhang @ 2023-04-05 14:48 UTC (permalink / raw)
  To: Wen Gu, kgraul, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel



On 27.03.23 05:28, Wen Gu wrote:
> Hi, all
> 
> # Background
> 
> The background and previous discussion can be referred from [1],[6].
> 
> We found SMC-D can be used to accelerate OS internal communication, such as
> loopback or between two containers within the same OS instance. So this patch
> set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
> to emulate an ISM device, so that SMC-D can also be used on architectures
> other than s390. The SMC-D loopback device are designed as a system global
> device, visible to all containers.
> 
> This version is implemented based on the generalized interface provided by [2].
> And there is an open issue, which will be mentioned later.
> 
> # Design
> 
> This patch set basically follows the design of the previous version.
> 
> Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
> generalized code and extract some helpers to make SMC-D protocol compatible
> with devices other than s390 ISM device.
> 
> Patch #4/9 introduces a kind of loopback device, which is defined as SMC-D v2
> device and designed to provide communication between SMC sockets in the same OS
> instance.
> 
>   +-------------------------------------------+
>   |  +--------------+       +--------------+  |
>   |  | SMC socket A |       | SMC socket B |  |
>   |  +--------------+       +--------------+  |
>   |       ^                         ^         |
>   |       |    +----------------+   |         |
>   |       |    |   SMC stack    |   |         |
>   |       +--->| +------------+ |<--|         |
>   |            | |   dummy    | |             |
>   |            | |   device   | |             |
>   |            +-+------------+-+             |
>   |                   OS                      |
>   +-------------------------------------------+
> 
> Patch #5/9 ~ #8/9 expand SMC-D protocol interface (smcd_ops) for scenarios where
> SMC-D is used to communicate within VM (loopback here) or between VMs on the same
> host (based on virtio-ism device, see [3]). What these scenarios have in common
> is that the local sndbuf and peer RMB can be mapped to same physical memory region,
> so the data copy between the local sndbuf and peer RMB can be omitted. Performance
> improvement brought by this extension can be found in # Benchmark Test.
> 
>   +----------+                     +----------+
>   | socket A |                     | socket B |
>   +----------+                     +----------+
>         |                               ^
>         |         +---------+           |
>    regard as      |         | ----------|
>    local sndbuf   |  B's    |     regard as
>         |         |  RMB    |     local RMB
>         |-------> |         |
>                   +---------+
> 
> Patch #9/9 realizes the support of loopback device for the above-mentioned expanded
> SMC-D protocol interface.
> 
> # Benchmark Test
> 
>   * Test environments:
>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>        - SMC sndbuf/RMB size 1MB.
> 
>   * Test object:
>        - TCP lo: run on TCP loopback.
>        - domain: run on UNIX domain.
>        - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
>        - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.
> 
> 1. ipc-benchmark (see [4])
> 
>   - ./<foo> -c 1000000 -s 100
> 
>                      TCP-lo              domain              SMC-lo          SMC-lo-nocpy
> Message
> rate (msg/s)         79025      115736(+46.45%)    146760(+85.71%)       149800(+89.56%)
> 
> 2. sockperf
> 
>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
> 
>                      TCP-lo                  SMC-lo             SMC-lo-nocpy
> Bandwidth(MBps)   4822.388        4940.918(+2.56%)         8086.67(+67.69%)
> Latency(us)          6.298          3.352(-46.78%)            3.35(-46.81%)
> 
> 3. iperf3
> 
>   - serv: <smc_run> taskset -c <cpu> iperf3 -s
>   - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
> 
>                      TCP-lo                  SMC-lo             SMC-lo-nocpy
> Bitrate(Gb/s)         40.7            40.5(-0.49%)            72.4(+77.89%)
> 
> 4. nginx/wrk
> 
>   - serv: <smc_run> nginx
>   - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
> 
>                      TCP-lo                  SMC-lo             SMC-lo-nocpy
> Requests/s       155994.57      214544.79(+37.53%)       215538.55(+38.17%)
> 
> 
> # Open issue
> 
> The open issue is about how to detect that the source and target of CLC proposal
> are within the same OS instance and can communicate through the SMC-D loopback device.
> Similar issue also exists when using virtio-ism devices (the background and details
> of virtio-ism device can be referred from [3]). In previous discussions, multiple
> options were proposed (see [5]). Thanks again for the help of the community. :)
> 
> But as we discussed, these solutions have some imperfection. So this version of RFC
> continues to use previous workaround, that is, a 64-bit random GID is generated for
> SMC-D loopback device. If the GIDs of the devices found by two peers are the same,
> then they are considered to be in the same OS instance and can communicate with each
> other by the loopback device.
> 
> This approach needs that the loopback device GID is globally unique. But theoretically
> there is a possibility of a collision. Assume the following situations:
> 
> (1) Assume that the SMC-D loopback devices of the two different OS instances happen
>      to generate the same 64-bit GID.
> 
>      For the convenience of description, we refer to the sockets on these two
>      different OS instance as server A and client B.
> 
>      A will misjudge that the two are on the same OS instance because the same GID
>      in CLC proposal message. Then A creates its RMB and sends 64-bit token-A to B
>      in CLC accept message.
> 
>      B receives the CLC accept message. And according to patch #7/9, B tries to
>      attach its sndbuf to A's RMB by token-A.
> 
> (2) And assume that the OS instance where B is located happens to have an unattached
>      RMB whose 64-bit token is same as token-A.
> 
>      Then B successfully attaches its sndbuf to the wrong RMB, and creates its RMB,
>      sends token-B to A in CLC confirm message.
> 
>      Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
>      token-B.
> 
> (3) Similar to (2), assume that the OS instance where A is located happens to have
>      an unattached RMB whose 64-bit token is same as token-B.
> 
>      Then A successfully attach its sndbuf to the wrong RMB. Both sides mistakenly
>      believe that an SMC-D connection based on the loopback device is established
>      between them.
> 
> If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
> 3 times, then an unreachable SMC-D connection will be established, which is nasty.
> But if one of above is not satisfied, it will safely fallback to TCP.
> 
> Since the chances of these happening are very small, I wonder if this risk of 1/2^(64*3)
> probability is acceptable? Can we just use 64-bits random generated number as GID in
> loopback device?
> 
> Some other ways that may be able to make loopback GID unique are
>   1) Using a 128-bit UUID to identify SMC-D loopback device or virtio-ism device, because
>      the probability of a 128-bit UUID collision is considered negligible. But it needs
>      to extend the CLC message to carry a longer GID.
>   2) Using MAC address of netdev in the OS as part of SMC-D loopback device GID, provided
>      that the MAC addresses are unique. But the MAC address could theoretically also be
>      incorrectly set to be the same.
> 
> Hope to hear opinions from the community. Any ideas are welcome.
> 
> Thanks!
> Wen Gu

Hi Wen,

Thank you for the new version. The discussion on the open issue is still 
on-going in our organisation internally. I appreciate your patience!

One thing I need to mention during testing the loopback device on our 
platform is that we get crash, because smc_ism-signal_shutdown() is 
called by smc_1gr_free_work(), which is called indirectly by 
smc_conn_free(). Please make sure that it would go to the path of the 
loopback device cleanly. Any question and consideration is welcome!

Thanks,
Wenjia
> 
> v4->v3
>   1. Rebase to the latest net-next;
>   2. Introduce SEID helper. SMC-D loopback will return SMCD_DEFAULT_V2_SEID. And if it
>      coexist with ISM device, the SEID of ISM device will overwrite SMCD_DEFAULT_V2_SEID
>      as smc_ism_v2_system_eid.
>   3. Won't remove dmb_node from hashtable until no sndbuf attaching to it.
> 
>   Something postponed in this version
>   1. Hierarchy perference of SMC-D devices when loopback and ISM devices coexist, which
>      will be determinated after comparing the performance of loopback and ISM.
> 
> v3->v2
>   1. Adapt new generalized interface provided by [2];
>   2. Select loopback device through SMC-D v2 protocol;
>   3. Split the loopback-related implementation and generic implementation into different
>      patches more reasonably.
> 
> v1->v2
>   1. Fix some build WARNINGs complained by kernel test rebot
>      Reported-by: kernel test robot <lkp@intel.com>
>   2. Add iperf3 test data.
> 
> [1] https://lore.kernel.org/netdev/1671506505-104676-1-git-send-email-guwen@linux.alibaba.com/
> [2] https://lore.kernel.org/netdev/20230123181752.1068-1-jaka@linux.ibm.com/
> [3] https://lists.oasis-open.org/archives/virtio-comment/202302/msg00148.html
> [4] https://github.com/goldsborough/ipc-bench
> [5] https://lore.kernel.org/netdev/b9867c7d-bb2b-16fc-feda-b79579aa833d@linux.ibm.com/
> [6] https://lore.kernel.org/netdev/1676477905-88043-1-git-send-email-guwen@linux.alibaba.com/
> 
> 
> Wen Gu (9):
>    net/smc: Decouple ism_dev from SMC-D device dump
>    net/smc: Decouple ism_dev from SMC-D DMB registration
>    net/smc: Extract v2 check helper from SMC-D device registration
>    net/smc: Introduce SMC-D loopback device
>    net/smc: Introduce an interface for getting DMB attribute
>    net/smc: Introudce interfaces for DMB attach and detach
>    net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D
>    net/smc: Modify cursor update logic when using mappable DMB
>    net/smc: Add interface implementation of loopback device
> 
>   drivers/s390/net/ism_drv.c |   5 +-
>   include/net/smc.h          |  18 +-
>   net/smc/Makefile           |   2 +-
>   net/smc/af_smc.c           |  26 ++-
>   net/smc/smc_cdc.c          |  59 ++++--
>   net/smc/smc_cdc.h          |   1 +
>   net/smc/smc_core.c         |  70 ++++++-
>   net/smc/smc_core.h         |   1 +
>   net/smc/smc_ism.c          |  99 ++++++++--
>   net/smc/smc_ism.h          |   5 +
>   net/smc/smc_loopback.c     | 445 +++++++++++++++++++++++++++++++++++++++++++++
>   net/smc/smc_loopback.h     |  56 ++++++
>   12 files changed, 750 insertions(+), 37 deletions(-)
>   create mode 100644 net/smc/smc_loopback.c
>   create mode 100644 net/smc/smc_loopback.h
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
                   ` (9 preceding siblings ...)
  2023-04-05 14:48 ` [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wenjia Zhang
@ 2023-04-05 17:04 ` Niklas Schnelle
  2023-04-06 11:14   ` Alexandra Winter
  2023-04-10 14:30   ` Wen Gu
  10 siblings, 2 replies; 17+ messages in thread
From: Niklas Schnelle @ 2023-04-05 17:04 UTC (permalink / raw)
  To: Wen Gu, kgraul, wenjia, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

On Mon, 2023-03-27 at 11:28 +0800, Wen Gu wrote:
> Hi, all
> 
> # Background
> 
> The background and previous discussion can be referred from [1],[6].
> 
> We found SMC-D can be used to accelerate OS internal communication, such as
> loopback or between two containers within the same OS instance. So this patch
> set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
> to emulate an ISM device, so that SMC-D can also be used on architectures
> other than s390. The SMC-D loopback device are designed as a system global
> device, visible to all containers.
> 
> This version is implemented based on the generalized interface provided by [2].
> And there is an open issue, which will be mentioned later.
> 
> # Design
> 
> This patch set basically follows the design of the previous version.
> 
> Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
> generalized code and extract some helpers to make SMC-D protocol compatible
> with devices other than s390 ISM device.
> 
> Patch #4/9 introduces a kind of loopback device, which is defined as SMC-D v2
> device and designed to provide communication between SMC sockets in the same OS
> instance.
> 
>  +-------------------------------------------+
>  |  +--------------+       +--------------+  |
>  |  | SMC socket A |       | SMC socket B |  |
>  |  +--------------+       +--------------+  |
>  |       ^                         ^         |
>  |       |    +----------------+   |         |
>  |       |    |   SMC stack    |   |         |
>  |       +--->| +------------+ |<--|         |
>  |            | |   dummy    | |             |
>  |            | |   device   | |             |
>  |            +-+------------+-+             |
>  |                   OS                      |
>  +-------------------------------------------+
> 
> Patch #5/9 ~ #8/9 expand SMC-D protocol interface (smcd_ops) for scenarios where
> SMC-D is used to communicate within VM (loopback here) or between VMs on the same
> host (based on virtio-ism device, see [3]). What these scenarios have in common
> is that the local sndbuf and peer RMB can be mapped to same physical memory region,
> so the data copy between the local sndbuf and peer RMB can be omitted. Performance
> improvement brought by this extension can be found in # Benchmark Test.
> 
>  +----------+                     +----------+
>  | socket A |                     | socket B |
>  +----------+                     +----------+
>        |                               ^
>        |         +---------+           |
>   regard as      |         | ----------|
>   local sndbuf   |  B's    |     regard as
>        |         |  RMB    |     local RMB
>        |-------> |         |
>                  +---------+
> 
> Patch #9/9 realizes the support of loopback device for the above-mentioned expanded
> SMC-D protocol interface.
> 
> # Benchmark Test
> 
>  * Test environments:
>       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>       - SMC sndbuf/RMB size 1MB.
> 
>  * Test object:
>       - TCP lo: run on TCP loopback.
>       - domain: run on UNIX domain.
>       - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
>       - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.
> 
> 1. ipc-benchmark (see [4])
> 
>  - ./<foo> -c 1000000 -s 100
> 
>                     TCP-lo              domain              SMC-lo          SMC-lo-nocpy
> Message
> rate (msg/s)         79025      115736(+46.45%)    146760(+85.71%)       149800(+89.56%)
> 
> 2. sockperf
> 
>  - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>  - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
> 
>                     TCP-lo                  SMC-lo             SMC-lo-nocpy
> Bandwidth(MBps)   4822.388        4940.918(+2.56%)         8086.67(+67.69%)
> Latency(us)          6.298          3.352(-46.78%)            3.35(-46.81%)
> 
> 3. iperf3
> 
>  - serv: <smc_run> taskset -c <cpu> iperf3 -s
>  - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
> 
>                     TCP-lo                  SMC-lo             SMC-lo-nocpy
> Bitrate(Gb/s)         40.7            40.5(-0.49%)            72.4(+77.89%)
> 
> 4. nginx/wrk
> 
>  - serv: <smc_run> nginx
>  - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
> 
>                     TCP-lo                  SMC-lo             SMC-lo-nocpy
> Requests/s       155994.57      214544.79(+37.53%)       215538.55(+38.17%)
> 
> 
> # Open issue
> 
> The open issue is about how to detect that the source and target of CLC proposal
> are within the same OS instance and can communicate through the SMC-D loopback device.
> Similar issue also exists when using virtio-ism devices (the background and details
> of virtio-ism device can be referred from [3]). In previous discussions, multiple
> options were proposed (see [5]). Thanks again for the help of the community. :)
> 
> But as we discussed, these solutions have some imperfection. So this version of RFC
> continues to use previous workaround, that is, a 64-bit random GID is generated for
> SMC-D loopback device. If the GIDs of the devices found by two peers are the same,
> then they are considered to be in the same OS instance and can communicate with each
> other by the loopback device.
> 
> This approach needs that the loopback device GID is globally unique. But theoretically
> there is a possibility of a collision. Assume the following situations:
> 
> (1) Assume that the SMC-D loopback devices of the two different OS instances happen
>     to generate the same 64-bit GID.
> 
>     For the convenience of description, we refer to the sockets on these two
>     different OS instance as server A and client B.
> 
>     A will misjudge that the two are on the same OS instance because the same GID
>     in CLC proposal message. Then A creates its RMB and sends 64-bit token-A to B
>     in CLC accept message.
> 
>     B receives the CLC accept message. And according to patch #7/9, B tries to
>     attach its sndbuf to A's RMB by token-A.
> 
> (2) And assume that the OS instance where B is located happens to have an unattached
>     RMB whose 64-bit token is same as token-A.
> 
>     Then B successfully attaches its sndbuf to the wrong RMB, and creates its RMB,
>     sends token-B to A in CLC confirm message.
> 
>     Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
>     token-B.
> 
> (3) Similar to (2), assume that the OS instance where A is located happens to have
>     an unattached RMB whose 64-bit token is same as token-B.
> 
>     Then A successfully attach its sndbuf to the wrong RMB. Both sides mistakenly
>     believe that an SMC-D connection based on the loopback device is established
>     between them.
> 
> If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
> 3 times, then an unreachable SMC-D connection will be established, which is nasty.
> But if one of above is not satisfied, it will safely fallback to TCP.
> 
> Since the chances of these happening are very small, I wonder if this risk of 1/2^(64*3)
> probability is acceptable? Can we just use 64-bits random generated number as GID in
> loopback device?

Let me just spell out some details here to make sure we're all on the
same page.

You're assuming that GIDs are generated randomly at cryptographic
quality. In the code I can see that you use get_random_bytes() which as
its comment explains supplies the same quality randomness as
/dev/urandom so on modern kernels that should provide cryptographic
quality randomness and be fine. Might be something to keep in mind for
backports though.

The fixed CHID of 0xFFFF makes sure this system identity confusion can
only occur between SMC-D loopback (and possibly virtio-ism?) never with
ISM based SMC-D or SMC-R as these never use this CHID value. Correct?

Now for the collision scenario above. As I understand it the
probability of the case where fallback does *not* occur is equivalent
to a 128 bit hash collision. Basically the random 64 bit GID_A
concatenated with the 64 bit DMB Token_A needs to just happen to match
the concatenation of the random 64 bit GID_B with DMB Token_B. With
that interpretation we can consult Wikipedia[0] for a nice table of how
many random GID+DMB Token choices are needed for a certain collision
probability. For 128 bits at least 8.2×10^11 tries would be needed just
to reach a 10^-15 collision probability. Considering the collision does
not only need to exist between two systems but these also need to try
to communicate with each other and happen to use the colliding DMBs for
things to get into the broken fallback case I think from a theoretical
point of view this sounds like neglible risk to me.

That said I'm more worried about the fallback to TCP being broken due
to a code bug once the GIDs do match which is already extremely
unlikely and thus not naturally tested in the wild. Do we have a plan
how to keep testing that fallback scenario somehow. Maybe with a
selftest or something? 

If we can solve the testing part then I'm personally in favor of this
approach of going with cryptograhically random GID and DMB token. It's
simple and doesn't depend on external factors and doesn't need a
protocol extension except for possibly reserving CHID 0xFFFF.

One more question though, what about the SEID why does that have to be
fixed and at least partially match what ISM devices use? I think I'm
missing some SMC protocol/design detail here. I'm guessing this would
require a protocol change?

Thanks,
Niklas

[0] https://en.wikipedia.org/wiki/Birthday_attack



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-04-05 17:04 ` Niklas Schnelle
@ 2023-04-06 11:14   ` Alexandra Winter
  2023-04-06 14:27     ` Niklas Schnelle
  2023-04-10 14:30   ` Wen Gu
  1 sibling, 1 reply; 17+ messages in thread
From: Alexandra Winter @ 2023-04-06 11:14 UTC (permalink / raw)
  To: Niklas Schnelle, Wen Gu, kgraul, wenjia, jaka, davem, edumazet,
	kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel



On 05.04.23 19:04, Niklas Schnelle wrote:
> One more question though, what about the SEID why does that have to be
> fixed and at least partially match what ISM devices use? I think I'm
> missing some SMC protocol/design detail here. I'm guessing this would
> require a protocol change?
> 
> Thanks,
> Niklas

Niklas,
in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system)
and up to 8 proposals for SMC-D interfaces.
Wen's current proposal assumes that smc-d loopback can be one of these 8 proposed interfaces,
iiuc. So on s390 the proposal can contain ISM devices and a smc-d loopback device at the same time.
If one of the peers is e.g. an older Linux version, it will just ignore the loopback-device
in the list (Don't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible.
Therefor it is important that the SEID is used in the same way as it is today in the handshake.

If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is
required/wanted, then it is a new game and we can come up with new identifiers, but we may
lose compatibility to backlevel systems.

Alexandra

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-04-06 11:14   ` Alexandra Winter
@ 2023-04-06 14:27     ` Niklas Schnelle
  2023-04-10 14:30       ` Wen Gu
  0 siblings, 1 reply; 17+ messages in thread
From: Niklas Schnelle @ 2023-04-06 14:27 UTC (permalink / raw)
  To: Alexandra Winter, Wen Gu, kgraul, wenjia, jaka, davem, edumazet,
	kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

On Thu, 2023-04-06 at 13:14 +0200, Alexandra Winter wrote:
> 
> On 05.04.23 19:04, Niklas Schnelle wrote:
> > One more question though, what about the SEID why does that have to be
> > fixed and at least partially match what ISM devices use? I think I'm
> > missing some SMC protocol/design detail here. I'm guessing this would
> > require a protocol change?
> > 
> > Thanks,
> > Niklas
> 
> Niklas,
> in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system)
> and up to 8 proposals for SMC-D interfaces.
> Wen's current proposal assumes that smc-d loopback can be one of these 8 proposed interfaces,
> iiuc. So on s390 the proposal can contain ISM devices and a smc-d loopback device at the same time.
> If one of the peers is e.g. an older Linux version, it will just ignore the loopback-device
> in the list (Don't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible.
> Therefor it is important that the SEID is used in the same way as it is today in the handshake.
> 
> If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is
> required/wanted, then it is a new game and we can come up with new identifiers, but we may
> lose compatibility to backlevel systems.
> 
> Alexandra

Ok that makes sense to me. I was looking at the code in patch 4 of this
series and there it looks to me like SMC-D loopback as implemented
would always use the newly added SMCD_DEFAULT_V2_SEID might have
misread it though. From your description I think that would be wrong,
if a SEID is defined as on s390 it should use that SEID in the CLC for
all SMC variants. Similarly on other architectures it should use the
same SEID for SMC-D as for SMC-R, right? Also with partially match I
was actually wrong the SMCD_DEFAULT_V2_SEID.seid_string starts with
"IBM-DEF-ISMSEID…" while on s390's existing ISM we use "IBM-SYSZ-
ISMSEID…" so if SMC-D loopback correctly uses the shared SEID on s390
we can already only get GID.DMB collisions only on the same mainframe.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-04-05 17:04 ` Niklas Schnelle
  2023-04-06 11:14   ` Alexandra Winter
@ 2023-04-10 14:30   ` Wen Gu
  1 sibling, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-04-10 14:30 UTC (permalink / raw)
  To: Niklas Schnelle, kgraul, wenjia, jaka, wintera, davem, edumazet,
	kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

Hi Niklas,

On 2023/4/6 01:04, Niklas Schnelle wrote:

> 
> Let me just spell out some details here to make sure we're all on the
> same page.
> 
> You're assuming that GIDs are generated randomly at cryptographic
> quality. In the code I can see that you use get_random_bytes() which as
> its comment explains supplies the same quality randomness as
> /dev/urandom so on modern kernels that should provide cryptographic
> quality randomness and be fine. Might be something to keep in mind for
> backports though.
> 
> The fixed CHID of 0xFFFF makes sure this system identity confusion can
> only occur between SMC-D loopback (and possibly virtio-ism?) never with
> ISM based SMC-D or SMC-R as these never use this CHID value. Correct?

Yes, CHID of 0xFFFF used for SMC-D loopback ensures the GID collision
won't involve ISM based SMC-D or SMC-R.

> 
> Now for the collision scenario above. As I understand it the
> probability of the case where fallback does *not* occur is equivalent
> to a 128 bit hash collision. Basically the random 64 bit GID_A
> concatenated with the 64 bit DMB Token_A needs to just happen to match
> the concatenation of the random 64 bit GID_B with DMB Token_B.

Yes, almost like this.

A very little correction: Token_A happens to match a DMB token in B's
kernel (not necessary Token_B) and Token_B happens to match a DMB token
in A's kernel (not necessary Token_A).

  With
> that interpretation we can consult Wikipedia[0] for a nice table of how
> many random GID+DMB Token choices are needed for a certain collision
> probability. For 128 bits at least 8.2×10^11 tries would be needed just
> to reach a 10^-15 collision probability. Considering the collision does
> not only need to exist between two systems but these also need to try
> to communicate with each other and happen to use the colliding DMBs for
> things to get into the broken fallback case I think from a theoretical
> point of view this sounds like neglible risk to me.
> 
Thanks for the reference data.

> That said I'm more worried about the fallback to TCP being broken due
> to a code bug once the GIDs do match which is already extremely
> unlikely and thus not naturally tested in the wild. Do we have a plan
> how to keep testing that fallback scenario somehow. Maybe with a
> selftest or something?
> 

IIUC, you are worried about the code implementation of fallback when GID
collides but DMB token check works? If so, I think we can provide a way
to set loopback device's GID manually, so that we can inject GID collision
fault to test the code.

> If we can solve the testing part then I'm personally in favor of this
> approach of going with cryptograhically random GID and DMB token. It's
> simple and doesn't depend on external factors and doesn't need a
> protocol extension except for possibly reserving CHID 0xFFFF.
> 
> One more question though, what about the SEID why does that have to be
> fixed and at least partially match what ISM devices use? I think I'm
> missing some SMC protocol/design detail here. I'm guessing this would
> require a protocol change?

SEID related topic will be replied in the next e-mail.
> 
> Thanks,
> Niklas
> 
> [0] https://en.wikipedia.org/wiki/Birthday_attack
> 

Thanks!
Wen Gu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-04-06 14:27     ` Niklas Schnelle
@ 2023-04-10 14:30       ` Wen Gu
  0 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-04-10 14:30 UTC (permalink / raw)
  To: Niklas Schnelle, Alexandra Winter, kgraul, wenjia, jaka, davem,
	edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel



On 2023/4/6 22:27, Niklas Schnelle wrote:

> On Thu, 2023-04-06 at 13:14 +0200, Alexandra Winter wrote:
>>
>> On 05.04.23 19:04, Niklas Schnelle wrote:
>>> One more question though, what about the SEID why does that have to be
>>> fixed and at least partially match what ISM devices use? I think I'm
>>> missing some SMC protocol/design detail here. I'm guessing this would
>>> require a protocol change?
>>>
>>> Thanks,
>>> Niklas
>>
>> Niklas,
>> in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system)
>> and up to 8 proposals for SMC-D interfaces.
>> Wen's current proposal assumes that smc-d loopback can be one of these 8 proposed interfaces,
>> iiuc. So on s390 the proposal can contain ISM devices and a smc-d loopback device at the same time.
>> If one of the peers is e.g. an older Linux version, it will just ignore the loopback-device
>> in the list (Don't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible.
>> Therefor it is important that the SEID is used in the same way as it is today in the handshake.
>>
>> If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is
>> required/wanted, then it is a new game and we can come up with new identifiers, but we may
>> lose compatibility to backlevel systems.
>>
>> Alexandra
> 
> Ok that makes sense to me. I was looking at the code in patch 4 of this
> series and there it looks to me like SMC-D loopback as implemented
> would always use the newly added SMCD_DEFAULT_V2_SEID might have
> misread it though. From your description I think that would be wrong,
> if a SEID is defined as on s390 it should use that SEID in the CLC for
> all SMC variants. Similarly on other architectures it should use the
> same SEID for SMC-D as for SMC-R, right? Also with partially match I
> was actually wrong the SMCD_DEFAULT_V2_SEID.seid_string starts with
> "IBM-DEF-ISMSEID…" while on s390's existing ISM we use "IBM-SYSZ-
> ISMSEID…" so if SMC-D loopback correctly uses the shared SEID on s390
> we can already only get GID.DMB collisions only on the same mainframe.
> 
> Thanks,
> Niklas

SMC stack uses a global variable smc_ism_v2_system_eid to indicate
the only one SEID of system. Because all ISMv2 devices return the same
SEID, SEID of the first registered ISMv2 device will be assigned to
smc_ism_v2_system_eid.

Now we have extension SMC-D devices, loopback or virtio-ism device,
and this may need a little change.

My original idea was that

- Extension SMC-D devices always return SMCD_DEFAULT_V2_SEID as SEID.
- If there is only extension device in the system, smc_ism_v2_system_eid
   will record SMCD_DEFAULT_V2_SEID returned by SMC-D extension device.
- If extension devices coexist with ISM devices on s390, smc_ism_v2_system_eid
   will record SEID of ISM devices.

But inspired by your comments, I find the original idea has some problems
in situation that one side has only extension devices but the other side
has both extension and ISM devices. Although they can communicate through
the extension devices(virtio-ism), SMC-D connection is unavailable due to
the different SEIDs.

So as you suggested, the extension devices (loopback or virtio-ism) should
use the same way as ISM device to get the shared SEID on s390 arch.

And on arch other than s390, extension SMC-D device can use a fixed SEID like
SMCD_DEFAULT_V2_SEID here if we do not require SMC-D communication between
different architectures.

Thanks,
Wen Gu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
  2023-04-05 14:48 ` [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wenjia Zhang
@ 2023-04-10 14:31   ` Wen Gu
  0 siblings, 0 replies; 17+ messages in thread
From: Wen Gu @ 2023-04-10 14:31 UTC (permalink / raw)
  To: Wenjia Zhang, kgraul, jaka, wintera, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel



On 2023/4/5 22:48, Wenjia Zhang wrote:

> 
> Hi Wen,
> 
> Thank you for the new version. The discussion on the open issue is still on-going in our organisation internally. I 
> appreciate your patience!
> 
> One thing I need to mention during testing the loopback device on our platform is that we get crash, because 
> smc_ism-signal_shutdown() is called by smc_1gr_free_work(), which is called indirectly by smc_conn_free(). Please make 
> sure that it would go to the path of the loopback device cleanly. Any question and consideration is welcome!
> 
> Thanks,
> Wenjia

Thank you! Wenjia. Testing on s390 is really helpful.

Since most of the path in smc_ism_signal_shutdown() is inside the preprocessing
macro '#if IS_ENABLED(CONFIG_ISM) ... #endif', so they are not executed in my
test environment, therefore I didn't realized the interface of ops->signal_event
in loopback device and missed the crash.

I will fix this and check for the other parts wrapped by '#if IS_ENABLED(CONFIG_ISM)
... #endif' which I ignored before. Then I will send out a new version.

Thanks,
Wen Gu

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-04-10 14:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-27  3:28 [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 1/9] net/smc: Decouple ism_dev from SMC-D device dump Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 2/9] net/smc: Decouple ism_dev from SMC-D DMB registration Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 3/9] net/smc: Extract v2 check helper from SMC-D device registration Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 4/9] net/smc: Introduce SMC-D loopback device Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 5/9] net/smc: Introduce an interface for getting DMB attribute Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 6/9] net/smc: Introudce interfaces for DMB attach and detach Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 7/9] net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 8/9] net/smc: Modify cursor update logic when using mappable DMB Wen Gu
2023-03-27  3:28 ` [RFC PATCH net-next v4 9/9] net/smc: Add interface implementation of loopback device Wen Gu
2023-04-05 14:48 ` [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration Wenjia Zhang
2023-04-10 14:31   ` Wen Gu
2023-04-05 17:04 ` Niklas Schnelle
2023-04-06 11:14   ` Alexandra Winter
2023-04-06 14:27     ` Niklas Schnelle
2023-04-10 14:30       ` Wen Gu
2023-04-10 14:30   ` Wen Gu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).