* [RFC PATCH v2 00/11] Device Memory TCP
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

Changes in RFC v2:
------------------

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].

Detailed changes:
1. Replaced the dma-buf pages approach with importing the scatterlist into
   the page pool.
2. Replaced the dma-buf-pages-centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
   implementing the TX path with the scatterlist approach, but leaving
   out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
   perf testing yet. I'm not sure there are regressions, but I removed
   perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by tags for contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.

Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:

1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patches 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
   the page pool (Patch 6).

[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/

----------------------

* TL;DR:

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.

* Problem:

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such
transfers. Some examples include:
- ML accelerators transferring large amounts of training data from storage into
  GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
  TPU compute time; improving data transfer throughput & efficiency can help
  improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data among themselves.

- Distributed raw block storage applications transfer large amounts of data
  to/from remote SSDs; much of this data does not require host processing.

Today, the majority of device-to-device data transfers over the network are
implemented as the following low-level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers, and can
put significant strain on system resources, such as host memory bandwidth and
PCIe bandwidth. One important reason behind the current state is the
kernel's lack of semantics to express device-to-network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to the traditional path which sends data through
  the root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and the page pool) operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
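
As a rough illustration of approach #2 (a sketch only; `sgt' here is assumed
to be the sg_table returned by dma_buf_map_attachment(), as done in patch 2),
the networking side walks the dma-buf's DMA-mapped scatterlist instead of
struct pages:

  struct scatterlist *sg;
  unsigned int i;

  /* Each sg entry is a DMA-contiguous region of the dma-buf; it is
   * chunked into PAGE_SIZE page_pool_iovs rather than being backed by
   * struct pages.
   */
  for_each_sgtable_dma_sg(sgt, sg, i) {
          dma_addr_t dma_addr = sg_dma_address(sg);
          size_t len = sg_dma_len(sg);

          /* hand [dma_addr, dma_addr + len) to the devmem allocator */
  }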

** Part 3: page pool support

We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
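
As a sketch of the driver-side hookup (names like `ring_size', `pdev' and
`rxq' are driver-specific placeholders, and the memory provider API itself is
still under discussion), the driver creates its page pool with the dmabuf
provider and passes the binding via mp_priv:

  struct page_pool_params pp_params = {
          .order           = 0,
          .pool_size       = ring_size,
          .nid             = NUMA_NO_NODE,
          .dev             = &pdev->dev,
          .memory_provider = PP_MP_DMABUF_DEVMEM,  /* patch 5 */
          .mp_priv         = rxq->binding,         /* patch 4; set by bind-rx */
  };
  struct page_pool *pool = page_pool_create(&pp_params);

  /* Note: PP_FLAG_DMA_MAP / PP_FLAG_DMA_SYNC_DEV must not be set; the
   * dmabuf is already DMA-mapped by the binding.
   */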

** Part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.
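
For example (a sketch only, with a hypothetical helper name; the real checks
are spread across the stack in patches 7 & 8), any code path that wants to
touch payload memory from the CPU has to detect device memory frags and bail
out:

  static int check_frag_readable(const skb_frag_t *frag)
  {
          struct page *page = skb_frag_page(frag);

          /* Device memory is not CPU-accessible; paths that need to
           * kmap()/copy the payload must reject such frags.
           */
          if (page_is_page_pool_iov(page))
                  return -EFAULT;

          return 0;
  }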

** Part 5: recvmsg() APIs

We define the user-facing APIs to send and receive device memory.

Not included with this RFC is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This RFC is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
   capability to split incoming packets into a header + payload and to put
   each into a separate buffer. Devmem TCP works by using device memory
   for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering & RSS support,
   i.e. the NIC's ability to steer flows into certain rx queues. This allows the
   sysadmin to enable devmem TCP on a subset of the rx queues, and steer
   devmem TCP traffic onto these queues and non-devmem TCP traffic elsewhere
   (see the sketch below).
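
As a sketch of the setup this implies (the exact ethtool options -
tcp-data-split for header split and an ntuple rule for flow steering - are
assumptions that depend on driver and ethtool support; port 5201 and queue 15
are arbitrary examples), a test could configure the NIC roughly like this:

  #include <stdio.h>
  #include <stdlib.h>

  static void setup_nic(const char *ifname)
  {
          char cmd[256];

          /* enable header split */
          snprintf(cmd, sizeof(cmd), "ethtool -G %s tcp-data-split on", ifname);
          system(cmd);

          /* steer the devmem TCP flow (dst port 5201) to rx queue 15 */
          snprintf(cmd, sizeof(cmd), "ethtool -K %s ntuple on", ifname);
          system(cmd);
          snprintf(cmd, sizeof(cmd),
                   "ethtool -N %s flow-type tcp4 dst-port 5201 action 15",
                   ifname);
          system(cmd);
  }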

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.

** Test Setup

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.

Mina Almasry (11):
  net: add netdev netlink api to bind dma-buf to a net device
  netdev: implement netlink api to bind dma-buf to netdevice
  netdev: implement netdevice devmem allocator
  memory-provider: updates to core provider API for devmem TCP
  memory-provider: implement dmabuf devmem memory provider
  page-pool: add device memory support
  net: support non paged skb frags
  net: add support for skbs with unreadable frags
  tcp: implement recvmsg() RX path for devmem TCP
  net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
  selftests: add ncdevmem, netcat for devmem TCP

 Documentation/netlink/specs/netdev.yaml |  27 ++
 include/linux/netdevice.h               |  61 +++
 include/linux/skbuff.h                  |  54 ++-
 include/linux/socket.h                  |   1 +
 include/net/page_pool.h                 | 186 ++++++++-
 include/net/sock.h                      |   2 +
 include/net/tcp.h                       |   5 +-
 include/uapi/asm-generic/socket.h       |   6 +
 include/uapi/linux/netdev.h             |  10 +
 include/uapi/linux/uio.h                |  10 +
 net/core/datagram.c                     |   6 +
 net/core/dev.c                          | 214 ++++++++++
 net/core/gro.c                          |   2 +-
 net/core/netdev-genl-gen.c              |  14 +
 net/core/netdev-genl-gen.h              |   1 +
 net/core/netdev-genl.c                  | 103 +++++
 net/core/page_pool.c                    | 171 ++++++--
 net/core/skbuff.c                       |  80 +++-
 net/core/sock.c                         |  36 ++
 net/ipv4/tcp.c                          | 196 ++++++++-
 net/ipv4/tcp_input.c                    |  13 +-
 net/ipv4/tcp_ipv4.c                     |   7 +
 net/ipv4/tcp_output.c                   |   5 +-
 net/packet/af_packet.c                  |   4 +-
 tools/include/uapi/linux/netdev.h       |  10 +
 tools/net/ynl/generated/netdev-user.c   |  41 ++
 tools/net/ynl/generated/netdev-user.h   |  46 ++
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile    |   5 +
 tools/testing/selftests/net/ncdevmem.c  | 534 ++++++++++++++++++++++++
 30 files changed, 1787 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/net/ncdevmem.c

-- 
2.41.0.640.ga95def55d0-goog



* [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

The API takes the dma-buf fd as input, and binds it to the netdevice. The
user can specify the rx queue to bind the dma-buf to. The user should be
able to bind the same dma-buf to multiple queues, but that is left as
a (minor) TODO in this iteration.

Suggested-by: Stanislav Fomichev <sdf@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
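A minimal userspace sketch of driving this op via the generated YNL code
below (this assumes the YNL C library helpers ynl_sock_create() /
ynl_sock_destroy() and the generated ynl_netdev_family symbol; error
handling trimmed):

  #include <ynl.h>
  #include "netdev-user.h"

  static int bind_rx(int ifindex, int dmabuf_fd, int queue_idx)
  {
          struct netdev_bind_rx_req *req;
          struct ynl_error yerr;
          struct ynl_sock *ys;
          int ret;

          ys = ynl_sock_create(&ynl_netdev_family, &yerr);
          if (!ys)
                  return -1;

          req = netdev_bind_rx_req_alloc();
          netdev_bind_rx_req_set_ifindex(req, ifindex);
          netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
          netdev_bind_rx_req_set_queue_idx(req, queue_idx);

          ret = netdev_bind_rx(ys, req);
          netdev_bind_rx_req_free(req);

          /* Note: the binding is torn down when the netlink socket is
           * closed (see patch 2), so a real user keeps 'ys' open for as
           * long as the binding should stay alive.
           */
          ynl_sock_destroy(ys);
          return ret;
  }
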
 Documentation/netlink/specs/netdev.yaml | 27 +++++++++++++++
 include/uapi/linux/netdev.h             | 10 ++++++
 net/core/netdev-genl-gen.c              | 14 ++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  |  6 ++++
 tools/include/uapi/linux/netdev.h       | 10 ++++++
 tools/net/ynl/generated/netdev-user.c   | 41 ++++++++++++++++++++++
 tools/net/ynl/generated/netdev-user.h   | 46 +++++++++++++++++++++++++
 8 files changed, 155 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e41015310a6e..907a45260e95 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -68,6 +68,23 @@ attribute-sets:
         type: u32
         checks:
           min: 1
+  -
+    name: bind-dmabuf
+    attributes:
+      -
+        name: ifindex
+        doc: netdev ifindex to bind the dma-buf to.
+        type: u32
+        checks:
+          min: 1
+      -
+        name: queue-idx
+        doc: receive queue index to bind the dma-buf to.
+        type: u32
+      -
+        name: dmabuf-fd
+        doc: dmabuf file descriptor to bind.
+        type: u32
 
 operations:
   list:
@@ -100,6 +117,16 @@ operations:
       doc: Notification about device configuration being changed.
       notify: dev-get
       mcgrp: mgmt
+    -
+      name: bind-rx
+      doc: Bind dmabuf to netdev
+      attribute-set: bind-dmabuf
+      do:
+        request:
+          attributes:
+            - ifindex
+            - dmabuf-fd
+            - queue-idx
 
 mcast-groups:
   list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index bf71698a1e82..242b2b65161c 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -47,11 +47,21 @@ enum {
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
 };
 
+enum {
+	NETDEV_A_BIND_DMABUF_IFINDEX = 1,
+	NETDEV_A_BIND_DMABUF_QUEUE_IDX,
+	NETDEV_A_BIND_DMABUF_DMABUF_FD,
+
+	__NETDEV_A_BIND_DMABUF_MAX,
+	NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
 	NETDEV_CMD_DEV_DEL_NTF,
 	NETDEV_CMD_DEV_CHANGE_NTF,
+	NETDEV_CMD_BIND_RX,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index ea9231378aa6..2e34ad5cccfa 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -15,6 +15,13 @@ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1
 	[NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
 };
 
+/* NETDEV_CMD_BIND_RX - do */
+static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_BIND_DMABUF_DMABUF_FD + 1] = {
+	[NETDEV_A_BIND_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+	[NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .type = NLA_U32, },
+	[NETDEV_A_BIND_DMABUF_QUEUE_IDX] = { .type = NLA_U32, },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -29,6 +36,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.dumpit	= netdev_nl_dev_get_dumpit,
 		.flags	= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_BIND_RX,
+		.doit		= netdev_nl_bind_rx_doit,
+		.policy		= netdev_bind_rx_nl_policy,
+		.maxattr	= NETDEV_A_BIND_DMABUF_DMABUF_FD,
+		.flags		= GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 7b370c073e7d..5aaeb435ec08 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -13,6 +13,7 @@
 
 int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 65ef4867fc49..bf7324dd6c36 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -141,6 +141,12 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+/* Stub */
+int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	return 0;
+}
+
 static int netdev_genl_netdevice_event(struct notifier_block *nb,
 				       unsigned long event, void *ptr)
 {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index bf71698a1e82..242b2b65161c 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -47,11 +47,21 @@ enum {
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
 };
 
+enum {
+	NETDEV_A_BIND_DMABUF_IFINDEX = 1,
+	NETDEV_A_BIND_DMABUF_QUEUE_IDX,
+	NETDEV_A_BIND_DMABUF_DMABUF_FD,
+
+	__NETDEV_A_BIND_DMABUF_MAX,
+	NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
 	NETDEV_CMD_DEV_DEL_NTF,
 	NETDEV_CMD_DEV_CHANGE_NTF,
+	NETDEV_CMD_BIND_RX,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index 4eb8aefef0cd..2716e63820d2 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -18,6 +18,7 @@ static const char * const netdev_op_strmap[] = {
 	[NETDEV_CMD_DEV_ADD_NTF] = "dev-add-ntf",
 	[NETDEV_CMD_DEV_DEL_NTF] = "dev-del-ntf",
 	[NETDEV_CMD_DEV_CHANGE_NTF] = "dev-change-ntf",
+	[NETDEV_CMD_BIND_RX] = "bind-rx",
 };
 
 const char *netdev_op_str(int op)
@@ -57,6 +58,17 @@ struct ynl_policy_nest netdev_dev_nest = {
 	.table = netdev_dev_policy,
 };
 
+struct ynl_policy_attr netdev_bind_dmabuf_policy[NETDEV_A_BIND_DMABUF_MAX + 1] = {
+	[NETDEV_A_BIND_DMABUF_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
+	[NETDEV_A_BIND_DMABUF_QUEUE_IDX] = { .name = "queue-idx", .type = YNL_PT_U32, },
+	[NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .name = "dmabuf-fd", .type = YNL_PT_U32, },
+};
+
+struct ynl_policy_nest netdev_bind_dmabuf_nest = {
+	.max_attr = NETDEV_A_BIND_DMABUF_MAX,
+	.table = netdev_bind_dmabuf_policy,
+};
+
 /* Common nested types */
 /* ============== NETDEV_CMD_DEV_GET ============== */
 /* NETDEV_CMD_DEV_GET - do */
@@ -172,6 +184,35 @@ void netdev_dev_get_ntf_free(struct netdev_dev_get_ntf *rsp)
 	free(rsp);
 }
 
+/* ============== NETDEV_CMD_BIND_RX ============== */
+/* NETDEV_CMD_BIND_RX - do */
+void netdev_bind_rx_req_free(struct netdev_bind_rx_req *req)
+{
+	free(req);
+}
+
+int netdev_bind_rx(struct ynl_sock *ys, struct netdev_bind_rx_req *req)
+{
+	struct nlmsghdr *nlh;
+	int err;
+
+	nlh = ynl_gemsg_start_req(ys, ys->family_id, NETDEV_CMD_BIND_RX, 1);
+	ys->req_policy = &netdev_bind_dmabuf_nest;
+
+	if (req->_present.ifindex)
+		mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_IFINDEX, req->ifindex);
+	if (req->_present.dmabuf_fd)
+		mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_DMABUF_FD, req->dmabuf_fd);
+	if (req->_present.queue_idx)
+		mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_QUEUE_IDX, req->queue_idx);
+
+	err = ynl_exec(ys, nlh, NULL);
+	if (err < 0)
+		return -1;
+
+	return 0;
+}
+
 static const struct ynl_ntf_info netdev_ntf_info[] =  {
 	[NETDEV_CMD_DEV_ADD_NTF] =  {
 		.alloc_sz	= sizeof(struct netdev_dev_get_ntf),
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index 5554dc69bb9c..74a43bb53627 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -82,4 +82,50 @@ struct netdev_dev_get_ntf {
 
 void netdev_dev_get_ntf_free(struct netdev_dev_get_ntf *rsp);
 
+/* ============== NETDEV_CMD_BIND_RX ============== */
+/* NETDEV_CMD_BIND_RX - do */
+struct netdev_bind_rx_req {
+	struct {
+		__u32 ifindex:1;
+		__u32 dmabuf_fd:1;
+		__u32 queue_idx:1;
+	} _present;
+
+	__u32 ifindex;
+	__u32 dmabuf_fd;
+	__u32 queue_idx;
+};
+
+static inline struct netdev_bind_rx_req *netdev_bind_rx_req_alloc(void)
+{
+	return calloc(1, sizeof(struct netdev_bind_rx_req));
+}
+void netdev_bind_rx_req_free(struct netdev_bind_rx_req *req);
+
+static inline void
+netdev_bind_rx_req_set_ifindex(struct netdev_bind_rx_req *req, __u32 ifindex)
+{
+	req->_present.ifindex = 1;
+	req->ifindex = ifindex;
+}
+static inline void
+netdev_bind_rx_req_set_dmabuf_fd(struct netdev_bind_rx_req *req,
+				 __u32 dmabuf_fd)
+{
+	req->_present.dmabuf_fd = 1;
+	req->dmabuf_fd = dmabuf_fd;
+}
+static inline void
+netdev_bind_rx_req_set_queue_idx(struct netdev_bind_rx_req *req,
+				 __u32 queue_idx)
+{
+	req->_present.queue_idx = 1;
+	req->queue_idx = queue_idx;
+}
+
+/*
+ * Bind dmabuf to netdev
+ */
+int netdev_bind_rx(struct ynl_sock *ys, struct netdev_bind_rx_req *req);
+
 #endif /* _LINUX_NETDEV_GEN_H */
-- 
2.41.0.640.ga95def55d0-goog



* [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

Add a netdev_dmabuf_binding struct which represents the
dma-buf-to-netdevice binding. The netlink API binds the dma-buf to
an rx queue on the netdevice. On binding, dma_buf_attach()
& dma_buf_map_attachment() are called. The entries in the sg_table from
the mapping are inserted into a genpool to make them ready
for allocation.

The chunks in the genpool are owned by a dmabuf_genpool_chunk_owner struct
which holds the dma-buf offset of the base of the chunk and the dma_addr of
the chunk. Both are needed to use allocations that come from this chunk.

We create a new type that represents an allocation from the genpool:
page_pool_iov. We set the page_pool_iov allocation size in the
genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
allocated by the page pool and given to the drivers.

The user can unbind the dmabuf from the netdevice by closing the netlink
socket that established the binding. We do this so that the binding is
automatically unbound even if the userspace process crashes.

The binding and unbinding leave an indicator in struct netdev_rx_queue
that the given queue is bound, but the binding doesn't take effect until
the driver actually reconfigures its queues and re-initializes its page
pool. This issue/weirdness is highlighted in the memory provider
proposal[1], and I'm hoping that some generic solution for all
memory providers will be discussed; this patch doesn't address that
weirdness again.

The netdev_dmabuf_binding struct is refcounted, and releases its
resources only when all the refs are released.

[1] https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
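For reviewers, a small worked sketch of the chunk-owner bookkeeping (the
helper names here are hypothetical; equivalent helpers are added as
page_pool_iov_dma_addr() and friends later in the series): slot i of a
chunk inserted at bind time maps back to its DMA address and dma-buf offset
purely from the owner metadata:

  static dma_addr_t chunk_slot_dma_addr(struct dmabuf_genpool_chunk_owner *owner,
                                        size_t i)
  {
          /* owner->base_dma_addr is the DMA address of the sg entry */
          return owner->base_dma_addr + ((dma_addr_t)i << PAGE_SHIFT);
  }

  static unsigned long chunk_slot_offset(struct dmabuf_genpool_chunk_owner *owner,
                                         size_t i)
  {
          /* owner->base_virtual is the offset of the sg entry in the dma-buf */
          return owner->base_virtual + (i << PAGE_SHIFT);
  }
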
 include/linux/netdevice.h |  57 ++++++++++++
 include/net/page_pool.h   |  27 ++++++
 net/core/dev.c            | 178 ++++++++++++++++++++++++++++++++++++++
 net/core/netdev-genl.c    | 101 ++++++++++++++++++++-
 4 files changed, 361 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3800d0479698..1b7c5966d2ca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -53,6 +53,8 @@
 #include <net/net_trackers.h>
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>

 struct netpoll_info;
 struct device;
@@ -782,6 +784,55 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 #endif
 #endif /* CONFIG_RPS */

+struct netdev_dmabuf_binding {
+	struct dma_buf *dmabuf;
+	struct dma_buf_attachment *attachment;
+	struct sg_table *sgt;
+	struct net_device *dev;
+	struct gen_pool *chunk_pool;
+
+	/* The user holds a ref (via the netlink API) for as long as they want
+	 * the binding to remain alive. Each page pool using this binding holds
+	 * a ref to keep the binding alive. Each allocated page_pool_iov holds a
+	 * ref.
+	 *
+	 * The binding undos itself and unmaps the underlying dmabuf once all
+	 * those refs are dropped and the binding is no longer desired or in
+	 * use.
+	 */
+	refcount_t ref;
+
+	/* The portid of the user that owns this binding. Used for netlink to
+	 * notify us of the user dropping the bind.
+	 */
+	u32 owner_nlportid;
+
+	/* The list of bindings currently active. Used for netlink to notify us
+	 * of the user dropping the bind.
+	 */
+	struct list_head list;
+
+	/* rxq's this binding is active on. */
+	struct xarray bound_rxq_list;
+};
+
+void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
+
+static inline void
+netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
+{
+	refcount_inc(&binding->ref);
+}
+
+static inline void
+netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
+{
+	if (!refcount_dec_and_test(&binding->ref))
+		return;
+
+	__netdev_devmem_binding_free(binding);
+}
+
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 	struct xdp_rxq_info		xdp_rxq;
@@ -796,6 +847,7 @@ struct netdev_rx_queue {
 #ifdef CONFIG_XDP_SOCKETS
 	struct xsk_buff_pool            *pool;
 #endif
+	struct netdev_dmabuf_binding	*binding;
 } ____cacheline_aligned_in_smp;

 /*
@@ -5026,6 +5078,11 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs);
 void netif_inherit_tso_max(struct net_device *to,
 			   const struct net_device *from);

+void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding);
+int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
+				u32 rxq_idx,
+				struct netdev_dmabuf_binding **out);
+
 static inline bool netif_is_macsec(const struct net_device *dev)
 {
 	return dev->priv_flags & IFF_MACSEC;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 364fe6924258..61b2066d32b5 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -170,6 +170,33 @@ extern const struct pp_memory_provider_ops hugesp_ops;
 extern const struct pp_memory_provider_ops huge_ops;
 extern const struct pp_memory_provider_ops huge_1g_ops;

+/* page_pool_iov support */
+
+/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
+ * entry from the dmabuf is inserted into the genpool as a chunk, and needs
+ * this owner struct to keep track of some metadata necessary to create
+ * allocations from this chunk.
+ */
+struct dmabuf_genpool_chunk_owner {
+	/* Offset into the dma-buf where this chunk starts.  */
+	unsigned long base_virtual;
+
+	/* dma_addr of the start of the chunk.  */
+	dma_addr_t base_dma_addr;
+
+	/* Array of page_pool_iovs for this chunk. */
+	struct page_pool_iov *ppiovs;
+	size_t num_ppiovs;
+
+	struct netdev_dmabuf_binding *binding;
+};
+
+struct page_pool_iov {
+	struct dmabuf_genpool_chunk_owner *owner;
+
+	refcount_t refcount;
+};
+
 struct page_pool {
 	struct page_pool_params p;

diff --git a/net/core/dev.c b/net/core/dev.c
index 8e7d0cb540cd..02a25ccf771a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -151,6 +151,8 @@
 #include <linux/pm_runtime.h>
 #include <linux/prandom.h>
 #include <linux/once_lite.h>
+#include <linux/genalloc.h>
+#include <linux/dma-buf.h>

 #include "dev.h"
 #include "net-sysfs.h"
@@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
 	return call_netdevice_notifiers_info(val, &info.info);
 }

+/* Device memory support */
+
+static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
+					   struct gen_pool_chunk *chunk,
+					   void *not_used)
+{
+	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
+
+	kvfree(owner->ppiovs);
+	kfree(owner);
+}
+
+void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
+{
+	size_t size, avail;
+
+	gen_pool_for_each_chunk(binding->chunk_pool,
+				netdev_devmem_free_chunk_owner, NULL);
+
+	size = gen_pool_size(binding->chunk_pool);
+	avail = gen_pool_avail(binding->chunk_pool);
+
+	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
+		  size, avail))
+		gen_pool_destroy(binding->chunk_pool);
+
+	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
+				 DMA_BIDIRECTIONAL);
+	dma_buf_detach(binding->dmabuf, binding->attachment);
+	dma_buf_put(binding->dmabuf);
+	kfree(binding);
+}
+
+void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
+{
+	struct netdev_rx_queue *rxq;
+	unsigned long xa_idx;
+
+	list_del_rcu(&binding->list);
+
+	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
+		if (rxq->binding == binding)
+			rxq->binding = NULL;
+
+	netdev_devmem_binding_put(binding);
+}
+
+int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
+				u32 rxq_idx, struct netdev_dmabuf_binding **out)
+{
+	struct netdev_dmabuf_binding *binding;
+	struct netdev_rx_queue *rxq;
+	struct scatterlist *sg;
+	struct dma_buf *dmabuf;
+	unsigned int sg_idx, i;
+	unsigned long virtual;
+	u32 xa_idx;
+	int err;
+
+	rxq = __netif_get_rx_queue(dev, rxq_idx);
+
+	if (rxq->binding)
+		return -EEXIST;
+
+	dmabuf = dma_buf_get(dmabuf_fd);
+	if (IS_ERR_OR_NULL(dmabuf))
+		return -EBADFD;
+
+	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
+			       dev_to_node(&dev->dev));
+	if (!binding) {
+		err = -ENOMEM;
+		goto err_put_dmabuf;
+	}
+
+	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
+
+	refcount_set(&binding->ref, 1);
+
+	binding->dmabuf = dmabuf;
+
+	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
+	if (IS_ERR(binding->attachment)) {
+		err = PTR_ERR(binding->attachment);
+		goto err_free_binding;
+	}
+
+	binding->sgt = dma_buf_map_attachment(binding->attachment,
+					      DMA_BIDIRECTIONAL);
+	if (IS_ERR(binding->sgt)) {
+		err = PTR_ERR(binding->sgt);
+		goto err_detach;
+	}
+
+	/* For simplicity we expect to make PAGE_SIZE allocations, but the
+	 * binding can be much more flexible than that. We may be able to
+	 * allocate MTU sized chunks here. Leave that for future work...
+	 */
+	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
+					      dev_to_node(&dev->dev));
+	if (!binding->chunk_pool) {
+		err = -ENOMEM;
+		goto err_unmap;
+	}
+
+	virtual = 0;
+	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
+		dma_addr_t dma_addr = sg_dma_address(sg);
+		struct dmabuf_genpool_chunk_owner *owner;
+		size_t len = sg_dma_len(sg);
+		struct page_pool_iov *ppiov;
+
+		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
+				     dev_to_node(&dev->dev));
+		owner->base_virtual = virtual;
+		owner->base_dma_addr = dma_addr;
+		owner->num_ppiovs = len / PAGE_SIZE;
+		owner->binding = binding;
+
+		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
+					 dma_addr, len, dev_to_node(&dev->dev),
+					 owner);
+		if (err) {
+			err = -EINVAL;
+			goto err_free_chunks;
+		}
+
+		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
+					       sizeof(*owner->ppiovs),
+					       GFP_KERNEL);
+		if (!owner->ppiovs) {
+			err = -ENOMEM;
+			goto err_free_chunks;
+		}
+
+		for (i = 0; i < owner->num_ppiovs; i++) {
+			ppiov = &owner->ppiovs[i];
+			ppiov->owner = owner;
+			refcount_set(&ppiov->refcount, 1);
+		}
+
+		dma_addr += len;
+		virtual += len;
+	}
+
+	/* TODO: need to be able to bind to multiple rx queues on the same
+	 * netdevice. The code should already be able to handle that with
+	 * minimal changes, but the netlink API currently allows for 1 rx
+	 * queue.
+	 */
+	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
+		       GFP_KERNEL);
+	if (err)
+		goto err_free_chunks;
+
+	rxq->binding = binding;
+	*out = binding;
+
+	return 0;
+
+err_free_chunks:
+	gen_pool_for_each_chunk(binding->chunk_pool,
+				netdev_devmem_free_chunk_owner, NULL);
+	gen_pool_destroy(binding->chunk_pool);
+err_unmap:
+	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
+				 DMA_BIDIRECTIONAL);
+err_detach:
+	dma_buf_detach(dmabuf, binding->attachment);
+err_free_binding:
+	kfree(binding);
+err_put_dmabuf:
+	dma_buf_put(dmabuf);
+	return err;
+}
+
 #ifdef CONFIG_NET_INGRESS
 static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);

diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index bf7324dd6c36..288ed0112995 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -141,10 +141,74 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }

-/* Stub */
+static LIST_HEAD(netdev_rbinding_list);
+
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	return 0;
+	struct netdev_dmabuf_binding *out_binding;
+	u32 ifindex, dmabuf_fd, rxq_idx;
+	struct net_device *netdev;
+	struct sk_buff *rsp;
+	int err = 0;
+	void *hdr;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUE_IDX))
+		return -EINVAL;
+
+	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
+	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
+	rxq_idx = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_QUEUE_IDX]);
+
+	rtnl_lock();
+
+	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
+	if (!netdev) {
+		err = -ENODEV;
+		goto err_unlock;
+	}
+
+	if (rxq_idx >= netdev->num_rx_queues) {
+		err = -ERANGE;
+		goto err_unlock;
+	}
+
+	if (netdev_bind_dmabuf_to_queue(netdev, dmabuf_fd, rxq_idx,
+					&out_binding)) {
+		err = -EINVAL;
+		goto err_unlock;
+	}
+
+	out_binding->owner_nlportid = info->snd_portid;
+	list_add_rcu(&out_binding->list, &netdev_rbinding_list);
+
+	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!rsp) {
+		err = -ENOMEM;
+		goto err_unbind;
+	}
+
+	hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,
+			  &netdev_nl_family, 0, info->genlhdr->cmd);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_genlmsg_free;
+	}
+
+	genlmsg_end(rsp, hdr);
+
+	rtnl_unlock();
+
+	return genlmsg_reply(rsp, info);
+
+err_genlmsg_free:
+	nlmsg_free(rsp);
+err_unbind:
+	netdev_unbind_dmabuf_to_queue(out_binding);
+err_unlock:
+	rtnl_unlock();
+	return err;
 }

 static int netdev_genl_netdevice_event(struct notifier_block *nb,
@@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
 	return NOTIFY_OK;
 }

+static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
+				 void *_notify)
+{
+	struct netlink_notify *notify = _notify;
+	struct netdev_dmabuf_binding *rbinding;
+
+	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
+		return NOTIFY_DONE;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
+		if (rbinding->owner_nlportid == notify->portid) {
+			netdev_unbind_dmabuf_to_queue(rbinding);
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return NOTIFY_OK;
+}
+
 static struct notifier_block netdev_genl_nb = {
 	.notifier_call	= netdev_genl_netdevice_event,
 };

+static struct notifier_block netdev_netlink_notifier = {
+	.notifier_call = netdev_netlink_notify,
+};
+
 static int __init netdev_genl_init(void)
 {
 	int err;
@@ -183,8 +274,14 @@ static int __init netdev_genl_init(void)
 	if (err)
 		goto err_unreg_ntf;

+	err = netlink_register_notifier(&netdev_netlink_notifier);
+	if (err)
+		goto err_unreg_family;
+
 	return 0;

+err_unreg_family:
+	genl_unregister_family(&netdev_nl_family);
 err_unreg_ntf:
 	unregister_netdevice_notifier(&netdev_genl_nb);
 	return err;
--
2.41.0.640.ga95def55d0-goog


* [RFC PATCH v2 03/11] netdev: implement netdevice devmem allocator
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

Implement netdev devmem allocator. The allocator takes a given struct
netdev_dmabuf_binding as input and allocates page_pool_iov from that
binding.

The allocation simply delegates to the binding's genpool for the
allocation logic and wraps the returned memory region in a page_pool_iov
struct.

page_pool_iovs are refcounted and are freed back to the binding when the
refcount drops to 0.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
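A minimal usage sketch of the allocator (the consumer here is hypothetical;
the real consumer is the dmabuf memory provider added in patch 5, and
`binding' is assumed to come from the netlink bind in patch 2):

  struct page_pool_iov *ppiov;
  dma_addr_t dma;

  ppiov = netdev_alloc_devmem(binding);  /* takes a ref on the binding */
  if (!ppiov)
          return -ENOMEM;

  dma = page_pool_iov_dma_addr(ppiov);   /* e.g. post to the NIC rx ring */

  /* ... once the buffer is no longer in use ... */
  netdev_free_devmem(ppiov);             /* back to the genpool, drops ref */
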
 include/linux/netdevice.h |  4 ++++
 include/net/page_pool.h   | 26 ++++++++++++++++++++++++++
 net/core/dev.c            | 36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1b7c5966d2ca..bb5296e6cb00 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5078,6 +5078,10 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs);
 void netif_inherit_tso_max(struct net_device *to,
 			   const struct net_device *from);

+struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
+void netdev_free_devmem(struct page_pool_iov *ppiov);
+
 void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding);
 int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
 				u32 rxq_idx,
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 61b2066d32b5..13ae7f668c9e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -197,6 +197,32 @@ struct page_pool_iov {
 	refcount_t refcount;
 };

+static inline struct dmabuf_genpool_chunk_owner *
+page_pool_iov_owner(const struct page_pool_iov *ppiov)
+{
+	return ppiov->owner;
+}
+
+static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov)
+{
+	return ppiov - page_pool_iov_owner(ppiov)->ppiovs;
+}
+
+static inline dma_addr_t
+page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+}
+
+static inline struct netdev_dmabuf_binding *
+page_pool_iov_binding(const struct page_pool_iov *ppiov)
+{
+	return page_pool_iov_owner(ppiov)->binding;
+}
+
 struct page_pool {
 	struct page_pool_params p;

diff --git a/net/core/dev.c b/net/core/dev.c
index 02a25ccf771a..0149335a25b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2072,6 +2072,42 @@ void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
 	kfree(binding);
 }

+struct page_pool_iov *netdev_alloc_devmem(struct netdev_dmabuf_binding *binding)
+{
+	struct dmabuf_genpool_chunk_owner *owner;
+	struct page_pool_iov *ppiov;
+	unsigned long dma_addr;
+	ssize_t offset;
+	ssize_t index;
+
+	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
+					(void **)&owner);
+	if (!dma_addr)
+		return NULL;
+
+	offset = dma_addr - owner->base_dma_addr;
+	index = offset / PAGE_SIZE;
+	ppiov = &owner->ppiovs[index];
+
+	netdev_devmem_binding_get(binding);
+
+	return ppiov;
+}
+
+void netdev_free_devmem(struct page_pool_iov *ppiov)
+{
+	struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
+
+	refcount_set(&ppiov->refcount, 1);
+
+	if (gen_pool_has_addr(binding->chunk_pool,
+			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
+		gen_pool_free(binding->chunk_pool,
+			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE);
+
+	netdev_devmem_binding_put(binding);
+}
+
 void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
 {
 	struct netdev_rx_queue *rxq;
--
2.41.0.640.ga95def55d0-goog


* [RFC PATCH v2 04/11] memory-provider: updates to core provider API for devmem TCP
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

Implement a few updates to Jakub's RFC memory provider[1] API to make it
suitable for device memory TCP:

1. Currently for devmem TCP the driver's netdev_rx_queue holds a
reference to the netdev_dmabuf_binding struct and needs to pass that to
the page_pool's memory provider somehow. For PoC purposes, create a
pp->mp_priv field that is set by the driver. Likely needs a better API
(likely dependent on the general memory provider API).

2. The current memory_provider API gives the memory_provider the option
to override put_page(), but still calls page_pool_clear_pp_info() after the
memory provider has released the page. IMO if the page freeing is
delegated to the provider then the page_pool should not modify the
page after release_page() has been called.

[1]: https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/net/page_pool.h | 1 +
 net/core/page_pool.c    | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 13ae7f668c9e..e395f82e182b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -78,6 +78,7 @@ struct page_pool_params {
 	struct device	*dev; /* device, for DMA pre-mapping purposes */
 	struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
 	u8		memory_provider; /* haaacks! should be user-facing */
+	void		*mp_priv; /* argument to pass to the memory provider */
 	enum dma_data_direction dma_dir; /* DMA mapping direction */
 	unsigned int	max_len; /* max DMA sync memory size */
 	unsigned int	offset;  /* DMA addr offset */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d50f6728e4f6..df3f431fcff3 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -241,6 +241,7 @@ static int page_pool_init(struct page_pool *pool,
 		goto free_ptr_ring;
 	}

+	pool->mp_priv = pool->p.mp_priv;
 	if (pool->mp_ops) {
 		err = pool->mp_ops->init(pool);
 		if (err) {
@@ -564,16 +565,16 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	else
 		__page_pool_release_page_dma(pool, page);

-	page_pool_clear_pp_info(page);
-
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
 	 */
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);

-	if (put)
+	if (put) {
+		page_pool_clear_pp_info(page);
 		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
--
2.41.0.640.ga95def55d0-goog


* [RFC PATCH v2 05/11] memory-provider: implement dmabuf devmem memory provider
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

Implement a memory provider that allocates dmabuf devmem page_pool_iovs.

Support of PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, and PP_FLAG_PAGE_FRAG
is omitted. PP_FLAG_DMA_MAP is irrelevant as dma-buf devmem is already
mapped. The other flags are omitted for simplicity.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the page_pool_params.

The provider obtains a reference on the netdev_dmabuf_binding, which
guarantees the binding and the underlying mapping remain alive until
the provider is destroyed.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/net/page_pool.h | 58 ++++++++++++++++++++++++++++++
 net/core/page_pool.c    | 79 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index e395f82e182b..537eb36115ed 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -32,6 +32,7 @@
 #include <linux/mm.h> /* Needed by ptr_ring */
 #include <linux/ptr_ring.h>
 #include <linux/dma-direction.h>
+#include <net/net_debug.h>

 #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
 					* map/unmap
@@ -157,6 +158,7 @@ enum pp_memory_provider_type {
 	PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */
 	PP_MP_HUGE, /* 2MB, all memory pre-allocated */
 	PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */
+	PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
 };

 struct pp_memory_provider_ops {
@@ -170,9 +172,15 @@ extern const struct pp_memory_provider_ops basic_ops;
 extern const struct pp_memory_provider_ops hugesp_ops;
 extern const struct pp_memory_provider_ops huge_ops;
 extern const struct pp_memory_provider_ops huge_1g_ops;
+extern const struct pp_memory_provider_ops dmabuf_devmem_ops;

 /* page_pool_iov support */

+/*  We overload the LSB of the struct page pointer to indicate whether it's
+ *  a page or page_pool_iov.
+ */
+#define PP_DEVMEM 0x01UL
+
 /* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
  * entry from the dmabuf is inserted into the genpool as a chunk, and needs
  * this owner struct to keep track of some metadata necessary to create
@@ -196,6 +204,8 @@ struct page_pool_iov {
 	struct dmabuf_genpool_chunk_owner *owner;

 	refcount_t refcount;
+
+	struct page_pool *pp;
 };

 static inline struct dmabuf_genpool_chunk_owner *
@@ -218,12 +228,60 @@ page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
 	       ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
 }

+static inline unsigned long
+page_pool_iov_virtual_addr(const struct page_pool_iov *ppiov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
+
+	return owner->base_virtual +
+	       ((unsigned long)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+}
+
 static inline struct netdev_dmabuf_binding *
 page_pool_iov_binding(const struct page_pool_iov *ppiov)
 {
 	return page_pool_iov_owner(ppiov)->binding;
 }

+static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
+{
+	return refcount_read(&ppiov->refcount);
+}
+
+static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
+					  unsigned int count)
+{
+	refcount_add(count, &ppiov->refcount);
+}
+
+void __page_pool_iov_free(struct page_pool_iov *ppiov);
+
+static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
+					  unsigned int count)
+{
+	if (!refcount_sub_and_test(count, &ppiov->refcount))
+		return;
+
+	__page_pool_iov_free(ppiov);
+}
+
+/* page pool mm helpers */
+
+static inline bool page_is_page_pool_iov(const struct page *page)
+{
+	return (unsigned long)page & PP_DEVMEM;
+}
+
+static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return (struct page_pool_iov *)((unsigned long)page &
+						~PP_DEVMEM);
+
+	DEBUG_NET_WARN_ON_ONCE(true);
+	return NULL;
+}
+
 struct page_pool {
 	struct page_pool_params p;

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index df3f431fcff3..0a7c08d748b8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -20,6 +20,7 @@
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/genalloc.h>

 #include <trace/events/page_pool.h>

@@ -236,6 +237,9 @@ static int page_pool_init(struct page_pool *pool,
 	case PP_MP_HUGE_1G:
 		pool->mp_ops = &huge_1g_ops;
 		break;
+	case PP_MP_DMABUF_DEVMEM:
+		pool->mp_ops = &dmabuf_devmem_ops;
+		break;
 	default:
 		err = -EINVAL;
 		goto free_ptr_ring;
@@ -1006,6 +1010,15 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);

+void __page_pool_iov_free(struct page_pool_iov *ppiov)
+{
+	if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
+		return;
+
+	netdev_free_devmem(ppiov);
+}
+EXPORT_SYMBOL_GPL(__page_pool_iov_free);
+
 /***********************
  *  Mem provider hack  *
  ***********************/
@@ -1538,3 +1551,69 @@ const struct pp_memory_provider_ops huge_1g_ops = {
 	.alloc_pages		= mp_huge_1g_alloc_pages,
 	.release_page		= mp_huge_1g_release,
 };
+
+/*** "Dmabuf devmem memory provider" ***/
+
+static int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+	if (!binding)
+		return -EINVAL;
+
+	if (pool->p.flags & PP_FLAG_DMA_MAP ||
+	    pool->p.flags & PP_FLAG_DMA_SYNC_DEV ||
+	    pool->p.flags & PP_FLAG_PAGE_FRAG)
+		return -EOPNOTSUPP;
+
+	netdev_devmem_binding_get(binding);
+	return 0;
+}
+
+static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
+						 gfp_t gfp)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+	struct page_pool_iov *ppiov;
+
+	ppiov = netdev_alloc_devmem(binding);
+	if (!ppiov)
+		return NULL;
+
+	ppiov->pp = pool;
+	pool->pages_state_hold_cnt++;
+	trace_page_pool_state_hold(pool, (struct page *)ppiov,
+				   pool->pages_state_hold_cnt);
+	ppiov = (struct page_pool_iov *)((unsigned long)ppiov | PP_DEVMEM);
+
+	return (struct page *)ppiov;
+}
+
+static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+	netdev_devmem_binding_put(binding);
+}
+
+static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+					  struct page *page)
+{
+	struct page_pool_iov *ppiov;
+
+	if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
+		return false;
+
+	ppiov = page_to_page_pool_iov(page);
+	page_pool_iov_put_many(ppiov, 1);
+	/* We don't want the page pool put_page()ing our page_pool_iovs. */
+	return false;
+}
+
+const struct pp_memory_provider_ops dmabuf_devmem_ops = {
+	.init			= mp_dmabuf_devmem_init,
+	.destroy		= mp_dmabuf_devmem_destroy,
+	.alloc_pages		= mp_dmabuf_devmem_alloc_pages,
+	.release_page		= mp_dmabuf_devmem_release_page,
+};
+EXPORT_SYMBOL(dmabuf_devmem_ops);
--
2.41.0.640.ga95def55d0-goog


* [RFC PATCH v2 06/11] page-pool: add device memory support
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling to those helpers. Convert callers of these mm APIs to use the
helpers instead.

In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we could
define a new union of struct page & struct page_pool_iov and wrap it in
a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).

I have a sample implementation of adding a new page_pool_token type
in the page_pool to give a general idea here:
https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d

Full branch here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens

(In the branches above, page_pool_iov is called devmem_slice).

We could also add a static_branch to speed up these checks when
page_pool_iov memory providers are being used.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
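A short sketch of the LSB tagging round-trip (encoding as done by the
dmabuf memory provider in patch 5, decoding via the helpers this patch
uses; `binding' is assumed to exist):

  struct page_pool_iov *ppiov = netdev_alloc_devmem(binding);
  struct page *page;

  /* encode: hand the tagged pointer to the page pool as a "page" */
  page = (struct page *)((unsigned long)ppiov | PP_DEVMEM);

  /* decode: anywhere the stack needs the real page_pool_iov */
  if (page_is_page_pool_iov(page))
          ppiov = page_to_page_pool_iov(page);  /* masks off PP_DEVMEM */
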
 include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
 net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
 2 files changed, 131 insertions(+), 28 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 537eb36115ed..f08ca230d68e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
 	return NULL;
 }

+static inline int page_pool_page_ref_count(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+	return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_get_many(page_to_page_pool_iov(page),
+					      count);
+
+	return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_put_many(page_to_page_pool_iov(page),
+					      count);
+
+	if (count > 1)
+		page_ref_sub(page, count - 1);
+
+	put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return false;
+
+	return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+	/* Assume page_pool_iov are on the preferred node without actually
+	 * checking...
+	 *
+	 * This check is only used to check for recycling memory in the page
+	 * pool's fast paths. Currently the only implementation of page_pool_iov
+	 * is dmabuf device memory. It's a deliberate decision by the user to
+	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+	 * would not be able to reallocate memory from another dmabuf that
+	 * exists on the preferred node, so, this check doesn't make much sense
+	 * in this case. Assume all page_pool_iovs can be recycled for now.
+	 */
+	if (page_is_page_pool_iov(page))
+		return true;
+
+	return page_to_nid(page) == pref_nid;
+}
+
 struct page_pool {
 	struct page_pool_params p;

@@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
 {
 	long ret;

+	if (page_is_page_pool_iov(page))
+		return -EINVAL;
+
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page. No need to actually overwrite it, instead
 	 * we can leave this to be overwritten by the calling function.
@@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,

 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr;
+	dma_addr_t ret;
+
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
+
+	ret = page->dma_addr;

 	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
 		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
@@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)

 static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return;
+	}
+
 	page->dma_addr = addr;
 	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
 		page->dma_addr_upper = upper_32_bits(addr);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 0a7c08d748b8..20c1f74fd844 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 		if (unlikely(!page))
 			break;

-		if (likely(page_to_nid(page) == pref_nid)) {
+		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
 			pool->alloc.cache[pool->alloc.count++] = page;
 		} else {
 			/* NUMA mismatch;
@@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					  struct page *page,
 					  unsigned int dma_sync_size)
 {
-	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+	dma_addr_t dma_addr;
+
+	/* page_pool_iov memory providers do not support PP_FLAG_DMA_SYNC_DEV */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return;
+	}
+
+	dma_addr = page_pool_get_dma_addr(page);

 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
 	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
@@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;

+	if (page_is_page_pool_iov(page)) {
+		/* page_pool_iovs are already mapped */
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return true;
+	}
+
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
@@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 static void page_pool_set_pp_info(struct page_pool *pool,
 				  struct page *page)
 {
-	page->pp = pool;
-	page->pp_magic |= PP_SIGNATURE;
+	if (!page_is_page_pool_iov(page)) {
+		page->pp = pool;
+		page->pp_magic |= PP_SIGNATURE;
+	} else {
+		page_to_page_pool_iov(page)->pp = pool;
+	}
+
 	if (pool->p.init_callback)
 		pool->p.init_callback(page, pool->p.init_arg);
 }

 static void page_pool_clear_pp_info(struct page *page)
 {
+	if (page_is_page_pool_iov(page)) {
+		page_to_page_pool_iov(page)->pp = NULL;
+		return;
+	}
+
 	page->pp_magic = 0;
 	page->pp = NULL;
 }
@@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
 		return false;
 	}

-	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	/* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
 	pool->alloc.cache[pool->alloc.count++] = page;
 	recycle_stat_inc(pool, cached);
 	return true;
@@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 *
 	 * page is NOT reusable when allocated when system is under
-	 * some pressure. (page_is_pfmemalloc)
+	 * some pressure. (page_pool_page_is_pfmemalloc)
 	 */
-	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
+	if (likely(page_pool_page_ref_count(page) == 1 &&
+		   !page_pool_page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */

 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
@@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
 	if (likely(page_pool_defrag_page(page, drain_count)))
 		return NULL;

-	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
+	if (page_pool_page_ref_count(page) == 1 &&
+	    !page_pool_page_is_pfmemalloc(page)) {
 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 			page_pool_dma_sync_for_device(pool, page, -1);

@@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
 	/* Empty recycle ring */
 	while ((page = ptr_ring_consume_bh(&pool->ring))) {
 		/* Verify the refcnt invariant of cached pages */
-		if (!(page_ref_count(page) == 1))
+		if (!(page_pool_page_ref_count(page) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
-				__func__, page_ref_count(page));
+				__func__, page_pool_page_ref_count(page));

 		page_pool_return_page(pool, page);
 	}
@@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
 	struct page_pool *pp;
 	bool allow_direct;

-	page = compound_head(page);
+	if (!page_is_page_pool_iov(page)) {
+		page = compound_head(page);

-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
+		/* page->pp_magic is OR'ed with PP_SIGNATURE after the
+		 * allocation in order to preserve any existing bits, such as
+		 * bit 0 for the head page of compound page and bit 1 for
+		 * pfmemalloc page, so mask those bits for freeing side when
+		 * doing below checking, and page_pool_page_is_pfmemalloc() is
+		 * checked in __page_pool_put_page() to avoid recycling the
+		 * pfmemalloc page.
+		 */
+		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+			return false;

-	pp = page->pp;
+		pp = page->pp;
+	} else {
+		pp = page_to_page_pool_iov(page)->pp;
+	}

 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
@@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)

 	for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
 		page = hu->page[idx] + j;
-		if (page_ref_count(page) != 1) {
+		if (page_pool_page_ref_count(page) != 1) {
 			pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
-				page_ref_count(page), idx, j);
+				page_pool_page_ref_count(page), idx, j);
 			return true;
 		}
 	}
@@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
 			continue;

 		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
-		    page_ref_count(page) != 1) {
+		    page_pool_page_ref_count(page) != 1) {
 			atomic_inc(&mp_huge_ins_b);
 			continue;
 		}
@@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
 	free = true;
 	for (i = 0; i < MP_HUGE_1G_CNT; i++) {
 		page = hu->page + i;
-		if (page_ref_count(page) != 1) {
+		if (page_pool_page_ref_count(page) != 1) {
 			pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
-				page_ref_count(page), i);
+				page_pool_page_ref_count(page), i);
 			free = false;
 			break;
 		}
@@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		page = hu->page + page_i;

 		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
-		    page_ref_count(page) != 1) {
+		    page_pool_page_ref_count(page) != 1) {
 			atomic_inc(&mp_huge_ins_b);
 			continue;
 		}
--
2.41.0.640.ga95def55d0-goog

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 07/11] net: support non paged skb frags
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (5 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 06/11] page-pool: add device memory support Mina Almasry
@ 2023-08-10  1:57 ` Mina Almasry
  2023-08-10  1:57 ` [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags Mina Almasry
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.

Correctly handle skb_frag refcounting in the page_pool_iovs case.
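
For illustration, the caller-side pattern after this change boils down to
the following; example_walk_frags() is a hypothetical helper, not part of
the diff below:

/* Hypothetical caller: a NULL return from skb_frag_page() means the
 * frag is not host memory, so bail out instead of dereferencing it.
 */
static int example_walk_frags(const struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
		struct page *page = skb_frag_page(frag);

		if (WARN_ON_ONCE(!page))
			return -EFAULT;	/* devmem frag, not host-readable */

		/* safe to use page_address(page), kmap_local_page(), ... */
	}

	return 0;
}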

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/skbuff.h | 40 +++++++++++++++++++++++++++++++++-------
 net/core/gro.c         |  2 +-
 net/core/skbuff.c      |  3 +++
 net/ipv4/tcp.c         | 10 +++++++++-
 4 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index faaba050f843..5520587050c4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3389,15 +3389,38 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto,
 	fragto->bv_offset = fragfrom->bv_offset;
 }
 
+/* Returns true if the skb_frag contains a page_pool_iov. */
+static inline bool skb_frag_is_page_pool_iov(const skb_frag_t *frag)
+{
+	return page_is_page_pool_iov(frag->bv_page);
+}
+
 /**
  * skb_frag_page - retrieve the page referred to by a paged fragment
  * @frag: the paged fragment
  *
- * Returns the &struct page associated with @frag.
+ * Returns the &struct page associated with @frag. Returns NULL if this frag
+ * has no associated page.
  */
 static inline struct page *skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->bv_page;
+	if (!page_is_page_pool_iov(frag->bv_page))
+		return frag->bv_page;
+
+	return NULL;
+}
+
+/**
+ * skb_frag_page_pool_iov - retrieve the page_pool_iov referred to by fragment
+ * @frag: the fragment
+ *
+ * Returns the &struct page_pool_iov associated with @frag. Returns NULL if this
+ * frag has no associated page_pool_iov.
+ */
+static inline struct page_pool_iov *
+skb_frag_page_pool_iov(const skb_frag_t *frag)
+{
+	return page_to_page_pool_iov(frag->bv_page);
 }
 
 /**
@@ -3408,7 +3431,7 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	page_pool_page_get_many(frag->bv_page, 1);
 }
 
 /**
@@ -3426,13 +3449,13 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 static inline void
 napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
 {
-	struct page *page = skb_frag_page(frag);
-
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page, napi_safe))
+	if (recycle && page_pool_return_skb_page(frag->bv_page, napi_safe))
 		return;
+	page_pool_page_put_many(frag->bv_page, 1);
+#else
+	put_page(skb_frag_page(frag));
 #endif
-	put_page(page);
 }
 
 /**
@@ -3472,6 +3495,9 @@ static inline void skb_frag_unref(struct sk_buff *skb, int f)
  */
 static inline void *skb_frag_address(const skb_frag_t *frag)
 {
+	if (!skb_frag_page(frag))
+		return NULL;
+
 	return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
 }
 
diff --git a/net/core/gro.c b/net/core/gro.c
index 0759277dc14e..42d7f6755f32 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -376,7 +376,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
 	NAPI_GRO_CB(skb)->frag0 = NULL;
 	NAPI_GRO_CB(skb)->frag0_len = 0;
 
-	if (!skb_headlen(skb) && pinfo->nr_frags &&
+	if (!skb_headlen(skb) && pinfo->nr_frags && skb_frag_page(frag0) &&
 	    !PageHighMem(skb_frag_page(frag0)) &&
 	    (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
 		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..ac79881a2630 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2939,6 +2939,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
+		if (WARN_ON_ONCE(!skb_frag_page(f)))
+			return false;
+
 		if (__splice_segment(skb_frag_page(f),
 				     skb_frag_off(f), skb_frag_size(f),
 				     offset, len, spd, false, sk, pipe))
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88f4ebab12ac..7893df0e22ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2160,6 +2160,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			break;
 		}
 		page = skb_frag_page(frags);
+		if (WARN_ON_ONCE(!page))
+			break;
+
 		prefetchw(page);
 		pages[pages_to_map++] = page;
 		length += PAGE_SIZE;
@@ -4415,7 +4418,12 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 	for (i = 0; i < shi->nr_frags; ++i) {
 		const skb_frag_t *f = &shi->frags[i];
 		unsigned int offset = skb_frag_off(f);
-		struct page *page = skb_frag_page(f) + (offset >> PAGE_SHIFT);
+		struct page *page = skb_frag_page(f);
+
+		if (WARN_ON_ONCE(!page))
+			return 1;
+
+		page += offset >> PAGE_SHIFT;
 
 		sg_set_page(&sg, page, skb_frag_size(f),
 			    offset_in_page(offset));
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (6 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 07/11] net: support non paged skb frags Mina Almasry
@ 2023-08-10  1:57 ` Mina Almasry
  2023-08-10  1:57 ` [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP Mina Almasry
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and inaccessible to the host. We expect there to be no mixing and
matching of device memory frags (inaccessible) with host memory frags
(accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.

__skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
and sets skb->devmem accordingly.

Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
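
The guard added at each of these spots follows the same shape. As a rough
illustration (example_copy_payload() is a made-up helper, not taken from
this patch):

/* Made-up helper showing the rule this patch enforces: linear header
 * bytes stay readable, but code about to touch frag pages must check
 * skb_frags_not_readable() first and bail out for devmem skbs.
 */
static int example_copy_payload(const struct sk_buff *skb, void *to, int len)
{
	int copied = min_t(int, len, skb_headlen(skb));

	skb_copy_from_linear_data(skb, to, copied);

	if (copied < len && skb_frags_not_readable(skb))
		return -EFAULT;	/* remaining bytes live in device memory */

	/* ... otherwise walk skb_shinfo(skb)->frags as usual ... */
	return copied;
}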

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/skbuff.h | 14 +++++++-
 include/net/tcp.h      |  5 +--
 net/core/datagram.c    |  6 ++++
 net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c         |  6 ++++
 net/ipv4/tcp_input.c   | 13 +++++--
 net/ipv4/tcp_output.c  |  5 ++-
 net/packet/af_packet.c |  4 +--
 8 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5520587050c4..88a3fc7f99b7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -806,6 +806,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@csum_level: indicates the number of consecutive checksums found in
  *		the packet minus one that have been verified as
  *		CHECKSUM_UNNECESSARY (max 3)
+ *	@devmem: indicates that all the fragments in this skb are backed by
+ *		device memory.
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
@@ -992,7 +994,7 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
 	__u8			csum_not_inet:1;
 #endif
-
+	__u8			devmem:1;
 #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
@@ -1767,6 +1769,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
 		__skb_zcopy_downgrade_managed(skb);
 }

+/* Return true if frags in this skb are not readable by the host. */
+static inline bool skb_frags_not_readable(const struct sk_buff *skb)
+{
+	return skb->devmem;
+}
+
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
 	skb->next = NULL;
@@ -2469,6 +2477,10 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 					struct page *page, int off, int size)
 {
 	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+	if (page_is_page_pool_iov(page)) {
+		skb->devmem = true;
+		return;
+	}

 	/* Propagate page pfmemalloc to the skb if we can. The problem is
 	 * that not all callers have unique ownership of the page but rely
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c5fb90079920..1ea2d7274b8c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -979,7 +979,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)

 static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 {
-	return likely(!TCP_SKB_CB(skb)->eor);
+	return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb));
 }

 static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
@@ -987,7 +987,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
 {
 	return likely(tcp_skb_can_collapse_to(to) &&
 		      mptcp_skb_can_collapse(to, from) &&
-		      skb_pure_zcopy_same(to, from));
+		      skb_pure_zcopy_same(to, from) &&
+		      skb_frags_not_readable(to) == skb_frags_not_readable(from));
 }

 /* Events passed to congestion control interface */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 176eb5834746..cdd4fb129968 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 			return 0;
 	}

+	if (skb_frags_not_readable(skb))
+		goto short_copy;
+
 	/* Copy paged appendix. Hmm... why does this look so complicated? */
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
@@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 {
 	int frag;

+	if (skb_frags_not_readable(skb))
+		return -EFAULT;
+
 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
 		return msg->sg_from_iter(sk, skb, from, length);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ac79881a2630..1814d413897e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1175,6 +1175,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
 		struct page *p;
 		u8 *vaddr;

+		if (skb_frag_is_page_pool_iov(frag)) {
+			printk("%sskb frag: not readable", level);
+			len -= frag->bv_len;
+			if (!len)
+				break;
+			continue;
+		}
+
 		skb_frag_foreach_page(frag, skb_frag_off(frag),
 				      skb_frag_size(frag), p, p_off, p_len,
 				      copied) {
@@ -1752,6 +1760,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
 		return -EINVAL;

+	if (skb_frags_not_readable(skb))
+		return -EFAULT;
+
 	if (!num_frags)
 		goto release;

@@ -1922,8 +1933,12 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb_headroom(skb);
 	unsigned int size = skb_end_offset(skb) + skb->data_len;
-	struct sk_buff *n = __alloc_skb(size, gfp_mask,
-					skb_alloc_rx_flag(skb), NUMA_NO_NODE);
+	struct sk_buff *n;
+
+	if (skb_frags_not_readable(skb))
+		return NULL;
+
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), NUMA_NO_NODE);

 	if (!n)
 		return NULL;
@@ -2249,14 +2264,16 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 				int newheadroom, int newtailroom,
 				gfp_t gfp_mask)
 {
-	/*
-	 *	Allocate the copy buffer
-	 */
-	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
-					gfp_mask, skb_alloc_rx_flag(skb),
-					NUMA_NO_NODE);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
+	struct sk_buff *n;
+
+	if (skb_frags_not_readable(skb))
+		return NULL;
+
+	/* Allocate the copy buffer */
+	n = __alloc_skb(newheadroom + skb->len + newtailroom, gfp_mask,
+			skb_alloc_rx_flag(skb), NUMA_NO_NODE);

 	if (!n)
 		return NULL;
@@ -2595,6 +2612,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	 */
 	int i, k, eat = (skb->tail + delta) - skb->end;

+	if (skb_frags_not_readable(skb))
+		return NULL;
+
 	if (eat > 0 || skb_cloned(skb)) {
 		if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
 				     GFP_ATOMIC))
@@ -2748,6 +2768,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 		to     += copy;
 	}

+	if (skb_frags_not_readable(skb))
+		goto fault;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
@@ -2936,6 +2959,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	/*
 	 * then map the fragments
 	 */
+	if (skb_frags_not_readable(skb))
+		return false;
+
 	for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

@@ -3159,6 +3185,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 		from += copy;
 	}

+	if (skb_frags_not_readable(skb))
+		goto fault;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		int end;
@@ -3238,6 +3267,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 		pos	= copy;
 	}

+	if (skb_frags_not_readable(skb))
+		return 0;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
@@ -3338,6 +3370,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 		pos	= copy;
 	}

+	if (skb_frags_not_readable(skb))
+		return 0;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;

@@ -3795,7 +3830,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb,
 		skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i];

 	skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags;
+	skb1->devmem		   = skb->devmem;
 	skb_shinfo(skb)->nr_frags  = 0;
+	skb->devmem		   = 0;
 	skb1->data_len		   = skb->data_len;
 	skb1->len		   += skb1->data_len;
 	skb->data_len		   = 0;
@@ -3809,6 +3846,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 {
 	int i, k = 0;
 	const int nfrags = skb_shinfo(skb)->nr_frags;
+	const int devmem = skb->devmem;

 	skb_shinfo(skb)->nr_frags = 0;
 	skb1->len		  = skb1->data_len = skb->len - len;
@@ -3842,6 +3880,16 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 		pos += size;
 	}
 	skb_shinfo(skb1)->nr_frags = k;
+
+	if (skb_shinfo(skb)->nr_frags)
+		skb->devmem = devmem;
+	else
+		skb->devmem = 0;
+
+	if (skb_shinfo(skb1)->nr_frags)
+		skb1->devmem = devmem;
+	else
+		skb1->devmem = 0;
 }

 /**
@@ -4077,6 +4125,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data,
 		return block_limit - abs_offset;
 	}

+	if (skb_frags_not_readable(st->cur_skb))
+		return 0;
+
 	if (st->frag_idx == 0 && !st->frag_data)
 		st->stepped_offset += skb_headlen(st->cur_skb);

@@ -5678,7 +5729,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	    (from->pp_recycle && skb_cloned(from)))
 		return false;

-	if (len <= skb_tailroom(to)) {
+	if (skb_frags_not_readable(from) != skb_frags_not_readable(to))
+		return false;
+
+	if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) {
 		if (len)
 			BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
 		*delta_truesize = 0;
@@ -5853,6 +5907,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len)
 	if (!pskb_may_pull(skb, write_len))
 		return -ENOMEM;

+	if (skb_frags_not_readable(skb))
+		return -EFAULT;
+
 	if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
 		return 0;

@@ -6513,7 +6570,7 @@ void skb_condense(struct sk_buff *skb)
 {
 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || skb_frags_not_readable(skb))
 			return;

 		/* Nice, we can free page frag(s) right now */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7893df0e22ee..53ec616b1fb7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2143,6 +2143,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}

+			if (skb_frags_not_readable(skb))
+				break;
+
 			if (TCP_SKB_CB(skb)->has_rxtstamp) {
 				tcp_update_recv_tstamps(skb, tss);
 				zc->msg_flags |= TCP_CMSG_TS;
@@ -4415,6 +4418,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 	if (crypto_ahash_update(req))
 		return 1;

+	if (skb_frags_not_readable(skb))
+		return 1;
+
 	for (i = 0; i < shi->nr_frags; ++i) {
 		const skb_frag_t *f = &shi->frags[i];
 		unsigned int offset = skb_frag_off(f);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 670c3dab24f2..f5b12f963cd8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5204,6 +5204,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 	for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
 		n = tcp_skb_next(skb, list);

+		if (skb_frags_not_readable(skb))
+			goto skip_this;
+
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
 			skb = tcp_collapse_one(sk, skb, list, root);
@@ -5224,17 +5227,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 			break;
 		}

-		if (n && n != tail && mptcp_skb_can_collapse(skb, n) &&
+		if (n && n != tail && !skb_frags_not_readable(n) &&
+		    mptcp_skb_can_collapse(skb, n) &&
 		    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
 			end_of_skbs = false;
 			break;
 		}

+skip_this:
 		/* Decided to skip this, advance start seq. */
 		start = TCP_SKB_CB(skb)->end_seq;
 	}
 	if (end_of_skbs ||
-	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+	    skb_frags_not_readable(skb))
 		return;

 	__skb_queue_head_init(&tmp);
@@ -5278,7 +5284,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				if (!skb ||
 				    skb == tail ||
 				    !mptcp_skb_can_collapse(nskb, skb) ||
-				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+				    skb_frags_not_readable(skb))
 					goto end;
 #ifdef CONFIG_TLS_DEVICE
 				if (skb->decrypted != nskb->decrypted)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2cb39b6dad02..54bc4de6bce4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2300,7 +2300,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)

 		if (unlikely(TCP_SKB_CB(skb)->eor) ||
 		    tcp_has_tx_tstamp(skb) ||
-		    !skb_pure_zcopy_same(skb, next))
+		    !skb_pure_zcopy_same(skb, next) ||
+		    skb_frags_not_readable(skb) != skb_frags_not_readable(next))
 			return false;

 		len -= skb->len;
@@ -3169,6 +3170,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 		return false;
 	if (skb_cloned(skb))
 		return false;
+	if (skb_frags_not_readable(skb))
+		return false;
 	/* Some heuristics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 		return false;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 85ff90a03b0c..308151044032 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 		}
 	}

-	snaplen = skb->len;
+	snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;

 	res = run_filter(skb, sk, snaplen);
 	if (!res)
@@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		}
 	}

-	snaplen = skb->len;
+	snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;

 	res = run_filter(skb, sk, snaplen);
 	if (!res)
--
2.41.0.640.ga95def55d0-goog

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (7 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags Mina Almasry
@ 2023-08-10  1:57 ` Mina Almasry
  2023-08-10  1:57 ` [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages Mina Almasry
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_devmem() for custom handling.

tcp_recvmsg_devmem() copies any data in the skb's linear buffer into the
user's buffer, and returns a cmsg to the user indicating the number of
bytes copied there.

tcp_recvmsg_devmem() then loops over the inaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this information:

1. the offset into the dmabuf where the payload starts ('frag_offset').
2. the size of the frag ('frag_size').
3. an opaque token ('frag_token') to return to the kernel when the buffer
   is to be released.

The pages awaiting freeing are stored in the newly added
sk->sk_user_pages, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page.  All pages are released when the socket is
destroyed.
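
For orientation, a minimal userspace sketch of consuming these cmsgs;
error handling is omitted, 'fd' is assumed to be a connected TCP socket,
and the constants are the ones added in this series:

/* Headers as in ncdevmem.c (patch 11); struct cmsg_devmem and the
 * SCM_DEVMEM_* constants are the ones added by this series.
 */
static void receive_one(int fd, char *data, size_t len)
{
	char ctrl[CMSG_SPACE(sizeof(struct cmsg_devmem)) * 32];
	struct iovec iov = { .iov_base = data, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_SOCK_DEVMEM) <= 0)
		return;	/* error or EOF; handling omitted */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct cmsg_devmem *cd;

		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_HEADER &&
		     cm->cmsg_type != SCM_DEVMEM_OFFSET))
			continue;

		cd = (struct cmsg_devmem *)CMSG_DATA(cm);
		if (cm->cmsg_type == SCM_DEVMEM_HEADER) {
			/* cd->frag_size bytes of linear data are in 'data' */
		} else {
			/* payload sits at cd->frag_offset inside the bound
			 * dmabuf; return cd->frag_token with
			 * SO_DEVMEM_DONTNEED (patch 10) when done with it
			 */
		}
	}
}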

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/socket.h            |   1 +
 include/net/sock.h                |   2 +
 include/uapi/asm-generic/socket.h |   5 +
 include/uapi/linux/uio.h          |   6 +
 net/ipv4/tcp.c                    | 180 +++++++++++++++++++++++++++++-
 net/ipv4/tcp_ipv4.c               |   7 ++
 6 files changed, 196 insertions(+), 5 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 39b74d83c7c4..102733ae888d 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -326,6 +326,7 @@ struct ucred {
 					  * plain text and require encryption
 					  */
 
+#define MSG_SOCK_DEVMEM 0x2000000	/* Receive devmem skbs as cmsg */
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/net/sock.h b/include/net/sock.h
index 2eb916d1ff64..5d2a97001152 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -353,6 +353,7 @@ struct sk_filter;
   *	@sk_txtime_unused: unused txtime flags
   *	@ns_tracker: tracker for netns reference
   *	@sk_bind2_node: bind node in the bhash2 table
+  *	@sk_user_pages: xarray of pages the user is holding a reference on.
   */
 struct sock {
 	/*
@@ -545,6 +546,7 @@ struct sock {
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
 	struct hlist_node	sk_bind2_node;
+	struct xarray		sk_user_pages;
 };
 
 enum sk_pacing {
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8ce8a39a1e5f..aacb97f16b78 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,11 @@
 #define SO_PASSPIDFD		76
 #define SO_PEERPIDFD		77
 
+#define SO_DEVMEM_HEADER	98
+#define SCM_DEVMEM_HEADER	SO_DEVMEM_HEADER
+#define SO_DEVMEM_OFFSET	99
+#define SCM_DEVMEM_OFFSET	SO_DEVMEM_OFFSET
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index 059b1a9147f4..ae94763b1963 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -20,6 +20,12 @@ struct iovec
 	__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
 };
 
+struct cmsg_devmem {
+	__u64 frag_offset;
+	__u32 frag_size;
+	__u32 frag_token;
+};
+
 /*
  *	UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
  */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53ec616b1fb7..7a5279b61a89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -461,6 +461,7 @@ void tcp_init_sock(struct sock *sk)
 
 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	sk_sockets_allocated_inc(sk);
+	xa_init_flags(&sk->sk_user_pages, XA_FLAGS_ALLOC1);
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
@@ -2306,6 +2307,144 @@ static int tcp_inq_hint(struct sock *sk)
 	return inq;
 }
 
+/* On error, returns -errno. On success, returns the number of bytes sent to the
+ * user. May not consume all of @remaining_len.
+ */
+static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
+			      unsigned int offset, struct msghdr *msg,
+			      int remaining_len)
+{
+	struct cmsg_devmem cmsg_devmem = { 0 };
+	unsigned int start;
+	int i, copy, n;
+	int sent = 0;
+	int err = 0;
+
+	do {
+		start = skb_headlen(skb);
+
+		if (!skb_frags_not_readable(skb)) {
+			err = -ENODEV;
+			goto out;
+		}
+
+		/* Copy header. */
+		copy = start - offset;
+		if (copy > 0) {
+			copy = min(copy, remaining_len);
+
+			n = copy_to_iter(skb->data + offset, copy,
+					 &msg->msg_iter);
+			if (n != copy) {
+				err = -EFAULT;
+				goto out;
+			}
+
+			offset += copy;
+			remaining_len -= copy;
+
+			/* First a cmsg_devmem for # bytes copied to user
+			 * buffer.
+			 */
+			memset(&cmsg_devmem, 0, sizeof(cmsg_devmem));
+			cmsg_devmem.frag_size = copy;
+			err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_HEADER,
+				       sizeof(cmsg_devmem), &cmsg_devmem);
+			if (err)
+				goto out;
+
+			sent += copy;
+
+			if (remaining_len == 0)
+				goto out;
+		}
+
+		/* after that, send information of devmem pages through a
+		 * sequence of cmsg
+		 */
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			struct page_pool_iov *ppiov;
+			u64 frag_offset;
+			u32 user_token;
+			int end;
+
+			/* skb_frags_not_readable() should indicate that ALL the
+			 * frags in this skb are unreadable page_pool_iovs.
+			 * We're checking for that flag above, but also check
+			 * individual pages here. If the tcp stack is not
+			 * setting skb->devmem correctly, we still don't want to
+			 * crash here when accessing pgmap or priv below.
+			 */
+			if (!skb_frag_page_pool_iov(frag)) {
+				net_err_ratelimited("Found devmem skb frag that is not a page_pool_iov");
+				err = -ENODEV;
+				goto out;
+			}
+
+			ppiov = skb_frag_page_pool_iov(frag);
+			end = start + skb_frag_size(frag);
+			copy = end - offset;
+
+			if (copy > 0) {
+				copy = min(copy, remaining_len);
+
+				frag_offset = page_pool_iov_virtual_addr(ppiov) +
+					      skb_frag_off(frag) + offset -
+					      start;
+				cmsg_devmem.frag_offset = frag_offset;
+				cmsg_devmem.frag_size = copy;
+				err = xa_alloc((struct xarray *)&sk->sk_user_pages,
+					       &user_token, frag->bv_page,
+					       xa_limit_31b, GFP_KERNEL);
+				if (err)
+					goto out;
+
+				cmsg_devmem.frag_token = user_token;
+
+				offset += copy;
+				remaining_len -= copy;
+
+				err = put_cmsg(msg, SOL_SOCKET,
+					       SO_DEVMEM_OFFSET,
+					       sizeof(cmsg_devmem),
+					       &cmsg_devmem);
+				if (err)
+					goto out;
+
+				page_pool_iov_get_many(ppiov, 1);
+
+				sent += copy;
+
+				if (remaining_len == 0)
+					goto out;
+			}
+			start = end;
+		}
+
+		if (!remaining_len)
+			goto out;
+
+		/* if remaining_len is not satisfied yet, we need to go to the
+		 * next frag in the frag_list to satisfy remaining_len.
+		 */
+		skb = skb_shinfo(skb)->frag_list ?: skb->next;
+
+		offset = offset - start;
+	} while (skb);
+
+	if (remaining_len) {
+		err = -EFAULT;
+		goto out;
+	}
+
+out:
+	if (!sent)
+		sent = err;
+
+	return sent;
+}
+
 /*
  *	This routine copies from a sock struct into the user buffer.
  *
@@ -2318,6 +2457,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int flags, struct scm_timestamping_internal *tss,
 			      int *cmsg_flags)
 {
+	bool last_copied_devmem, last_copied_init = false;
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
@@ -2492,15 +2632,45 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if (!(flags & MSG_TRUNC)) {
-			err = skb_copy_datagram_msg(skb, offset, msg, used);
-			if (err) {
-				/* Exception. Bailout! */
-				if (!copied)
-					copied = -EFAULT;
+			if (last_copied_init &&
+			    last_copied_devmem != skb->devmem)
 				break;
+
+			if (!skb->devmem) {
+				err = skb_copy_datagram_msg(skb, offset, msg,
+							    used);
+				if (err) {
+					/* Exception. Bailout! */
+					if (!copied)
+						copied = -EFAULT;
+					break;
+				}
+			} else {
+				if (!(flags & MSG_SOCK_DEVMEM)) {
+					/* skb->devmem skbs can only be received
+					 * with the MSG_SOCK_DEVMEM flag.
+					 */
+					if (!copied)
+						copied = -EFAULT;
+
+					break;
+				}
+
+				err = tcp_recvmsg_devmem(sk, skb, offset, msg,
+							 used);
+				if (err <= 0) {
+					if (!copied)
+						copied = -EFAULT;
+
+					break;
+				}
+				used = err;
 			}
 		}
 
+		last_copied_devmem = skb->devmem;
+		last_copied_init = true;
+
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cecd5a135e64..4472b9357569 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2295,6 +2295,13 @@ static int tcp_v4_init_sock(struct sock *sk)
 void tcp_v4_destroy_sock(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct page *page;
+	unsigned long index;
+
+	xa_for_each(&sk->sk_user_pages, index, page)
+		page_pool_page_put_many(page, 1);
+
+	xa_destroy(&sk->sk_user_pages);
 
 	trace_tcp_destroy_sock(sk);
 
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (8 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP Mina Almasry
@ 2023-08-10  1:57 ` Mina Almasry
  2023-08-10  1:57 ` [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP Mina Almasry
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

Add an interface for the user to notify the kernel that it is done
reading the NET_RX dmabuf pages returned as cmsg. The kernel will
drop the reference on the NET_RX pages to make them available for
re-use.
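
A minimal sketch of the userspace side; tok_a/tok_b stand in for
frag_token values received earlier via SCM_DEVMEM_OFFSET, and 'fd' is the
devmem TCP socket:

/* tok_a / tok_b: frag_token values received earlier. The return value
 * is the number of pages the kernel actually released.
 */
struct devmemtoken tokens[2] = {
	{ .token_start = tok_a, .token_count = 1 },
	{ .token_start = tok_b, .token_count = 1 },
};

int freed = setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
		       tokens, sizeof(tokens));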

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/uapi/asm-generic/socket.h |  1 +
 include/uapi/linux/uio.h          |  4 ++++
 net/core/sock.c                   | 36 +++++++++++++++++++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index aacb97f16b78..eb93b43394d4 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,7 @@
 #define SO_PASSPIDFD		76
 #define SO_PEERPIDFD		77
 
+#define SO_DEVMEM_DONTNEED	97
 #define SO_DEVMEM_HEADER	98
 #define SCM_DEVMEM_HEADER	SO_DEVMEM_HEADER
 #define SO_DEVMEM_OFFSET	99
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index ae94763b1963..71314bf41590 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -26,6 +26,10 @@ struct cmsg_devmem {
 	__u32 frag_token;
 };
 
+struct devmemtoken {
+	__u32 token_start;
+	__u32 token_count;
+};
 /*
  *	UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
  */
diff --git a/net/core/sock.c b/net/core/sock.c
index ab1e8d1bd5a1..2736b770a399 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1049,6 +1049,39 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	return 0;
 }
 
+static noinline_for_stack int
+sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+{
+	struct devmemtoken tokens[128];
+	unsigned int num_tokens, i, j;
+	int ret;
+
+	if (sk->sk_type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP)
+		return -EBADF;
+
+	if (optlen % sizeof(struct devmemtoken) || optlen > sizeof(tokens))
+		return -EINVAL;
+
+	num_tokens = optlen / sizeof(struct devmemtoken);
+	if (copy_from_sockptr(tokens, optval, optlen))
+		return -EFAULT;
+
+	ret = 0;
+	for (i = 0; i < num_tokens; i++) {
+		for (j = 0; j < tokens[i].token_count; j++) {
+			struct page *page = xa_erase(&sk->sk_user_pages,
+						     tokens[i].token_start + j);
+
+			if (page) {
+				page_pool_page_put_many(page, 1);
+				ret++;
+			}
+		}
+	}
+
+	return ret;
+}
+
 void sockopt_lock_sock(struct sock *sk)
 {
 	/* When current->bpf_ctx is set, the setsockopt is called from
@@ -1528,6 +1561,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 		WRITE_ONCE(sk->sk_txrehash, (u8)val);
 		break;
 
+	case SO_DEVMEM_DONTNEED:
+		ret = sock_devmem_dontneed(sk, optval, optlen);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (9 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages Mina Almasry
@ 2023-08-10  1:57 ` Mina Almasry
  2023-08-10 10:29 ` [RFC PATCH v2 00/11] Device Memory TCP Christian König
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10  1:57 UTC (permalink / raw)
  To: netdev, linux-media, dri-devel
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it
sends and receives data using the devmem TCP APIs. It uses udmabuf as
the dmabuf provider. It is compatible with a regular netcat running on
a peer, or an ncdevmem running on a peer.

In addition to normal netcat support, ncdevmem has a validation mode,
where it sends a specific pattern and validates this pattern on the
receiver side to ensure data integrity.

Suggested-by: Stanislav Fomichev <sdf@google.com>

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 tools/testing/selftests/net/.gitignore |   1 +
 tools/testing/selftests/net/Makefile   |   5 +
 tools/testing/selftests/net/ncdevmem.c | 534 +++++++++++++++++++++++++
 3 files changed, 540 insertions(+)
 create mode 100644 tools/testing/selftests/net/ncdevmem.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 501854a89cc0..5f2f8f01c800 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -16,6 +16,7 @@ ipsec
 ipv6_flowlabel
 ipv6_flowlabel_mgr
 msg_zerocopy
+ncdevmem
 nettest
 psock_fanout
 psock_snd
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index ae53c26af51b..3181611552d3 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,6 +5,10 @@ CFLAGS +=  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/ $(KHDR_INCLUDES)
 # Additional include paths needed by kselftest.h
 CFLAGS += -I../
+CFLAGS += -I../../../net/ynl/generated/
+CFLAGS += -I../../../net/ynl/lib/
+
+LDLIBS += ../../../net/ynl/lib/ynl.a ../../../net/ynl/generated/protos.a
 
 LDLIBS += -lmnl
 
@@ -90,6 +94,7 @@ TEST_PROGS += test_vxlan_mdb.sh
 TEST_PROGS += test_bridge_neigh_suppress.sh
 TEST_PROGS += test_vxlan_nolocalbypass.sh
 TEST_PROGS += test_bridge_backup_port.sh
+TEST_GEN_FILES += ncdevmem
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c
new file mode 100644
index 000000000000..2efcc98f6067
--- /dev/null
+++ b/tools/testing/selftests/net/ncdevmem.c
@@ -0,0 +1,534 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <linux/uio.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+#define __iovec_defined
+#include <fcntl.h>
+#include <malloc.h>
+
+#include <arpa/inet.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+
+#include <linux/memfd.h>
+#include <linux/if.h>
+#include <linux/dma-buf.h>
+#include <linux/udmabuf.h>
+#include <libmnl/libmnl.h>
+#include <linux/types.h>
+#include <linux/netlink.h>
+#include <linux/genetlink.h>
+#include <linux/netdev.h>
+#include <time.h>
+
+#include "netdev-user.h"
+#include <ynl.h>
+
+#define PAGE_SHIFT 12
+#define PAGE_SIZE 4096
+#define TEST_PREFIX "ncdevmem"
+#define NUM_PAGES 16000
+
+#ifndef MSG_SOCK_DEVMEM
+#define MSG_SOCK_DEVMEM 0x2000000
+#endif
+
+/*
+ * tcpdevmem netcat. Works similarly to netcat but does device memory TCP
+ * instead of regular TCP. Uses udmabuf to mock a dmabuf provider.
+ *
+ * Usage:
+ *
+ * * Without validation:
+ *
+ *	On server:
+ *	ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -l \
+ *		-p 5201
+ *
+ *	On client:
+ *	ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5201
+ *
+ * * With Validation:
+ *	On server:
+ *	ncdevmem -s <server IP> -c <client IP> -l -f eth1 -n 0000:06:00.0 \
+ *		-p 5202 -v 1
+ *
+ *	On client:
+ *	ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5202 \
+ *		-v 100000
+ *
+ * Note this is compatible with regular netcat. i.e. the sender or receiver can
+ * be replaced with regular netcat to test the RX or TX path in isolation.
+ */
+
+static char *server_ip = "192.168.1.4";
+static char *client_ip = "192.168.1.2";
+static char *port = "5201";
+static size_t do_validation;
+static int queue_num = 15;
+static char *ifname = "eth1";
+static char *nic_pci_addr = "0000:06:00.0";
+static unsigned int iterations;
+
+void print_bytes(void *ptr, size_t size)
+{
+	unsigned char *p = ptr;
+	int i;
+	for (i = 0; i < size; i++) {
+		printf("%02hhX ", p[i]);
+	}
+	printf("\n");
+}
+
+void print_nonzero_bytes(void *ptr, size_t size)
+{
+	unsigned char *p = ptr;
+	unsigned int i;
+	for (i = 0; i < size; i++) {
+		if (p[i])
+			printf("%c", p[i]);
+	}
+	printf("\n");
+}
+
+void initialize_validation(void *line, size_t size)
+{
+	static unsigned char seed = 1;
+	unsigned char *ptr = line;
+	for (size_t i = 0; i < size; i++) {
+		ptr[i] = seed;
+		seed++;
+		if (seed == 254)
+			seed = 0;
+	}
+}
+
+void validate_buffer(void *line, size_t size)
+{
+	static unsigned char seed = 1;
+	int errors = 0;
+
+	unsigned char *ptr = line;
+	for (size_t i = 0; i < size; i++) {
+		if (ptr[i] != seed) {
+			fprintf(stderr,
+				"Failed validation: expected=%u, "
+				"actual=%u, index=%lu\n",
+				seed, ptr[i], i);
+			errors++;
+			if (errors > 20)
+				exit(1);
+		}
+		seed++;
+		if (seed == do_validation)
+			seed = 0;
+	}
+
+	fprintf(stdout, "Validated buffer\n");
+}
+
+/* Triggers a driver reset...
+ *
+ * The proper way to do this is probably 'ethtool --reset', but I don't have
+ * that supported on my current test bed. I resort to changing this
+ * configuration in the driver which also causes a driver reset...
+ */
+static void reset_flow_steering(void)
+{
+	char command[256];
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off",
+		 "eth1");
+	system(command);
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on",
+		 "eth1");
+	system(command);
+}
+
+static void configure_flow_steering(void)
+{
+	char command[256];
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command),
+		 "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s "
+		 "src-port %s dst-port %s queue %d",
+		 ifname, client_ip, server_ip, port, port, queue_num);
+	system(command);
+}
+
+/* Triggers a device reset, which causes the dmabuf pages binding to take
+ * effect. A better and more generic way to do this may be ethtool --reset.
+ */
+static void trigger_device_reset(void)
+{
+	char command[256];
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command),
+		 "sudo ethtool --set-priv-flags %s enable-header-split off",
+		 ifname);
+	system(command);
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command),
+		 "sudo ethtool --set-priv-flags %s enable-header-split on",
+		 ifname);
+	system(command);
+}
+
+static void bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
+			  unsigned int queue_num, struct ynl_sock **ys)
+{
+	struct ynl_error yerr;
+
+	struct netdev_bind_rx_req *req;
+	int ret = 0;
+
+	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+	if (!*ys) {
+		fprintf(stderr, "YNL: %s\n", yerr.msg);
+		return;
+	}
+
+	if (ynl_subscribe(*ys, "mgmt"))
+		goto err_close;
+
+	req = netdev_bind_rx_req_alloc();
+	netdev_bind_rx_req_set_ifindex(req, ifindex);
+	netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+	netdev_bind_rx_req_set_queue_idx(req, queue_num);
+
+	ret = netdev_bind_rx(*ys, req);
+	netdev_bind_rx_req_free(req);
+	if (!ret) {
+		perror("netdev_bind_rx");
+		goto err_close;
+	}
+
+	return;
+
+err_close:
+	fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg);
+	ynl_sock_destroy(*ys);
+	exit(1);
+	return;
+}
+
+static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size)
+{
+	struct udmabuf_create create;
+	int ret;
+
+	*devfd = open("/dev/udmabuf", O_RDWR);
+	if (*devfd < 0) {
+		fprintf(stderr,
+			"%s: [skip,no-udmabuf: Unable to access DMA "
+			"buffer device file]\n",
+			TEST_PREFIX);
+		exit(70);
+	}
+
+	*memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
+	if (*memfd < 0) {
+		printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
+		exit(72);
+	}
+
+	ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+	if (ret < 0) {
+		printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+		exit(73);
+	}
+
+	ret = ftruncate(*memfd, dmabuf_size);
+	if (ret == -1) {
+		printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
+		exit(74);
+	}
+
+	memset(&create, 0, sizeof(create));
+
+	create.memfd = *memfd;
+	create.offset = 0;
+	create.size = dmabuf_size;
+	*buf = ioctl(*devfd, UDMABUF_CREATE, &create);
+	if (*buf < 0) {
+		printf("%s: [FAIL, create udmabuf]\n", TEST_PREFIX);
+		exit(75);
+	}
+}
+
+int do_server(void)
+{
+	int devfd, memfd, buf, ret;
+	size_t dmabuf_size;
+	struct ynl_sock *ys;
+
+	dmabuf_size = getpagesize() * NUM_PAGES;
+
+	create_udmabuf(&devfd, &memfd, &buf, dmabuf_size);
+
+	bind_rx_queue(3 /* index for eth1 */, buf, queue_num, &ys);
+
+	char *buf_mem = NULL;
+	buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		       buf, 0);
+	if (buf_mem == MAP_FAILED) {
+		perror("mmap()");
+		exit(1);
+	}
+
+	/* Need to trigger the NIC to reallocate its RX pages, otherwise the
+	 * bind doesn't take effect.
+	 */
+	trigger_device_reset();
+
+	sleep(1);
+
+	reset_flow_steering();
+	configure_flow_steering();
+
+	struct sockaddr_in server_sin;
+	server_sin.sin_family = AF_INET;
+	server_sin.sin_port = htons(atoi(port));
+
+	ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr);
+	if (ret != 1) {
+		printf("%s: [FAIL, parse server address]\n", TEST_PREFIX);
+		exit(79);
+	}
+
+	int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0);
+	if (socket_fd < 0) {
+		printf("%s: [FAIL, create socket]\n", TEST_PREFIX);
+		exit(76);
+	}
+
+	int opt = 1;
+	ret = setsockopt(socket_fd, SOL_SOCKET,
+			 SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt,
+			 sizeof(opt));
+	if (ret) {
+		printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX,
+		       strerror(errno));
+		exit(76);
+	}
+
+	printf("binding to address %s:%d\n", server_ip,
+	       ntohs(server_sin.sin_port));
+
+	ret = bind(socket_fd, &server_sin, sizeof(server_sin));
+	if (ret) {
+		printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno));
+		exit(76);
+	}
+
+	ret = listen(socket_fd, 1);
+	if (ret) {
+		printf("%s: [FAIL, listen]: %s\n", TEST_PREFIX,
+		       strerror(errno));
+		exit(76);
+	}
+
+	struct sockaddr_in client_addr;
+	socklen_t client_addr_len = sizeof(client_addr);
+
+	char buffer[256];
+
+	inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer,
+		  sizeof(buffer));
+	printf("Waiting for connection on %s:%d\n", buffer,
+	       ntohs(server_sin.sin_port));
+	int client_fd = accept(socket_fd, &client_addr, &client_addr_len);
+
+	inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer,
+		  sizeof(buffer));
+	printf("Got connection from %s:%d\n", buffer,
+	       ntohs(client_addr.sin_port));
+
+	char iobuf[819200];
+	char ctrl_data[sizeof(int) * 20000];
+
+	size_t total_received = 0;
+	size_t i = 0;
+	size_t page_aligned_frags = 0;
+	size_t non_page_aligned_frags = 0;
+	while (1) {
+		bool is_devmem = false;
+		printf("\n\n");
+
+		struct msghdr msg = { 0 };
+		struct iovec iov = { .iov_base = iobuf,
+				     .iov_len = sizeof(iobuf) };
+		msg.msg_iov = &iov;
+		msg.msg_iovlen = 1;
+		msg.msg_control = ctrl_data;
+		msg.msg_controllen = sizeof(ctrl_data);
+		ssize_t ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM);
+		printf("recvmsg ret=%ld\n", ret);
+		if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
+			continue;
+		}
+		if (ret < 0) {
+			perror("recvmsg");
+			continue;
+		}
+		if (ret == 0) {
+			printf("client exited\n");
+			goto cleanup;
+		}
+
+		i++;
+		struct cmsghdr *cm = NULL;
+		struct cmsg_devmem *cmsg_devmem = NULL;
+		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+			if (cm->cmsg_level != SOL_SOCKET ||
+			    (cm->cmsg_type != SCM_DEVMEM_OFFSET &&
+			     cm->cmsg_type != SCM_DEVMEM_HEADER)) {
+				fprintf(stdout, "skipping non-devmem cmsg\n");
+				continue;
+			}
+
+			cmsg_devmem = (struct cmsg_devmem *)CMSG_DATA(cm);
+			is_devmem = true;
+
+			if (cm->cmsg_type == SCM_DEVMEM_HEADER) {
+				/* TODO: process data copied from skb's linear
+				 * buffer.
+				 */
+				fprintf(stdout,
+					"SCM_DEVMEM_HEADER. "
+					"cmsg_devmem->frag_size=%u\n",
+					cmsg_devmem->frag_size);
+
+				continue;
+			}
+
+			struct devmemtoken token = { cmsg_devmem->frag_token,
+						     1 };
+
+			total_received += cmsg_devmem->frag_size;
+			printf("received frag_page=%llu, in_page_offset=%llu,"
+			       " frag_offset=%llu, frag_size=%u, token=%u"
+			       " total_received=%lu\n",
+			       cmsg_devmem->frag_offset >> PAGE_SHIFT,
+			       cmsg_devmem->frag_offset % PAGE_SIZE,
+			       cmsg_devmem->frag_offset, cmsg_devmem->frag_size,
+			       cmsg_devmem->frag_token, total_received);
+
+			if (cmsg_devmem->frag_size % PAGE_SIZE)
+				non_page_aligned_frags++;
+			else
+				page_aligned_frags++;
+
+			struct dma_buf_sync sync = { 0 };
+			sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START;
+			ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync);
+
+			if (do_validation)
+				validate_buffer(
+					((unsigned char *)buf_mem) +
+						cmsg_devmem->frag_offset,
+					cmsg_devmem->frag_size);
+			else
+				print_nonzero_bytes(
+					((unsigned char *)buf_mem) +
+						cmsg_devmem->frag_offset,
+					cmsg_devmem->frag_size);
+
+			sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END;
+			ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync);
+
+			ret = setsockopt(client_fd, SOL_SOCKET,
+					 SO_DEVMEM_DONTNEED, &token,
+					 sizeof(token));
+			if (ret != 1) {
+				perror("SO_DEVMEM_DONTNEED not enough tokens");
+				exit(1);
+			}
+		}
+		if (!is_devmem)
+			printf("flow steering error\n");
+
+		printf("total_received=%lu\n", total_received);
+	}
+
+cleanup:
+	fprintf(stdout, "%s: ok\n", TEST_PREFIX);
+
+	fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n",
+		page_aligned_frags, non_page_aligned_frags);
+
+	munmap(buf_mem, dmabuf_size);
+	close(client_fd);
+	close(socket_fd);
+	close(buf);
+	close(memfd);
+	close(devfd);
+	ynl_sock_destroy(ys);
+	trigger_device_reset();
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int is_server = 0, opt;
+
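+	/* Option -> variable mapping: -l run as server, -s server_ip,
+	 * -c client_ip, -p port, -v do_validation (validate_buffer() vs.
+	 * print_nonzero_bytes()), -q queue_num, -f ifname, -n nic_pci_addr,
+	 * -i iterations.
+	 */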
+	while ((opt = getopt(argc, argv, "ls:c:p:v:q:f:n:i:")) != -1) {
+		switch (opt) {
+		case 'l':
+			is_server = 1;
+			break;
+		case 's':
+			server_ip = optarg;
+			break;
+		case 'c':
+			client_ip = optarg;
+			break;
+		case 'p':
+			port = optarg;
+			break;
+		case 'v':
+			do_validation = atoll(optarg);
+			break;
+		case 'q':
+			queue_num = atoi(optarg);
+			break;
+		case 'f':
+			ifname = optarg;
+			break;
+		case 'n':
+			nic_pci_addr = optarg;
+			break;
+		case 'i':
+			iterations = atoll(optarg);
+			break;
+		case '?':
+			printf("unknown option: %c\n", optopt);
+			break;
+		}
+	}
+
+	for (; optind < argc; optind++) {
+		printf("extra arguments: %s\n", argv[optind]);
+	}
+
+	if (is_server)
+		return do_server();
+
+	return 0;
+}
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (10 preceding siblings ...)
  2023-08-10  1:57 ` [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP Mina Almasry
@ 2023-08-10 10:29 ` Christian König
  2023-08-10 16:06   ` Jason Gunthorpe
  2023-08-10 18:44   ` Mina Almasry
  2023-08-14  1:12 ` David Ahern
  2023-08-15 13:38 ` David Laight
  13 siblings, 2 replies; 62+ messages in thread
From: Christian König @ 2023-08-10 10:29 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

Am 10.08.23 um 03:57 schrieb Mina Almasry:
> Changes in RFC v2:
> ------------------
>
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly.

Impressive work, I didn't think that this would be possible that "easily".

Please note that we have considered replacing scatterlists with simple 
arrays of DMA-addresses in the DMA-buf framework to avoid people trying 
to access the struct page inside the scatterlist.

It might be a good idea to push for that first before this here is 
finally implemented.

GPU drivers already convert the scatterlist used to arrays of 
DMA-addresses as soon as they get them. This leaves RDMA and V4L as the 
other two main users which would need to be converted.
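
For illustration only, a minimal sketch of the conversion described
above, flattening a mapped dma-buf sg_table into a plain array of DMA
address/length pairs; the struct and function names here are made up,
not something DMA-buf currently provides:

	struct dma_span {
		dma_addr_t addr;
		unsigned int len;
	};

	static struct dma_span *flatten_sgt(struct sg_table *sgt,
					    unsigned int *nents)
	{
		struct scatterlist *sg;
		struct dma_span *spans;
		unsigned int i;

		spans = kvmalloc_array(sgt->nents, sizeof(*spans), GFP_KERNEL);
		if (!spans)
			return NULL;

		/* Walk only the DMA-mapped entries; no struct page access. */
		for_each_sgtable_dma_sg(sgt, sg, i) {
			spans[i].addr = sg_dma_address(sg);
			spans[i].len = sg_dma_len(sg);
		}

		*nents = sgt->nents;
		return spans;
	}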

>   This is the approach proposed at a high level here[2].
>
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>     page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>     implementing the TX path with scatterlist approach, but leaving
>     out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>     perf testing yet. I'm not sure there are regressions, but I removed
>     perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
>
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
>
> 1. Feedback on the scatterlist-based approach in general.

As far as I can see this sounds like the right thing to do in general.

Question is rather if we should stick with scatterlist, use array of 
DMA-addresses or maybe even come up with a completely new structure.

> 2. Netlink API (Patch 1 & 2).

How does netlink manage the lifetime of objects?

> 3. Approach to handle all the drivers that expect to receive pages from
>     the page pool (Patch 6).

Can't say anything about that. I know TCP/IP inside out, but I'm a GPU 
and not a network driver author.

Regards,
Christian.

>
> [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
> [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/
>
> ----------------------
>
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
>
> * Problem:
>
> A large amount of data transfers have device memory as the source and/or
> destination. Accelerators drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage into
>    GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
>    TPU compute time, improving data transfer throughput & efficiency can help
>    improving GPU/TPU utilization.
>
> - Distributed training, where ML accelerators, such as GPUs on different hosts,
>    exchange data among them.
>
> - Distributed raw block storage applications transfer large amounts of data with
>    remote SSDs, much of this data does not require host processing.
>
> Today, the majority of the Device-to-Device data transfers the network are
> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
>
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
>
> * Proposal:
>
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
>
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
>
> Packet _payloads_ go directly from the NIC to device memory for receive and from
> device memory to NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
>
> Advantages:
>
> - Alleviate host memory bandwidth pressure, compared to existing
>   network-transfer + device-copy semantics.
>
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>    of the PCIe tree, compared to traditional path which sends data through the
>    root complex.
>
> * Patch overview:
>
> ** Part 1: netlink API
>
> Gives user ability to bind dma-buf to an RX queue.
>
> ** Part 2: scatterlist support
>
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, networking stack (skbs, drivers, and
> page pool) operate on pages. We have 2 options:
>
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
>
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
>
> ** part 3: page pool support
>
> We piggy back on page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
>
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
>
> ** part 4: support for unreadable skb frags
>
> Page pool iovs are not accessible by the host; we implement changes
> throughput the networking stack to correctly handle skbs with unreadable
> frags.
>
> ** Part 5: recvmsg() APIs
>
> We define user APIs for the user to send and receive device memory.
>
> Not included with this RFC is the GVE devmem TCP support, just to
> simplify the review. Code available here if desired:
> https://github.com/mina/linux/tree/tcpdevmem
>
> This RFC is built on top of net-next with Jakub's pp-providers changes
> cherry-picked.
>
> * NIC dependencies:
>
> 1. (strict) Devmem TCP require the NIC to support header split, i.e. the
>     capability to split incoming packets into a header + payload and to put
>     each into a separate buffer. Devmem TCP works by using device memory
>     for the packet payload, and host memory for the packet headers.
>
> 2. (optional) Devmem TCP works better with flow steering support & RSS support,
>     i.e. the NIC's ability to steer flows into certain rx queues. This allows the
>     sysadmin to enable devmem TCP on a subset of the rx queues, and steer
>     devmem TCP traffic onto these queues and non devmem TCP elsewhere.
>
> The NIC I have access to with these properties is the GVE with DQO support
> running in Google Cloud, but any NIC that supports these features would suffice.
> I may be able to help reviewers bring up devmem TCP on their NICs.
>
> * Testing:
>
> The series includes a udmabuf kselftest that show a simple use case of
> devmem TCP and validates the entire data path end to end without
> a dependency on a specific dmabuf provider.
>
> ** Test Setup
>
> Kernel: net-next with this RFC and memory provider API cherry-picked
> locally.
>
> Hardware: Google Cloud A3 VMs.
>
> NIC: GVE with header split & RSS & flow steering support.
>
> Mina Almasry (11):
>    net: add netdev netlink api to bind dma-buf to a net device
>    netdev: implement netlink api to bind dma-buf to netdevice
>    netdev: implement netdevice devmem allocator
>    memory-provider: updates to core provider API for devmem TCP
>    memory-provider: implement dmabuf devmem memory provider
>    page-pool: add device memory support
>    net: support non paged skb frags
>    net: add support for skbs with unreadable frags
>    tcp: implement recvmsg() RX path for devmem TCP
>    net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
>    selftests: add ncdevmem, netcat for devmem TCP
>
>   Documentation/netlink/specs/netdev.yaml |  27 ++
>   include/linux/netdevice.h               |  61 +++
>   include/linux/skbuff.h                  |  54 ++-
>   include/linux/socket.h                  |   1 +
>   include/net/page_pool.h                 | 186 ++++++++-
>   include/net/sock.h                      |   2 +
>   include/net/tcp.h                       |   5 +-
>   include/uapi/asm-generic/socket.h       |   6 +
>   include/uapi/linux/netdev.h             |  10 +
>   include/uapi/linux/uio.h                |  10 +
>   net/core/datagram.c                     |   6 +
>   net/core/dev.c                          | 214 ++++++++++
>   net/core/gro.c                          |   2 +-
>   net/core/netdev-genl-gen.c              |  14 +
>   net/core/netdev-genl-gen.h              |   1 +
>   net/core/netdev-genl.c                  | 103 +++++
>   net/core/page_pool.c                    | 171 ++++++--
>   net/core/skbuff.c                       |  80 +++-
>   net/core/sock.c                         |  36 ++
>   net/ipv4/tcp.c                          | 196 ++++++++-
>   net/ipv4/tcp_input.c                    |  13 +-
>   net/ipv4/tcp_ipv4.c                     |   7 +
>   net/ipv4/tcp_output.c                   |   5 +-
>   net/packet/af_packet.c                  |   4 +-
>   tools/include/uapi/linux/netdev.h       |  10 +
>   tools/net/ynl/generated/netdev-user.c   |  41 ++
>   tools/net/ynl/generated/netdev-user.h   |  46 ++
>   tools/testing/selftests/net/.gitignore  |   1 +
>   tools/testing/selftests/net/Makefile    |   5 +
>   tools/testing/selftests/net/ncdevmem.c  | 534 ++++++++++++++++++++++++
>   30 files changed, 1787 insertions(+), 64 deletions(-)
>   create mode 100644 tools/testing/selftests/net/ncdevmem.c
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device
  2023-08-10  1:57 ` [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device Mina Almasry
@ 2023-08-10 16:04   ` Samudrala, Sridhar
  2023-08-11  2:19     ` Mina Almasry
  0 siblings, 1 reply; 62+ messages in thread
From: Samudrala, Sridhar @ 2023-08-10 16:04 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Amritha Nambiar



On 8/9/2023 6:57 PM, Mina Almasry wrote:
> API takes the dma-buf fd as input, and binds it to the netdevice. The
> user can specify the rx queue to bind the dma-buf to. The user should be
> able to bind the same dma-buf to multiple queues, but that is left as
> a (minor) TODO in this iteration.

To support binding dma-buf fd to multiple queues, can we extend/change 
this interface to bind dma-buf fd to a napi_id? Amritha is currently 
working on a patchset that exposes napi_id's and their association with 
the queues.

https://lore.kernel.org/netdev/169059098829.3736.381753570945338022.stgit@anambiarhost.jf.intel.com/


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10 10:29 ` [RFC PATCH v2 00/11] Device Memory TCP Christian König
@ 2023-08-10 16:06   ` Jason Gunthorpe
  2023-08-10 18:44   ` Mina Almasry
  1 sibling, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2023-08-10 16:06 UTC (permalink / raw)
  To: Christian König
  Cc: Mina Almasry, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
> Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > Changes in RFC v2:
> > ------------------
> > 
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly.
> 
> Impressive work, I didn't thought that this would be possible that "easily".
> 
> Please note that we have considered replacing scatterlists with simple
> arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
> access the struct page inside the scatterlist.
> 
> It might be a good idea to push for that first before this here is finally
> implemented.
> 
> GPU drivers already convert the scatterlist used to arrays of DMA-addresses
> as soon as they get them. This leaves RDMA and V4L as the other two main
> users which would need to be converted.

Oh that would be a nightmare for RDMA.

We need a standard based way to have scalable lists of DMA addresses :(

> > 2. Netlink API (Patch 1 & 2).
> 
> How does netlink manage the lifetime of objects?

And access control..

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10 10:29 ` [RFC PATCH v2 00/11] Device Memory TCP Christian König
  2023-08-10 16:06   ` Jason Gunthorpe
@ 2023-08-10 18:44   ` Mina Almasry
  2023-08-10 18:58     ` Jason Gunthorpe
  2023-08-11 11:02     ` Christian König
  1 sibling, 2 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-10 18:44 UTC (permalink / raw)
  To: Christian König
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Sumit Semwal, Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

On Thu, Aug 10, 2023 at 3:29 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > Changes in RFC v2:
> > ------------------
> >
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly.
>
> Impressive work, I didn't thought that this would be possible that "easily".
>
> Please note that we have considered replacing scatterlists with simple
> arrays of DMA-addresses in the DMA-buf framework to avoid people trying
> to access the struct page inside the scatterlist.
>

FWIW, I'm not doing anything with the struct pages inside the
scatterlist. All I need from the scatterlist are the
sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array
you're describing will provide exactly those, but let me know if I
misunderstood.

> It might be a good idea to push for that first before this here is
> finally implemented.
>
> GPU drivers already convert the scatterlist used to arrays of
> DMA-addresses as soon as they get them. This leaves RDMA and V4L as the
> other two main users which would need to be converted.
>
> >   This is the approach proposed at a high level here[2].
> >
> > Detailed changes:
> > 1. Replaced dma-buf pages approach with importing scatterlist into the
> >     page pool.
> > 2. Replace the dma-buf pages centric API with a netlink API.
> > 3. Removed the TX path implementation - there is no issue with
> >     implementing the TX path with scatterlist approach, but leaving
> >     out the TX path makes it easier to review.
> > 4. Functionality is tested with this proposal, but I have not conducted
> >     perf testing yet. I'm not sure there are regressions, but I removed
> >     perf claims from the cover letter until they can be re-confirmed.
> > 5. Added Signed-off-by: contributors to the implementation.
> > 6. Fixed some bugs with the RX path since RFC v1.
> >
> > Any feedback welcome, but specifically the biggest pending questions
> > needing feedback IMO are:
> >
> > 1. Feedback on the scatterlist-based approach in general.
>
> As far as I can see this sounds like the right thing to do in general.
>
> Question is rather if we should stick with scatterlist, use array of
> DMA-addresses or maybe even come up with a completely new structure.
>

As far as I can tell, it should be trivial to switch this device
memory TCP implementation to anything that provides:

1. DMA-addresses (sg_dma_address() equivalent)
2. lengths (sg_dma_len() equivalent)

if you go that route. Specifically, I think it will be pretty much a
localized change to netdev_bind_dmabuf_to_queue() implemented in this
patch:
https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

> > 2. Netlink API (Patch 1 & 2).
>
> How does netlink manage the lifetime of objects?
>

Netlink itself doesn't handle the lifetime of the binding. However,
the API I implemented unbinds the dma-buf when the netlink socket is
destroyed. I do this so that even if the user process crashes or
forgets to unbind, the dma-buf will still be unbound once the netlink
socket is closed on the process exit. Details in this patch:
https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
> > Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > > Changes in RFC v2:
> > > ------------------
> > >
> > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > > that attempts to resolve this by implementing scatterlist support in the
> > > networking stack, such that we can import the dma-buf scatterlist
> > > directly.
> >
> > Impressive work, I didn't thought that this would be possible that "easily".
> >
> > Please note that we have considered replacing scatterlists with simple
> > arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
> > access the struct page inside the scatterlist.
> >
> > It might be a good idea to push for that first before this here is finally
> > implemented.
> >
> > GPU drivers already convert the scatterlist used to arrays of DMA-addresses
> > as soon as they get them. This leaves RDMA and V4L as the other two main
> > users which would need to be converted.
>
> Oh that would be a nightmare for RDMA.
>
> We need a standard based way to have scalable lists of DMA addresses :(
>
> > > 2. Netlink API (Patch 1 & 2).
> >
> > How does netlink manage the lifetime of objects?
>
> And access control..
>

Someone will correct me if I'm wrong but I'm not sure netlink itself
will do (sufficient) access control. However I meant for the netlink
API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
check in netdev_bind_dmabuf_to_queue() in the next iteration.
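
For concreteness, the check could be as small as the guard below at the
top of netdev_bind_dmabuf_to_queue(); whether it ends up open-coded like
this or expressed as the GENL_ADMIN_PERM flag on the netlink op is an
open implementation choice, so treat this as a sketch:

	if (!capable(CAP_NET_ADMIN))
		return -EPERM;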

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10 18:44   ` Mina Almasry
@ 2023-08-10 18:58     ` Jason Gunthorpe
  2023-08-11  1:56       ` Mina Almasry
  2023-08-11 11:02     ` Christian König
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2023-08-10 18:58 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Christian König, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:

> Someone will correct me if I'm wrong but I'm not sure netlink itself
> will do (sufficient) access control. However I meant for the netlink
> API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> check in netdev_bind_dmabuf_to_queue() in the next iteration.

Can some other process that does not have the netlink fd manage to
recv packets that were stored into the dmabuf?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10 18:58     ` Jason Gunthorpe
@ 2023-08-11  1:56       ` Mina Almasry
  0 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-11  1:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Thu, Aug 10, 2023 at 11:58 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:
>
> > Someone will correct me if I'm wrong but I'm not sure netlink itself
> > will do (sufficient) access control. However I meant for the netlink
> > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> > check in netdev_bind_dmabuf_to_queue() in the next iteration.
>
> Can some other process that does not have the netlink fd manage to
> recv packets that were stored into the dmabuf?
>

The process needs to have the dma-buf fd to receive packets, and not
necessarily the netlink fd. It should be possible for:

1. a CAP_NET_ADMIN process to create a dma-buf, bind it using a
netlink fd, then share the dma-buf with another normal process that
receives packets on it.
2. a normal process to create a dma-buf, share it with a privileged
CAP_NET_ADMIN process that creates the binding via the netlink fd,
then the owner of the dma-buf can receive data on the dma-buf fd.
3. a CAP_NET_ADMIN creates the dma-buf and creates the binding itself
and receives data.

We in particular plan to use devmem TCP in the first mode, but this
detail is specific to us so I've largely refrained from describing it
in the cover letter. In case our setup is interesting:
the CAP_NET_ADMIN process I describe in #1 is a 'tcpdevmem daemon'
which allocates the GPU memory, creates a dma-buf, creates an RX queue
binding, and shares the dma-buf with the ML application(s) running on
our instance. The ML application receives data on the dma-buf via
recvmsg().

The 'tcpdevmem daemon' takes care of the binding but also configures
RSS & flow steering. The dma-buf fd sharing happens over a unix domain
socket.
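
The fd handoff itself is ordinary SCM_RIGHTS passing over that unix
domain socket; a minimal userspace sketch (the function name and the
lack of error handling are just for illustration):

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	static int send_dmabuf_fd(int unix_sock, int dmabuf_fd)
	{
		char byte = 0;
		struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
		union {
			char buf[CMSG_SPACE(sizeof(int))];
			struct cmsghdr align;
		} u = { { 0 } };
		struct msghdr msg = {
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = u.buf,
			.msg_controllen = sizeof(u.buf),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		/* The dma-buf fd travels in the ancillary data. */
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

		return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
	}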

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device
  2023-08-10 16:04   ` Samudrala, Sridhar
@ 2023-08-11  2:19     ` Mina Almasry
  0 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-11  2:19 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf,
	Amritha Nambiar

On Thu, Aug 10, 2023 at 9:09 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
>
> On 8/9/2023 6:57 PM, Mina Almasry wrote:
> > API takes the dma-buf fd as input, and binds it to the netdevice. The
> > user can specify the rx queue to bind the dma-buf to. The user should be
> > able to bind the same dma-buf to multiple queues, but that is left as
> > a (minor) TODO in this iteration.
>
> To support binding dma-buf fd to multiple queues, can we extend/change
> this interface to bind dma-buf fd to a napi_id? Amritha is currently
> working on a patchset that exposes napi_id's and their association with
> the queues.
>
> https://lore.kernel.org/netdev/169059098829.3736.381753570945338022.stgit@anambiarhost.jf.intel.com/
>

Thank you Sridhar,

I think honestly implementing multiple rx queue binding is trivial,
even without the napi_id association. The user should be able to call
the bind-rx API multiple times with the same dma-buf to bind to
multiple queues, or I can convert the queue-idx to a multi-attr
netlink attribute to let the user specify multiple rx queues in 1
call.

Without doing some homework it's not immediately obvious to me that
coupling the dma-buf binding with the napi_id is necessary or
advantageous. Is there a reason coupling those is better?

It seems like napi_id can also refer to TX queues, and binding a
dma-buf with a TX queue doesn't make much sense to me. For TX we need
to couple the dma-buf with the netdev that's sending the dma-buf data,
but not a specific TX queue on the netdev, I think.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10 18:44   ` Mina Almasry
  2023-08-10 18:58     ` Jason Gunthorpe
@ 2023-08-11 11:02     ` Christian König
  1 sibling, 0 replies; 62+ messages in thread
From: Christian König @ 2023-08-11 11:02 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Sumit Semwal, Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

Am 10.08.23 um 20:44 schrieb Mina Almasry:
> On Thu, Aug 10, 2023 at 3:29 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 10.08.23 um 03:57 schrieb Mina Almasry:
>>> Changes in RFC v2:
>>> ------------------
>>>
>>> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
>>> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
>>> that attempts to resolve this by implementing scatterlist support in the
>>> networking stack, such that we can import the dma-buf scatterlist
>>> directly.
>> Impressive work, I didn't thought that this would be possible that "easily".
>>
>> Please note that we have considered replacing scatterlists with simple
>> arrays of DMA-addresses in the DMA-buf framework to avoid people trying
>> to access the struct page inside the scatterlist.
>>
> FWIW, I'm not doing anything with the struct pages inside the
> scatterlist. All I need from the scatterlist are the
> sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array
> you're describing will provide exactly those, but let me know if I
> misunderstood.

Your understanding is perfectly correct.

>
>> It might be a good idea to push for that first before this here is
>> finally implemented.
>>
>> GPU drivers already convert the scatterlist used to arrays of
>> DMA-addresses as soon as they get them. This leaves RDMA and V4L as the
>> other two main users which would need to be converted.
>>
>>>    This is the approach proposed at a high level here[2].
>>>
>>> Detailed changes:
>>> 1. Replaced dma-buf pages approach with importing scatterlist into the
>>>      page pool.
>>> 2. Replace the dma-buf pages centric API with a netlink API.
>>> 3. Removed the TX path implementation - there is no issue with
>>>      implementing the TX path with scatterlist approach, but leaving
>>>      out the TX path makes it easier to review.
>>> 4. Functionality is tested with this proposal, but I have not conducted
>>>      perf testing yet. I'm not sure there are regressions, but I removed
>>>      perf claims from the cover letter until they can be re-confirmed.
>>> 5. Added Signed-off-by: contributors to the implementation.
>>> 6. Fixed some bugs with the RX path since RFC v1.
>>>
>>> Any feedback welcome, but specifically the biggest pending questions
>>> needing feedback IMO are:
>>>
>>> 1. Feedback on the scatterlist-based approach in general.
>> As far as I can see this sounds like the right thing to do in general.
>>
>> Question is rather if we should stick with scatterlist, use array of
>> DMA-addresses or maybe even come up with a completely new structure.
>>
> As far as I can tell, it should be trivial to switch this device
> memory TCP implementation to anything that provides:
>
> 1. DMA-addresses (sg_dma_address() equivalent)
> 2. lengths (sg_dma_len() equivalent)
>
> if you go that route. Specifically, I think it will be pretty much a
> localized change to netdev_bind_dmabuf_to_queue() implemented in this
> patch:
> https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

Thanks, that's exactly what I wanted to hear.

>
>>> 2. Netlink API (Patch 1 & 2).
>> How does netlink manage the lifetime of objects?
>>
> Netlink itself doesn't handle the lifetime of the binding. However,
> the API I implemented unbinds the dma-buf when the netlink socket is
> destroyed. I do this so that even if the user process crashes or
> forgets to unbind, the dma-buf will still be unbound once the netlink
> socket is closed on the process exit. Details in this patch:
> https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f

I need to double check the details, but at least offhand that sounds 
sufficient for the lifetime requirements of DMA-buf.

Thanks,
Christian.

>
> On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
>>> Am 10.08.23 um 03:57 schrieb Mina Almasry:
>>>> Changes in RFC v2:
>>>> ------------------
>>>>
>>>> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
>>>> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
>>>> that attempts to resolve this by implementing scatterlist support in the
>>>> networking stack, such that we can import the dma-buf scatterlist
>>>> directly.
>>> Impressive work, I didn't thought that this would be possible that "easily".
>>>
>>> Please note that we have considered replacing scatterlists with simple
>>> arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
>>> access the struct page inside the scatterlist.
>>>
>>> It might be a good idea to push for that first before this here is finally
>>> implemented.
>>>
>>> GPU drivers already convert the scatterlist used to arrays of DMA-addresses
>>> as soon as they get them. This leaves RDMA and V4L as the other two main
>>> users which would need to be converted.
>> Oh that would be a nightmare for RDMA.
>>
>> We need a standard based way to have scalable lists of DMA addresses :(
>>
>>>> 2. Netlink API (Patch 1 & 2).
>>> How does netlink manage the lifetime of objects?
>> And access control..
>>
> Someone will correct me if I'm wrong but I'm not sure netlink itself
> will do (sufficient) access control. However I meant for the netlink
> API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> check in netdev_bind_dmabuf_to_queue() in the next iteration.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
@ 2023-08-13 11:26   ` Leon Romanovsky
  2023-08-14  1:10   ` David Ahern
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: Leon Romanovsky @ 2023-08-13 11:26 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf,
	Willem de Bruijn, Kaiyuan Zhang

On Wed, Aug 09, 2023 at 06:57:38PM -0700, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> an rx queue on the netdevice. On the binding, the dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> mapping will be inserted into a genpool to make it ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We setup the page_pool_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leaves an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool. This issue/weirdness is highlighted in the memory provider
> proposal[1], and I'm hoping that some generic solution for all
> memory providers will be discussed; this patch doesn't address that
> weirdness again.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> [1] https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>  include/linux/netdevice.h |  57 ++++++++++++
>  include/net/page_pool.h   |  27 ++++++
>  net/core/dev.c            | 178 ++++++++++++++++++++++++++++++++++++++
>  net/core/netdev-genl.c    | 101 ++++++++++++++++++++-
>  4 files changed, 361 insertions(+), 2 deletions(-)

<...>

> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> +	refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> +	if (!refcount_dec_and_test(&binding->ref))
> +		return;
> +
> +	__netdev_devmem_binding_free(binding);
> +}

Not a big deal, but it looks like a reimplemented version of kref_get/kref_put to me.
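
A rough sketch of that conversion, assuming binding->ref becomes a
struct kref (names follow the RFC, the conversion itself is
hypothetical):

	/* refcount_set(&binding->ref, 1) would become kref_init(&binding->ref) */

	static void netdev_devmem_binding_release(struct kref *ref)
	{
		struct netdev_dmabuf_binding *binding =
			container_of(ref, struct netdev_dmabuf_binding, ref);

		__netdev_devmem_binding_free(binding);
	}

	static inline void
	netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
	{
		kref_get(&binding->ref);
	}

	static inline void
	netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
	{
		kref_put(&binding->ref, netdev_devmem_binding_release);
	}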

Thanks

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
  2023-08-13 11:26   ` Leon Romanovsky
@ 2023-08-14  1:10   ` David Ahern
  2023-08-14  3:15     ` Mina Almasry
  2023-08-16  0:16     ` Jakub Kicinski
  2023-08-30 12:38   ` Yunsheng Lin
  2023-09-08  0:47   ` David Wei
  3 siblings, 2 replies; 62+ messages in thread
From: David Ahern @ 2023-08-14  1:10 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf, Willem de Bruijn, Kaiyuan Zhang

On 8/9/23 7:57 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8e7d0cb540cd..02a25ccf771a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -151,6 +151,8 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/prandom.h>
>  #include <linux/once_lite.h>
> +#include <linux/genalloc.h>
> +#include <linux/dma-buf.h>
> 
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
>  	return call_netdevice_notifiers_info(val, &info.info);
>  }
> 
> +/* Device memory support */
> +
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +					   struct gen_pool_chunk *chunk,
> +					   void *not_used)
> +{
> +	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> +	kvfree(owner->ppiovs);
> +	kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +	size_t size, avail;
> +
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +
> +	size = gen_pool_size(binding->chunk_pool);
> +	avail = gen_pool_avail(binding->chunk_pool);
> +
> +	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +		  size, avail))
> +		gen_pool_destroy(binding->chunk_pool);
> +
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +	dma_buf_detach(binding->dmabuf, binding->attachment);
> +	dma_buf_put(binding->dmabuf);
> +	kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *rxq;
> +	unsigned long xa_idx;
> +
> +	list_del_rcu(&binding->list);
> +
> +	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> +		if (rxq->binding == binding)
> +			rxq->binding = NULL;
> +
> +	netdev_devmem_binding_put(binding);

This does a put on the binding but it does not notify the driver that
that the dmabuf references need to be flushed from the rx queue.

Also, what about the device getting deleted - e.g., the driver is removed?

> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> +				u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct netdev_rx_queue *rxq;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	u32 xa_idx;
> +	int err;
> +
> +	rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> +	if (rxq->binding)
> +		return -EEXIST;

So this proposal is limiting a binding to a single dmabuf at a time? Is
this just for the RFC?

Also, this suggests that the Rx queue is unique to the flow. I do not
recall a netdev API to create H/W queues on the fly (only a passing
comment from Kuba), so how is the H/W queue (or queue set since a
completion queue is needed as well) created for the flow? And in turn if
it is unique to the flow, what deletes the queue if an app does not do a
proper cleanup? If the queue sticks around, the dmabuf references stick
around.

Also, if this is an app specific h/w queue, flow steering is not
mentioned in this RFC.

> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);
> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;
> +		virtual += len;
> +	}
> +
> +	/* TODO: need to be able to bind to multiple rx queues on the same
> +	 * netdevice. The code should already be able to handle that with
> +	 * minimal changes, but the netlink API currently allows for 1 rx
> +	 * queue.
> +	 */
> +	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		       GFP_KERNEL);
> +	if (err)
> +		goto err_free_chunks;
> +
> +	rxq->binding = binding;
> +	*out = binding;
> +
> +	return 0;
> +
> +err_free_chunks:
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +	gen_pool_destroy(binding->chunk_pool);
> +err_unmap:
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +err_detach:
> +	dma_buf_detach(dmabuf, binding->attachment);
> +err_free_binding:
> +	kfree(binding);
> +err_put_dmabuf:
> +	dma_buf_put(dmabuf);
> +	return err;
> +}
> +
>  #ifdef CONFIG_NET_INGRESS
>  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
> 
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index bf7324dd6c36..288ed0112995 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
> 
> +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> +				 void *_notify)
> +{
> +	struct netlink_notify *notify = _notify;
> +	struct netdev_dmabuf_binding *rbinding;
> +
> +	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> +		return NOTIFY_DONE;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> +		if (rbinding->owner_nlportid == notify->portid) {
> +			netdev_unbind_dmabuf_to_queue(rbinding);

This ties the removal of a dmabuf to the close of the netlink socket as
suggested in the previous round of comments. What happens if the process
closes the dmabuf fd? Is the outstanding dev binding sufficient to keep
the allocation / use in place?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (11 preceding siblings ...)
  2023-08-10 10:29 ` [RFC PATCH v2 00/11] Device Memory TCP Christian König
@ 2023-08-14  1:12 ` David Ahern
  2023-08-14  2:11   ` Mina Almasry
  2023-08-17 18:00   ` Pavel Begunkov
  2023-08-15 13:38 ` David Laight
  13 siblings, 2 replies; 62+ messages in thread
From: David Ahern @ 2023-08-14  1:12 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> ------------------
> 
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly. This is the approach proposed at a high level here[2].
> 
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>    page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>    implementing the TX path with scatterlist approach, but leaving
>    out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>    perf testing yet. I'm not sure there are regressions, but I removed
>    perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
> 
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
> 
> 1. Feedback on the scatterlist-based approach in general.
> 2. Netlink API (Patch 1 & 2).
> 3. Approach to handle all the drivers that expect to receive pages from
>    the page pool (Patch 6).
> 
> [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
> [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/
> 
> ----------------------
> 
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
> 
> * Problem:
> 
> A large amount of data transfers have device memory as the source and/or
> destination. Accelerators drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage into
>   GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
>   TPU compute time, improving data transfer throughput & efficiency can help
>   improving GPU/TPU utilization.
> 
> - Distributed training, where ML accelerators, such as GPUs on different hosts,
>   exchange data among them.
> 
> - Distributed raw block storage applications transfer large amounts of data with
>   remote SSDs, much of this data does not require host processing.
> 
> Today, the majority of the Device-to-Device data transfers the network are
> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
> 
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
> 
> * Proposal:
> 
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
> 
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
> 
> Packet _payloads_ go directly from the NIC to device memory for receive and from
> device memory to NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
> 
> Advantages:
> 
> - Alleviate host memory bandwidth pressure, compared to existing
>  network-transfer + device-copy semantics.
> 
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>   of the PCIe tree, compared to traditional path which sends data through the
>   root complex.
> 
> * Patch overview:
> 
> ** Part 1: netlink API
> 
> Gives user ability to bind dma-buf to an RX queue.
> 
> ** Part 2: scatterlist support
> 
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, networking stack (skbs, drivers, and
> page pool) operate on pages. We have 2 options:
> 
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
> 
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> 
> ** part 3: page pool support
> 
> We piggy back on page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
> 
> ** part 4: support for unreadable skb frags
> 
> Page pool iovs are not accessible by the host; we implement changes
> throughput the networking stack to correctly handle skbs with unreadable
> frags.
> 
> ** Part 5: recvmsg() APIs
> 
> We define user APIs for the user to send and receive device memory.
> 
> Not included with this RFC is the GVE devmem TCP support, just to
> simplify the review. Code available here if desired:
> https://github.com/mina/linux/tree/tcpdevmem
> 
> This RFC is built on top of net-next with Jakub's pp-providers changes
> cherry-picked.
> 
> * NIC dependencies:
> 
> 1. (strict) Devmem TCP require the NIC to support header split, i.e. the
>    capability to split incoming packets into a header + payload and to put
>    each into a separate buffer. Devmem TCP works by using device memory
>    for the packet payload, and host memory for the packet headers.
> 
> 2. (optional) Devmem TCP works better with flow steering support & RSS support,
>    i.e. the NIC's ability to steer flows into certain rx queues. This allows the
>    sysadmin to enable devmem TCP on a subset of the rx queues, and steer
>    devmem TCP traffic onto these queues and non devmem TCP elsewhere.
> 
> The NIC I have access to with these properties is the GVE with DQO support
> running in Google Cloud, but any NIC that supports these features would suffice.
> I may be able to help reviewers bring up devmem TCP on their NICs.
> 
> * Testing:
> 
> The series includes a udmabuf kselftest that show a simple use case of
> devmem TCP and validates the entire data path end to end without
> a dependency on a specific dmabuf provider.
> 
> ** Test Setup
> 
> Kernel: net-next with this RFC and memory provider API cherry-picked
> locally.
> 
> Hardware: Google Cloud A3 VMs.
> 
> NIC: GVE with header split & RSS & flow steering support.

This set seems to depend on Jakub's memory provider patches and a netdev
driver change which is not included. For the testing mentioned here, you
must have a tree + branch with all of the patches. Is it publicly available?

It would be interesting to see how well (easy) this integrates with
io_uring. Besides avoiding all of the syscalls for receiving the iov and
releasing the buffers back to the pool, io_uring also brings in the
ability to seed a page_pool with registered buffers which provides a
means to get simpler Rx ZC for host memory.

Overall I like the intent and possibilities for extensions, but a lot of
details are missing - perhaps some are answered by seeing an end-to-end
implementation.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-14  1:12 ` David Ahern
@ 2023-08-14  2:11   ` Mina Almasry
  2023-08-17 18:00   ` Pavel Begunkov
  1 sibling, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-14  2:11 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Sun, Aug 13, 2023 at 6:12 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/9/23 7:57 PM, Mina Almasry wrote:
> > Changes in RFC v2:
> > ------------------
> >
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly. This is the approach proposed at a high level here[2].
> >
> > Detailed changes:
> > 1. Replaced dma-buf pages approach with importing scatterlist into the
> >    page pool.
> > 2. Replace the dma-buf pages centric API with a netlink API.
> > 3. Removed the TX path implementation - there is no issue with
> >    implementing the TX path with scatterlist approach, but leaving
> >    out the TX path makes it easier to review.
> > 4. Functionality is tested with this proposal, but I have not conducted
> >    perf testing yet. I'm not sure there are regressions, but I removed
> >    perf claims from the cover letter until they can be re-confirmed.
> > 5. Added Signed-off-by: contributors to the implementation.
> > 6. Fixed some bugs with the RX path since RFC v1.
> >
> > Any feedback welcome, but specifically the biggest pending questions
> > needing feedback IMO are:
> >
> > 1. Feedback on the scatterlist-based approach in general.
> > 2. Netlink API (Patch 1 & 2).
> > 3. Approach to handle all the drivers that expect to receive pages from
> >    the page pool (Patch 6).
> >
> > [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
> > [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/
> >
> > ----------------------
> >
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
> >
> > * Problem:
> >
> > A large amount of data transfers have device memory as the source and/or
> > destination. Accelerators drastically increased the volume of such transfers.
> > Some examples include:
> > - ML accelerators transferring large amounts of training data from storage into
> >   GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
> >   TPU compute time, improving data transfer throughput & efficiency can help
> >   improving GPU/TPU utilization.
> >
> > - Distributed training, where ML accelerators, such as GPUs on different hosts,
> >   exchange data among them.
> >
> > - Distributed raw block storage applications transfer large amounts of data with
> >   remote SSDs, much of this data does not require host processing.
> >
> > Today, the majority of the Device-to-Device data transfers the network are
> > implemented as the following low level operations: Device-to-Host copy,
> > Host-to-Host network transfer, and Host-to-Device copy.
> >
> > The implementation is suboptimal, especially for bulk data transfers, and can
> > put significant strains on system resources, such as host memory bandwidth,
> > PCIe bandwidth, etc. One important reason behind the current state is the
> > kernel’s lack of semantics to express device to network transfers.
> >
> > * Proposal:
> >
> > In this patch series we attempt to optimize this use case by implementing
> > socket APIs that enable the user to:
> >
> > 1. send device memory across the network directly, and
> > 2. receive incoming network packets directly into device memory.
> >
> > Packet _payloads_ go directly from the NIC to device memory for receive and from
> > device memory to NIC for transmit.
> > Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> > normally. The NIC _must_ support header split to achieve this.
> >
> > Advantages:
> >
> > - Alleviate host memory bandwidth pressure, compared to existing
> >  network-transfer + device-copy semantics.
> >
> > - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
> >   of the PCIe tree, compared to traditional path which sends data through the
> >   root complex.
> >
> > * Patch overview:
> >
> > ** Part 1: netlink API
> >
> > Gives user ability to bind dma-buf to an RX queue.
> >
> > ** Part 2: scatterlist support
> >
> > Currently the standard for device memory sharing is DMABUF, which doesn't
> > generate struct pages. On the other hand, networking stack (skbs, drivers, and
> > page pool) operate on pages. We have 2 options:
> >
> > 1. Generate struct pages for dmabuf device memory, or,
> > 2. Modify the networking stack to process scatterlist.
> >
> > Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> >
> > ** part 3: page pool support
> >
> > We piggy back on page pool memory providers proposal:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > It allows the page pool to define a memory provider that provides the
> > page allocation and freeing. It helps abstract most of the device memory
> > TCP changes from the driver.
> >
> > ** part 4: support for unreadable skb frags
> >
> > Page pool iovs are not accessible by the host; we implement changes
> > throughput the networking stack to correctly handle skbs with unreadable
> > frags.
> >
> > ** Part 5: recvmsg() APIs
> >
> > We define user APIs for the user to send and receive device memory.
> >
> > Not included with this RFC is the GVE devmem TCP support, just to
> > simplify the review. Code available here if desired:
> > https://github.com/mina/linux/tree/tcpdevmem
> >
> > This RFC is built on top of net-next with Jakub's pp-providers changes
> > cherry-picked.
> >
> > * NIC dependencies:
> >
> > 1. (strict) Devmem TCP require the NIC to support header split, i.e. the
> >    capability to split incoming packets into a header + payload and to put
> >    each into a separate buffer. Devmem TCP works by using device memory
> >    for the packet payload, and host memory for the packet headers.
> >
> > 2. (optional) Devmem TCP works better with flow steering support & RSS support,
> >    i.e. the NIC's ability to steer flows into certain rx queues. This allows the
> >    sysadmin to enable devmem TCP on a subset of the rx queues, and steer
> >    devmem TCP traffic onto these queues and non devmem TCP elsewhere.
> >
> > The NIC I have access to with these properties is the GVE with DQO support
> > running in Google Cloud, but any NIC that supports these features would suffice.
> > I may be able to help reviewers bring up devmem TCP on their NICs.
> >
> > * Testing:
> >
> > The series includes a udmabuf kselftest that show a simple use case of
> > devmem TCP and validates the entire data path end to end without
> > a dependency on a specific dmabuf provider.
> >
> > ** Test Setup
> >
> > Kernel: net-next with this RFC and memory provider API cherry-picked
> > locally.
> >
> > Hardware: Google Cloud A3 VMs.
> >
> > NIC: GVE with header split & RSS & flow steering support.
>
> This set seems to depend on Jakub's memory provider patches and a netdev
> driver change which is not included. For the testing mentioned here, you
> must have a tree + branch with all of the patches. Is it publicly available?
>

Yes, the net-next based branch is right here:
https://github.com/mina/linux/tree/tcpdevmem

Here is the git log of that branch:
https://github.com/mina/linux/commits/tcpdevmem

FWIW, it's already linked from the (long) cover letter, at the end of
the '* Patch overview:' section.

The branch includes everything you mentioned above. The netdev driver I'm
using is GVE. It also includes patches to implement header split &
flow steering for GVE (being upstreamed separately), and some debug
changes.

> It would be interesting to see how well (easy) this integrates with
> io_uring. Besides avoiding all of the syscalls for receiving the iov and
> releasing the buffers back to the pool, io_uring also brings in the
> ability to seed a page_pool with registered buffers which provides a
> means to get simpler Rx ZC for host memory.
>
> Overall I like the intent and possibilities for extensions, but a lot of
> details are missing - perhaps some are answered by seeing an end-to-end
> implementation.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-14  1:10   ` David Ahern
@ 2023-08-14  3:15     ` Mina Almasry
  2023-08-16  0:16     ` Jakub Kicinski
  1 sibling, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-14  3:15 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Sun, Aug 13, 2023 at 6:10 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/9/23 7:57 PM, Mina Almasry wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 8e7d0cb540cd..02a25ccf771a 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -151,6 +151,8 @@
> >  #include <linux/pm_runtime.h>
> >  #include <linux/prandom.h>
> >  #include <linux/once_lite.h>
> > +#include <linux/genalloc.h>
> > +#include <linux/dma-buf.h>
> >
> >  #include "dev.h"
> >  #include "net-sysfs.h"
> > @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
> >       return call_netdevice_notifiers_info(val, &info.info);
> >  }
> >
> > +/* Device memory support */
> > +
> > +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> > +                                        struct gen_pool_chunk *chunk,
> > +                                        void *not_used)
> > +{
> > +     struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> > +
> > +     kvfree(owner->ppiovs);
> > +     kfree(owner);
> > +}
> > +
> > +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> > +{
> > +     size_t size, avail;
> > +
> > +     gen_pool_for_each_chunk(binding->chunk_pool,
> > +                             netdev_devmem_free_chunk_owner, NULL);
> > +
> > +     size = gen_pool_size(binding->chunk_pool);
> > +     avail = gen_pool_avail(binding->chunk_pool);
> > +
> > +     if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> > +               size, avail))
> > +             gen_pool_destroy(binding->chunk_pool);
> > +
> > +     dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> > +                              DMA_BIDIRECTIONAL);
> > +     dma_buf_detach(binding->dmabuf, binding->attachment);
> > +     dma_buf_put(binding->dmabuf);
> > +     kfree(binding);
> > +}
> > +
> > +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> > +{
> > +     struct netdev_rx_queue *rxq;
> > +     unsigned long xa_idx;
> > +
> > +     list_del_rcu(&binding->list);
> > +
> > +     xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> > +             if (rxq->binding == binding)
> > +                     rxq->binding = NULL;
> > +
> > +     netdev_devmem_binding_put(binding);
>
> This does a put on the binding but it does not notify the driver that
> that the dmabuf references need to be flushed from the rx queue.
>

Correct, FWIW this is called out in the commit message of this patch,
and is a general issue with all memory providers and not really
specific to the memory provider added for devmem TCP. Jakub described
the issue in the cover letter of the memory provider proposal:
https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/

For now the selftest triggers a driver reset after bind & unbind for
the configuration to take effect. I think the right thing to do is for a
generic solution to be applied to the general memory provider proposal,
and devmem TCP will follow that.

One way to resolve this could be to trigger an ethtool_ops->reset() call
on any memory provider configuration, which would recreate the queues as
part of the reset. Another option is adding a new API (ethtool op or
otherwise) that would only recreate the queues (or a specific queue).
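
As a rough sketch only (not part of this series), the first option could
be as small as forcing a full function-level reset through the existing
ethtool reset op, assuming the driver implements it and that a full reset
is acceptable:

static int netdev_devmem_recreate_queues(struct net_device *dev)
{
	u32 reset_flags = ETH_RESET_ALL;

	if (!dev->ethtool_ops || !dev->ethtool_ops->reset)
		return -EOPNOTSUPP;

	/* driver tears down and recreates its queues, picking up any
	 * new dma-buf binding / memory provider configuration
	 */
	return dev->ethtool_ops->reset(dev, &reset_flags);
}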

> Also, what about the device getting deleted - e.g., the driver is removed?
>

Good point, I don't think I'm handling that correctly, and I'm not sure
what the solution is at the moment. It is probably not right for the
bind to do a netdev_hold(), because it doesn't make much sense for the
dma-buf binding to keep the netdev alive, I think.

So the netdev freeing probably needs to unbind from the dma-buf, and the
netlink unbind needs to not duplicate that unbind. It should be simple to
implement, I think. Thanks for catching this.
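
One possible shape for the netdev-freeing side, purely as a sketch (it
assumes the binding keeps a back-pointer to its netdev and that the
binding list is visible from here; neither is in this patch):

static int netdev_devmem_netdevice_event(struct notifier_block *nb,
					 unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netdev_dmabuf_binding *binding, *tmp;

	if (event != NETDEV_UNREGISTER)
		return NOTIFY_DONE;

	/* netdevice notifiers run under rtnl, so a plain (non-RCU) walk
	 * is fine here; unbind clears the rxq pointers and drops the
	 * binding's initial reference
	 */
	list_for_each_entry_safe(binding, tmp, &netdev_rbinding_list, list)
		if (binding->dev == dev)
			netdev_unbind_dmabuf_to_queue(binding);

	return NOTIFY_DONE;
}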

> > +}
> > +
> > +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> > +                             u32 rxq_idx, struct netdev_dmabuf_binding **out)
> > +{
> > +     struct netdev_dmabuf_binding *binding;
> > +     struct netdev_rx_queue *rxq;
> > +     struct scatterlist *sg;
> > +     struct dma_buf *dmabuf;
> > +     unsigned int sg_idx, i;
> > +     unsigned long virtual;
> > +     u32 xa_idx;
> > +     int err;
> > +
> > +     rxq = __netif_get_rx_queue(dev, rxq_idx);
> > +
> > +     if (rxq->binding)
> > +             return -EEXIST;
>
> So this proposal is limiting a binding to a single dmabuf at a time? Is
> this just for the RFC?
>

I'm only allowing 1 rx queue to be bound to 1 dma-buf, and that is a
permanent restriction, I think. It would be amazing if we could bind
multiple dma-bufs to the same rx queue and the driver could somehow
know which dma-buf this packet is destined for. Unfortunately I don't
think drivers can do this without fancy parsing of incoming traffic,
and devmem TCP is possible without such driver support - as long as we
stick to 1 dma-buf per queue.

> Also, this suggests that the Rx queue is unique to the flow.  I do not
> recall a netdev API to create H/W queues on the fly (only a passing
> comment from Kuba), so how is the H/W queue (or queue set since a
> completion queue is needed as well) created for the flow? And in turn if
> it is unique to the flow, what deletes the queue if an app does not do a
> proper cleanup? If the queue sticks around, the dmabuf references stick
> around.
>

An RX queue is unique to an application & its dma-buf, not a single
flow. It is possible for the application to bind its dma-buf to an rx
queue, then steer multiple flows to that rx queue, and receive
incoming packets from these multiple flows onto its dma-buf.

Not implemented in this POC RFC, but will be implemented in the next
version: it should also be possible for the application to bind its
dma-buf to multiple rx queues, and steer its flows to one of these rx
queues, and receive incoming packets on the dma-buf.

I'm currently not thinking along the lines of creating a new H/W queue
for each new devmem flow. Instead, an existing queue gets re-purposed
for device memory TCP by binding it to a dma-buf and configuring flow
steering & RSS to steer this dma-buf owner's flows onto this rx queue.

We could go in the direction of creating new H/W queues for each
dma-buf binding if you think there is some upside. Off the top of my
head, I think the current model fits in better with the general
memory-provider plans which configure existing queues rather than
create new ones.

> Also, if this is an app specific h/w queue, flow steering is not
> mentioned in this RFC.
>

Technically it's not an app-specific h/w queue. In theory it's also
possible for multiple applications running under the same user to
share a single dma-buf which is bound to any number of rx queues, and
for all these applications to receive incoming packets on the shared
dma-buf simultaneously.

Flow steering is mentioned as a dependency in the cover letter, but
I've largely neglected to elaborate on how the use case works
end-to-end with userspace flow steering & RSS configuration, mainly
because the APIs are flexible enough to handle many different use cases.
Sorry about that; I'll add a section on this in the next iteration.

> > +
> > +     dmabuf = dma_buf_get(dmabuf_fd);
> > +     if (IS_ERR_OR_NULL(dmabuf))
> > +             return -EBADFD;
> > +
> > +     binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> > +                            dev_to_node(&dev->dev));
> > +     if (!binding) {
> > +             err = -ENOMEM;
> > +             goto err_put_dmabuf;
> > +     }
> > +
> > +     xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> > +
> > +     refcount_set(&binding->ref, 1);
> > +
> > +     binding->dmabuf = dmabuf;
> > +
> > +     binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> > +     if (IS_ERR(binding->attachment)) {
> > +             err = PTR_ERR(binding->attachment);
> > +             goto err_free_binding;
> > +     }
> > +
> > +     binding->sgt = dma_buf_map_attachment(binding->attachment,
> > +                                           DMA_BIDIRECTIONAL);
> > +     if (IS_ERR(binding->sgt)) {
> > +             err = PTR_ERR(binding->sgt);
> > +             goto err_detach;
> > +     }
> > +
> > +     /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > +      * binding can be much more flexible than that. We may be able to
> > +      * allocate MTU sized chunks here. Leave that for future work...
> > +      */
> > +     binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> > +                                           dev_to_node(&dev->dev));
> > +     if (!binding->chunk_pool) {
> > +             err = -ENOMEM;
> > +             goto err_unmap;
> > +     }
> > +
> > +     virtual = 0;
> > +     for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> > +             dma_addr_t dma_addr = sg_dma_address(sg);
> > +             struct dmabuf_genpool_chunk_owner *owner;
> > +             size_t len = sg_dma_len(sg);
> > +             struct page_pool_iov *ppiov;
> > +
> > +             owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> > +                                  dev_to_node(&dev->dev));
> > +             owner->base_virtual = virtual;
> > +             owner->base_dma_addr = dma_addr;
> > +             owner->num_ppiovs = len / PAGE_SIZE;
> > +             owner->binding = binding;
> > +
> > +             err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> > +                                      dma_addr, len, dev_to_node(&dev->dev),
> > +                                      owner);
> > +             if (err) {
> > +                     err = -EINVAL;
> > +                     goto err_free_chunks;
> > +             }
> > +
> > +             owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> > +                                            sizeof(*owner->ppiovs),
> > +                                            GFP_KERNEL);
> > +             if (!owner->ppiovs) {
> > +                     err = -ENOMEM;
> > +                     goto err_free_chunks;
> > +             }
> > +
> > +             for (i = 0; i < owner->num_ppiovs; i++) {
> > +                     ppiov = &owner->ppiovs[i];
> > +                     ppiov->owner = owner;
> > +                     refcount_set(&ppiov->refcount, 1);
> > +             }
> > +
> > +             dma_addr += len;
> > +             virtual += len;
> > +     }
> > +
> > +     /* TODO: need to be able to bind to multiple rx queues on the same
> > +      * netdevice. The code should already be able to handle that with
> > +      * minimal changes, but the netlink API currently allows for 1 rx
> > +      * queue.
> > +      */
> > +     err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> > +                    GFP_KERNEL);
> > +     if (err)
> > +             goto err_free_chunks;
> > +
> > +     rxq->binding = binding;
> > +     *out = binding;
> > +
> > +     return 0;
> > +
> > +err_free_chunks:
> > +     gen_pool_for_each_chunk(binding->chunk_pool,
> > +                             netdev_devmem_free_chunk_owner, NULL);
> > +     gen_pool_destroy(binding->chunk_pool);
> > +err_unmap:
> > +     dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> > +                              DMA_BIDIRECTIONAL);
> > +err_detach:
> > +     dma_buf_detach(dmabuf, binding->attachment);
> > +err_free_binding:
> > +     kfree(binding);
> > +err_put_dmabuf:
> > +     dma_buf_put(dmabuf);
> > +     return err;
> > +}
> > +
> >  #ifdef CONFIG_NET_INGRESS
> >  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
> >
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index bf7324dd6c36..288ed0112995 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
> >       return NOTIFY_OK;
> >  }
> >
> > +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> > +                              void *_notify)
> > +{
> > +     struct netlink_notify *notify = _notify;
> > +     struct netdev_dmabuf_binding *rbinding;
> > +
> > +     if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> > +             return NOTIFY_DONE;
> > +
> > +     rcu_read_lock();
> > +
> > +     list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> > +             if (rbinding->owner_nlportid == notify->portid) {
> > +                     netdev_unbind_dmabuf_to_queue(rbinding);
>
> This ties the removal of a dmabuf to the close of the netlink socket as
> suggested in the previous round of comments. What happens if the process
> closes the dmabuf fd? Is the outstanding dev binding sufficient to keep
> the allocation / use in place?
>

Correct, the outstanding dev binding keeps the dma-buf & its
attachment in place until the driver no longer needs the binding and
drops the references.
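
For reference, the put side is expected to be the usual refcount pattern,
roughly like the sketch below (the helper is used earlier in the patch
but its body is not in this hunk), so the dma-buf, its attachment and the
genpool only go away once the last reference - including references held
by in-flight rx buffers - is dropped:

static inline void
netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
{
	if (refcount_dec_and_test(&binding->ref))
		__netdev_devmem_binding_free(binding);
}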

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
                   ` (12 preceding siblings ...)
  2023-08-14  1:12 ` David Ahern
@ 2023-08-15 13:38 ` David Laight
  2023-08-15 14:41   ` Willem de Bruijn
  13 siblings, 1 reply; 62+ messages in thread
From: David Laight @ 2023-08-15 13:38 UTC (permalink / raw)
  To: 'Mina Almasry', netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

From: Mina Almasry
> Sent: 10 August 2023 02:58
...
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.

Doesn't that really require peer-to-peer PCIe transfers?
IIRC these aren't supported by many root hubs and have
fundamental flow control and/or TLP credit problems.

I'd guess they are also pretty incompatible with IOMMU?

I can see how you might manage to transmit frames from
some external memory (eg after encryption) but surely
processing receive data that way needs the packets to
be filtered by both IP addresses and port numbers before
being redirected to the (presumably limited) external
memory.

OTOH isn't the kernel going to need to run code before
the packet is actually sent and just after it is received?
So all you might gain is a bit of latency?
And a bit less utilisation of host memory??
But if your system is really limited by cpu-memory bandwidth
you need more cache :-)

So how much benefit is there over efficient use of host
memory bounce buffers??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-15 13:38 ` David Laight
@ 2023-08-15 14:41   ` Willem de Bruijn
  0 siblings, 0 replies; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-15 14:41 UTC (permalink / raw)
  To: David Laight
  Cc: Mina Almasry, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On Tue, Aug 15, 2023 at 9:38 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Mina Almasry
> > Sent: 10 August 2023 02:58
> ...
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
>
> Doesn't that really require peer-to-peer PCIe transfers?
> IIRC these aren't supported by many root hubs and have
> fundamental flow control and/or TLP credit problems.
>
> I'd guess they are also pretty incompatible with IOMMU?

Yes, this is a form of PCI_P2PDMA and all the limitations of that apply.

> I can see how you might manage to transmit frames from
> some external memory (eg after encryption) but surely
> processing receive data that way needs the packets
> be filtered by both IP addresses and port numbers before
> being redirected to the (presumably limited) external
> memory.

This feature depends on NIC receive header split. The TCP/IP headers
are stored to host memory, the payload to device memory.

Optionally, on devices that do not support explicit header split but
do support scatter-gather I/O, a constant, known header size can be
used as a weak substitute. This has additional caveats wrt unexpected
traffic for which the payload must be host visible (e.g., ICMP).

> OTOH isn't the kernel going to need to run code before
> the packet is actually sent and just after it is received?
> So all you might gain is a bit of latency?
> And a bit less utilisation of host memory??
> But if your system is really limited by cpu-memory bandwidth
> you need more cache :-)
>
>
> So how much benefit is there over efficient use of host
> memory bounce buffers??

Among other things, on a PCIe tree this makes it possible to load up
machines with many NICs + GPUs.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-14  1:10   ` David Ahern
  2023-08-14  3:15     ` Mina Almasry
@ 2023-08-16  0:16     ` Jakub Kicinski
  2023-08-16 16:12       ` Willem de Bruijn
  1 sibling, 1 reply; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-16  0:16 UTC (permalink / raw)
  To: David Ahern
  Cc: Mina Almasry, netdev, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Magnus Karlsson,
	Willem de Bruijn, sdf, Willem de Bruijn, Kaiyuan Zhang

On Sun, 13 Aug 2023 19:10:35 -0600 David Ahern wrote:
> Also, this suggests that the Rx queue is unique to the flow. I do not
> recall a netdev API to create H/W queues on the fly (only a passing
> comment from Kuba), so how is the H/W queue (or queue set since a
> completion queue is needed as well) created for the flow?
> And in turn if it is unique to the flow, what deletes the queue if
> an app does not do a proper cleanup? If the queue sticks around,
> the dmabuf references stick around.

Let's start sketching out the design for queue config.
Without sliding into scope creep, hopefully.

Step one - I think we can decompose the problem into:
 A) flow steering
 B) object lifetime and permissions
 C) queue configuration (incl. potentially creating / destroying queues)

These come together into use scenarios like:
 #1 - partitioning for containers - when high perf containers share
      a machine each should get an RSS context on the physical NIC
      to have predictable traffic<>CPU placement, they may also have
      different preferences on how the queues are configured, maybe
      XDP, too?
 #2 - fancy page pools within the host (e.g. huge pages)
 #3 - very fancy page pools not within the host (Mina's work)
 #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
      on the target)
 #5 - busy polling - admittedly a bit theoretical, I don't know of
      anyone busy polling in real life, but one of the problems today
      is that setting it up requires scraping random bits of info from
      sysfs and a lot of hoping.

Flow steering (A) is there today, to a sufficient extent, I think,
so we can defer on that. Sooner or later we should probably figure
out if we want to continue down the unruly path of TC offloads or
just give up and beef up ethtool.

I don't have a good sense of what a good model for cleanup and
permissions is (B). All I know is that if we need to tie things to
processes netlink can do it, and we shouldn't have to create our
own FS and special file descriptors...

And then there's (C) which is the main part to talk about.
The first step IMHO is to straighten out the configuration process.
Currently we do:

 user -> thin ethtool API --------------------> driver
                              netdev core <---'

By "straighten" I mean more of a:

 user -> thin ethtool API ---> netdev core ---> driver

flow. This means core maintains the full expected configuration,
queue count and their parameters and driver creates those queues
as instructed.

I'd imagine we'd need 4 basic ops:
 - queue_mem_alloc(dev, cfg) -> queue_mem
 - queue_mem_free(dev, cfg, queue_mem)
 - queue_start(dev, queue info, cfg, queue_mem) -> errno
 - queue_stop(dev, queue info, cfg)
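
To make the shape of that concrete, a rough sketch of what the
driver-facing ops could look like - names, argument types and the
cfg/queue_mem structs below are placeholders, not an agreed API:

struct netdev_queue_cfg;	/* full queue config, incl. page pool params */
struct netdev_queue_mem;	/* opaque, driver-allocated queue resources */

struct netdev_queue_mgmt_ops {
	struct netdev_queue_mem *
		(*queue_mem_alloc)(struct net_device *dev,
				   const struct netdev_queue_cfg *cfg);
	void	(*queue_mem_free)(struct net_device *dev,
				  const struct netdev_queue_cfg *cfg,
				  struct netdev_queue_mem *mem);
	int	(*queue_start)(struct net_device *dev, int queue_idx,
			       const struct netdev_queue_cfg *cfg,
			       struct netdev_queue_mem *mem);
	int	(*queue_stop)(struct net_device *dev, int queue_idx,
			      const struct netdev_queue_cfg *cfg);
};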

The mem_alloc/mem_free takes care of the commonly missed requirement to
not take the datapath down until resources are allocated for new config.

Core then sets all the queues up after ndo_open, and tears down before
ndo_stop. In case of an ethtool -L / -G call or enabling / disabling XDP
core can handle the entire reconfiguration dance.

The cfg object needs to contain all queue configuration, including 
the page pool parameters.

If we have an abstract model of the configuration in the core we can
modify it much more easily, I hope. I mean - the configuration will be
somewhat detached from what's instantiated in the drivers.

I'd prefer to go as far as we can without introducing a driver callback
to "check if it can support a config change", and try to rely on
(static) capabilities instead. This allows more of the validation to
happen in the core and also lends itself naturally to exporting the
capabilities to the user.

Checking the use cases:

 #1 - partitioning for containers - storing the cfg in the core gives
      us a neat ability to allow users to set the configuration on RSS
      context
 #2, #3 - page pools - we can make page_pool_create take cfg and read whatever
      params we want from there, memory provider, descriptor count, recycling
      ring size etc. Also for header-data-split we may want different settings
      per queue so again cfg comes in handy
 #4 - XDP redirect target - we should spawn XDP TX queues independently from
      the XDP configuration

That's all I have thought up in terms of direction.
Does that make sense? What are the main gaps? Other proposals?
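
For #2/#3 above, "page_pool_create takes cfg" could look roughly like the
sketch below; the memory provider fields are assumptions in the spirit of
the pp-providers branch, not its actual API, and cfg is the hypothetical
core-owned queue configuration:

static struct page_pool *
queue_create_page_pool(struct net_device *dev,
		       const struct netdev_queue_cfg *cfg)
{
	struct page_pool_params pp_params = {
		/* existing page_pool_params fields */
		.order		= 0,
		.pool_size	= cfg->rx_ring_size,
		.nid		= dev_to_node(&dev->dev),
		.dev		= dev->dev.parent,
		.dma_dir	= DMA_BIDIRECTIONAL,
		/* assumed new knobs carried by the core-owned cfg */
		.memory_provider = cfg->memory_provider,
		.mp_priv	 = cfg->mp_priv,
	};

	return page_pool_create(&pp_params);
}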

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-16  0:16     ` Jakub Kicinski
@ 2023-08-16 16:12       ` Willem de Bruijn
  2023-08-18  1:33         ` David Ahern
  0 siblings, 1 reply; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-16 16:12 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: Mina Almasry, netdev, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Magnus Karlsson,
	Willem de Bruijn, sdf, Willem de Bruijn, Kaiyuan Zhang

Jakub Kicinski wrote:
> On Sun, 13 Aug 2023 19:10:35 -0600 David Ahern wrote:
> > Also, this suggests that the Rx queue is unique to the flow. I do not
> > recall a netdev API to create H/W queues on the fly (only a passing
> > comment from Kuba), so how is the H/W queue (or queue set since a
> > completion queue is needed as well) created for the flow?
> > And in turn if it is unique to the flow, what deletes the queue if
> > an app does not do a proper cleanup? If the queue sticks around,
> > the dmabuf references stick around.
> 
> Let's start sketching out the design for queue config.
> Without sliding into scope creep, hopefully.
> 
> Step one - I think we can decompose the problem into:
>  A) flow steering
>  B) object lifetime and permissions
>  C) queue configuration (incl. potentially creating / destroying queues)
> 
> These come together into use scenarios like:
>  #1 - partitioning for containers - when high perf containers share
>       a machine each should get an RSS context on the physical NIC
>       to have predictable traffic<>CPU placement, they may also have
>       different preferences on how the queues are configured, maybe
>       XDP, too?
>  #2 - fancy page pools within the host (e.g. huge pages)
>  #3 - very fancy page pools not within the host (Mina's work)
>  #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
>       on the target)
>  #5 - busy polling - admittedly a bit theoretical, I don't know of
>       anyone busy polling in real life, but one of the problems today
>       is that setting it up requires scraping random bits of info from
>       sysfs and a lot of hoping.
> 
> Flow steering (A) is there today, to a sufficient extent, I think,
> so we can defer on that. Sooner or later we should probably figure
> out if we want to continue down the unruly path of TC offloads or
> just give up and beef up ethtool.
> 
> I don't have a good sense of what a good model for cleanup and
> permissions is (B). All I know is that if we need to tie things to
> processes netlink can do it, and we shouldn't have to create our
> own FS and special file descriptors...
> 
> And then there's (C) which is the main part to talk about.
> The first step IMHO is to straighten out the configuration process.
> Currently we do:
> 
>  user -> thin ethtool API --------------------> driver
>                               netdev core <---'
> 
> By "straighten" I mean more of a:
> 
>  user -> thin ethtool API ---> netdev core ---> driver
> 
> flow. This means core maintains the full expected configuration,
> queue count and their parameters and driver creates those queues
> as instructed.
> 
> I'd imagine we'd need 4 basic ops:
>  - queue_mem_alloc(dev, cfg) -> queue_mem
>  - queue_mem_free(dev, cfg, queue_mem)
>  - queue_start(dev, queue info, cfg, queue_mem) -> errno
>  - queue_stop(dev, queue info, cfg)
> 
> The mem_alloc/mem_free takes care of the commonly missed requirement to
> not take the datapath down until resources are allocated for new config.
> 
> Core then sets all the queues up after ndo_open, and tears down before
> ndo_stop. In case of an ethtool -L / -G call or enabling / disabling XDP
> core can handle the entire reconfiguration dance.
> 
> The cfg object needs to contain all queue configuration, including 
> the page pool parameters.
> 
> If we have an abstract model of the configuration in the core we can
> modify it much more easily, I hope. I mean - the configuration will be
> somewhat detached from what's instantiated in the drivers.
> 
> I'd prefer to go as far as we can without introducing a driver callback
> to "check if it can support a config change", and try to rely on
> (static) capabilities instead. This allows more of the validation to
> happen in the core and also lends itself naturally to exporting the
> capabilities to the user.
> 
> Checking the use cases:
> 
>  #1 - partitioning for containers - storing the cfg in the core gives
>       us a neat ability to allow users to set the configuration on RSS
>       context
>  #2, #3 - page pools - we can make page_pool_create take cfg and read whatever
>       params we want from there, memory provider, descriptor count, recycling
>       ring size etc. Also for header-data-split we may want different settings
>       per queue so again cfg comes in handy
>  #4 - XDP redirect target - we should spawn XDP TX queues independently from
>       the XDP configuration
> 
> That's all I have thought up in terms of direction.
> Does that make sense? What are the main gaps? Other proposals?

More on (A) and (B):

I expect most use cases match the containerization that you mention,
where a privileged process handles configuration.

For that, the existing interfaces of ethtool -G/-L/-N/-K/-X suffice.

A more far-out approach could infer the ntuple 5-tuple connection or
3-tuple listener rule from a socket itself, no ethtool required. But
let's ignore that for now.

Currently we need to use ethtool -X to restrict the RSS indirection
table to a subset of queues. It is not strictly necessary to
reconfigure the device on each new container, if a sufficient set of
non-RSS queues is pre-allocated.

Then only ethtool -N is needed to drive data towards one of the
non-RSS queues. Or one of the non context 0 RSS contexts if that is
used.

The main part that is missing is memory allocation. Memory is stranded
on unused queues, and there is no explicit support for special
allocators.

A poor man's solution might be to load a ring with minimal sized
buffers (assuming devices accept that, say a zero length buffer),
attach a memory provider before inserting an ntuple rule, and refill
from the memory provider. This requires accepting that a whole ring of
packets is lost before refilled slots get filled..

(I'm messing with that with AF_XDP right now: a process that xsk_binds
 before filling the FILL queue..)

Ideally, we would have a way to reconfigure a single queue, without
having to down/up the entire device..
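
As a sketch of what that single-queue path could look like on top of the
four ops above (none of this exists today; the ops placement on
net_device and all names are hypothetical):

static int netdev_restart_rx_queue(struct net_device *dev, int idx,
				   const struct netdev_queue_cfg *cfg)
{
	const struct netdev_queue_mgmt_ops *ops = dev->queue_mgmt_ops;
	struct netdev_queue_mem *mem;
	int err;

	/* allocate resources for the new config before stopping the queue */
	mem = ops->queue_mem_alloc(dev, cfg);
	if (!mem)
		return -ENOMEM;

	err = ops->queue_stop(dev, idx, cfg);
	if (err)
		goto err_free_mem;

	err = ops->queue_start(dev, idx, cfg, mem);
	if (err)
		goto err_free_mem;	/* queue stays down; caller must recover */

	return 0;

err_free_mem:
	ops->queue_mem_free(dev, cfg, mem);
	return err;
}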

I don't know if the kernel needs an explicit abstract model, or can
leave that to the userspace privileged daemon that presses the ethtool
buttons.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-14  1:12 ` David Ahern
  2023-08-14  2:11   ` Mina Almasry
@ 2023-08-17 18:00   ` Pavel Begunkov
  2023-08-17 22:18     ` Mina Almasry
  1 sibling, 1 reply; 62+ messages in thread
From: Pavel Begunkov @ 2023-08-17 18:00 UTC (permalink / raw)
  To: David Ahern, Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf, David Wei

On 8/14/23 02:12, David Ahern wrote:
> On 8/9/23 7:57 PM, Mina Almasry wrote:
>> Changes in RFC v2:
>> ------------------
...
>> ** Test Setup
>>
>> Kernel: net-next with this RFC and memory provider API cherry-picked
>> locally.
>>
>> Hardware: Google Cloud A3 VMs.
>>
>> NIC: GVE with header split & RSS & flow steering support.
> 
> This set seems to depend on Jakub's memory provider patches and a netdev
> driver change which is not included. For the testing mentioned here, you
> must have a tree + branch with all of the patches. Is it publicly available?
> 
> It would be interesting to see how well (easy) this integrates with
> io_uring. Besides avoiding all of the syscalls for receiving the iov and
> releasing the buffers back to the pool, io_uring also brings in the
> ability to seed a page_pool with registered buffers which provides a
> means to get simpler Rx ZC for host memory.

The patchset sounds pretty interesting. I've been working with David Wei
(CC'ing) on io_uring zc rx (at the prototype polishing stage), which
likewise follows older, similar approaches based on allocating an rx
queue. It targets host memory, with device memory as an extra feature;
the uapi is different, and lifetimes are managed/bound to io_uring.
Completions/buffers are returned to the user via a separate queue instead
of cmsg, and pushed back granularly to the kernel via another queue. I'll
leave it to David to elaborate.

It sounds like we have space for collaboration here, if not merging then
reusing internals as much as we can, but we'd need to look into the
details deeper.

> Overall I like the intent and possibilities for extensions, but a lot of
> details are missing - perhaps some are answered by seeing an end-to-end
> implementation.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-17 18:00   ` Pavel Begunkov
@ 2023-08-17 22:18     ` Mina Almasry
  2023-08-23 22:52       ` David Wei
  0 siblings, 1 reply; 62+ messages in thread
From: Mina Almasry @ 2023-08-17 22:18 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Ahern, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf, David Wei

On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 8/14/23 02:12, David Ahern wrote:
> > On 8/9/23 7:57 PM, Mina Almasry wrote:
> >> Changes in RFC v2:
> >> ------------------
> ...
> >> ** Test Setup
> >>
> >> Kernel: net-next with this RFC and memory provider API cherry-picked
> >> locally.
> >>
> >> Hardware: Google Cloud A3 VMs.
> >>
> >> NIC: GVE with header split & RSS & flow steering support.
> >
> > This set seems to depend on Jakub's memory provider patches and a netdev
> > driver change which is not included. For the testing mentioned here, you
> > must have a tree + branch with all of the patches. Is it publicly available?
> >
> > It would be interesting to see how well (easy) this integrates with
> > io_uring. Besides avoiding all of the syscalls for receiving the iov and
> > releasing the buffers back to the pool, io_uring also brings in the
> > ability to seed a page_pool with registered buffers which provides a
> > means to get simpler Rx ZC for host memory.
>
> The patchset sounds pretty interesting. I've been working with David Wei
> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old
> similar approaches based on allocating an rx queue. It targets host
> memory and device memory as an extra feature, uapi is different, lifetimes
> are managed/bound to io_uring. Completions/buffers are returned to user via
> a separate queue instead of cmsg, and pushed back granularly to the kernel
> via another queue. I'll leave it to David to elaborate
>
> It sounds like we have space for collaboration here, if not merging then
> reusing internals as much as we can, but we'd need to look into the
> details deeper.
>

I'm happy to look at your implementation and collaborate on something
that works for both use cases. Feel free to share the unpolished
prototype if possible, so I can start forming a general idea of it.

> > Overall I like the intent and possibilities for extensions, but a lot of
> > details are missing - perhaps some are answered by seeing an end-to-end
> > implementation.
>
> --
> Pavel Begunkov



-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-16 16:12       ` Willem de Bruijn
@ 2023-08-18  1:33         ` David Ahern
  2023-08-18  2:09           ` Jakub Kicinski
  0 siblings, 1 reply; 62+ messages in thread
From: David Ahern @ 2023-08-18  1:33 UTC (permalink / raw)
  To: Willem de Bruijn, Jakub Kicinski
  Cc: Mina Almasry, netdev, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Magnus Karlsson, sdf,
	Willem de Bruijn, Kaiyuan Zhang

[ sorry for the delayed response; very busy 2 days ]

On 8/16/23 10:12 AM, Willem de Bruijn wrote:
> Jakub Kicinski wrote:
>> On Sun, 13 Aug 2023 19:10:35 -0600 David Ahern wrote:
>>> Also, this suggests that the Rx queue is unique to the flow. I do not
>>> recall a netdev API to create H/W queues on the fly (only a passing
>>> comment from Kuba), so how is the H/W queue (or queue set since a
>>> completion queue is needed as well) created for the flow?
>>> And in turn if it is unique to the flow, what deletes the queue if
>>> an app does not do a proper cleanup? If the queue sticks around,
>>> the dmabuf references stick around.
>>
>> Let's start sketching out the design for queue config.
>> Without sliding into scope creep, hopefully.
>>
>> Step one - I think we can decompose the problem into:
>>  A) flow steering
>>  B) object lifetime and permissions
>>  C) queue configuration (incl. potentially creating / destroying queues)
>>
>> These come together into use scenarios like:
>>  #1 - partitioning for containers - when high perf containers share
>>       a machine each should get an RSS context on the physical NIC
>>       to have predictable traffic<>CPU placement, they may also have
>>       different preferences on how the queues are configured, maybe
>>       XDP, too?

subfunctions are a more effective and simpler solution for containers, no?

>>  #2 - fancy page pools within the host (e.g. huge pages)
>>  #3 - very fancy page pools not within the host (Mina's work)
>>  #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
>>       on the target)
>>  #5 - busy polling - admittedly a bit theoretical, I don't know of
>>       anyone busy polling in real life, but one of the problems today
>>       is that setting it up requires scraping random bits of info from
>>       sysfs and a lot of hoping.
>>
>> Flow steering (A) is there today, to a sufficient extent, I think,
>> so we can defer on that. Sooner or later we should probably figure
>> out if we want to continue down the unruly path of TC offloads or
>> just give up and beef up ethtool.

Flow steering to TC offloads -- more details on what you were thinking here?

>>
>> I don't have a good sense of what a good model for cleanup and
>> permissions is (B). All I know is that if we need to tie things to
>> processes netlink can do it, and we shouldn't have to create our
>> own FS and special file descriptors...

From my perspective the main sticking point that has not been handled is
flushing buffers from the RxQ, but that is 100% tied to queue
management and a process' ability to effect a flush or queue tear down -
and that is the focus of your list below:

>>
>> And then there's (C) which is the main part to talk about.
>> The first step IMHO is to straighten out the configuration process.
>> Currently we do:
>>
>>  user -> thin ethtool API --------------------> driver
>>                               netdev core <---'
>>
>> By "straighten" I mean more of a:
>>
>>  user -> thin ethtool API ---> netdev core ---> driver
>>
>> flow. This means core maintains the full expected configuration,
>> queue count and their parameters and driver creates those queues
>> as instructed.
>>
>> I'd imagine we'd need 4 basic ops:
>>  - queue_mem_alloc(dev, cfg) -> queue_mem
>>  - queue_mem_free(dev, cfg, queue_mem)
>>  - queue_start(dev, queue info, cfg, queue_mem) -> errno
>>  - queue_stop(dev, queue info, cfg)
>>
>> The mem_alloc/mem_free takes care of the commonly missed requirement to
>> not take the datapath down until resources are allocated for new config.

sounds reasonable.

>>
>> Core then sets all the queues up after ndo_open, and tears down before
>> ndo_stop. In case of an ethtool -L / -G call or enabling / disabling XDP
>> core can handle the entire reconfiguration dance.

`ethtool -L/-G` and `ip link set {up/down}` pertain to the "general OS"
queues managed by a driver for generic workloads and networking
management (e.g., neigh discovery, icmp, etc). The discussion here
pertains to processes wanting to use their own memory or GPU memory in a
queue. Processes will come and go and the queue management needs to
align with that need without affecting all of the other queues managed
by the driver.


>>
>> The cfg object needs to contain all queue configuration, including 
>> the page pool parameters.
>>
>> If we have an abstract model of the configuration in the core we can
>> modify it much more easily, I hope. I mean - the configuration will be
>> somewhat detached from what's instantiated in the drivers.
>>
>> I'd prefer to go as far as we can without introducing a driver callback
>> to "check if it can support a config change", and try to rely on
>> (static) capabilities instead. This allows more of the validation to
>> happen in the core and also lends itself naturally to exporting the
>> capabilities to the user.
>>
>> Checking the use cases:
>>
>>  #1 - partitioning for containers - storing the cfg in the core gives
>>       us a neat ability to allow users to set the configuration on RSS
>>       context
>>  #2, #3 - page pools - we can make page_pool_create take cfg and read whatever
>>       params we want from there, memory provider, descriptor count, recycling
>>       ring size etc. Also for header-data-split we may want different settings
>>       per queue so again cfg comes in handy
>>  #4 - XDP redirect target - we should spawn XDP TX queues independently from
>>       the XDP configuration
>>
>> That's all I have thought up in terms of direction.
>> Does that make sense? What are the main gaps? Other proposals?
> 
> More on (A) and (B):
> 
> I expect most use cases match the containerization that you mention.
> Where a privileged process handles configuration.
> 
> For that, the existing interfaces of ethtool -G/-L-/N/-K/-X suffice.
> 
> A more far-out approach could infer the ntuple 5-tuple connection or
> 3-tuple listener rule from a socket itself, no ethtool required. But
> let's ignore that for now.
> 
> Currently we need to use ethtool -X to restrict the RSS indirection
> table to a subset of queues. It is not strictly necessary to
> reconfigure the device on each new container, if pre-allocation a
> sufficient set of non-RSS queues.

This is an interesting approach. The scheme here is along the lines of:
you have N CPUs in the server, so N queue sets (or channels). The
indirection table means M queue sets are used for RSS, leaving N-M queues
for flows with "fancy memory providers". Such a model can work, but it is
quite passive, needs careful orchestration, and has a lot of moving,
disjointed pieces - with some race conditions around setup vs the first
data packet arriving.

I was thinking about a more generic design where H/W queues are created
and destroyed - e.g., queues unique to a process which makes the cleanup
so much easier.

> 
> Then only ethtool -N is needed to drive data towards one of the
> non-RSS queues. Or one of the non context 0 RSS contexts if that is
> used.
> 
> The main part that is missing is memory allocation. Memory is stranded
> on unused queues, and there is no explicit support for special
> allocators.
> 
> A poor man's solution might be to load a ring with minimal sized
> buffers (assuming devices accept that, say a zero length buffer),
> attach a memory provider before inserting an ntuple rule, and refill
> from the memory provider. This requires accepting that a whole ring of
> packets is lost before refilled slots get filled..
> 
> (I'm messing with that with AF_XDP right now: a process that xsk_binds
>  before filling the FILL queue..)
> 
> Ideally, we would have a way to reconfigure a single queue, without
> having to down/up the entire device..
> 
> I don't know if the kernel needs an explicit abstract model, or can
> leave that to the userspace privileged daemon that presses the ethtool
> buttons.

The kernel has that in the IB verbs S/W APIs. Yes, I realize that
comment amounts to profanity on this mailing list, but it should not be.
There are existing APIs for creating, managing and destroying queues -
open source, GPL'ed, *software* APIs that are open for all to use.

That said, I have no religion here. If the netdev stack wants new APIs
to manage queues - including supplying buffers - drivers will have APIs
that can be adapted to some new ndo to create, configure, and destroy
queues. The ethtool API can be updated to manage that. Ultimately I
believe anything short of dynamic queue management will be a band-aid
approach that will have a lot of problems.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-18  1:33         ` David Ahern
@ 2023-08-18  2:09           ` Jakub Kicinski
  2023-08-18  2:21             ` David Ahern
  2023-08-18 21:52             ` Mina Almasry
  0 siblings, 2 replies; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-18  2:09 UTC (permalink / raw)
  To: David Ahern
  Cc: Willem de Bruijn, Mina Almasry, netdev, Eric Dumazet,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Magnus Karlsson, sdf, Willem de Bruijn, Kaiyuan Zhang

On Thu, 17 Aug 2023 19:33:47 -0600 David Ahern wrote:
> [ sorry for the delayed response; very busy 2 days ]

Tell me about it :)

> On 8/16/23 10:12 AM, Willem de Bruijn wrote:
> > Jakub Kicinski wrote:  
> >> Let's start sketching out the design for queue config.
> >> Without sliding into scope creep, hopefully.
> >>
> >> Step one - I think we can decompose the problem into:
> >>  A) flow steering
> >>  B) object lifetime and permissions
> >>  C) queue configuration (incl. potentially creating / destroying queues)
> >>
> >> These come together into use scenarios like:
> >>  #1 - partitioning for containers - when high perf containers share
> >>       a machine each should get an RSS context on the physical NIC
> >>       to have predictable traffic<>CPU placement, they may also have
> >>       different preferences on how the queues are configured, maybe
> >>       XDP, too?  
> 
> subfunctions are a more effective and simpler solution for containers, no?

Maybe. Subfunctions offload a lot; let's not go too far into the weeds
on production and flexibility considerations, but they wouldn't be my
first choice.

> >>  #2 - fancy page pools within the host (e.g. huge pages)
> >>  #3 - very fancy page pools not within the host (Mina's work)
> >>  #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
> >>       on the target)
> >>  #5 - busy polling - admittedly a bit theoretical, I don't know of
> >>       anyone busy polling in real life, but one of the problems today
> >>       is that setting it up requires scraping random bits of info from
> >>       sysfs and a lot of hoping.
> >>
> >> Flow steering (A) is there today, to a sufficient extent, I think,
> >> so we can defer on that. Sooner or later we should probably figure
> >> out if we want to continue down the unruly path of TC offloads or
> >> just give up and beef up ethtool.  
> 
> Flow steering to TC offloads -- more details on what you were thinking here?

I think TC flower can do almost everything ethtool -N can.
So do we continue to develop for both APIs, or pick one?

> >> I don't have a good sense of what a good model for cleanup and
> >> permissions is (B). All I know is that if we need to tie things to
> >> processes netlink can do it, and we shouldn't have to create our
> >> own FS and special file descriptors...  
> 
> From my perspective the main sticking point that has not been handled is
> flushing buffers from the RxQ, but there is 100% tied to queue
> management and a process' ability to effect a flush or queue tear down -
> and that is the focus of your list below:

If you're thinking about it from the perspective of "application died
give me back all the buffers" - the RxQ is just one piece, right?
As we discovered with page pool - packets may get stuck in the stack
forever.

> >> And then there's (C) which is the main part to talk about.
> >> The first step IMHO is to straighten out the configuration process.
> >> Currently we do:
> >>
> >>  user -> thin ethtool API --------------------> driver
> >>                               netdev core <---'
> >>
> >> By "straighten" I mean more of a:
> >>
> >>  user -> thin ethtool API ---> netdev core ---> driver
> >>
> >> flow. This means core maintains the full expected configuration,
> >> queue count and their parameters and driver creates those queues
> >> as instructed.
> >>
> >> I'd imagine we'd need 4 basic ops:
> >>  - queue_mem_alloc(dev, cfg) -> queue_mem
> >>  - queue_mem_free(dev, cfg, queue_mem)
> >>  - queue_start(dev, queue info, cfg, queue_mem) -> errno
> >>  - queue_stop(dev, queue info, cfg)
> >>
> >> The mem_alloc/mem_free takes care of the commonly missed requirement to
> >> not take the datapath down until resources are allocated for new config.  
> 
> sounds reasonable.
> 
> >>
> >> Core then sets all the queues up after ndo_open, and tears down before
> >> ndo_stop. In case of an ethtool -L / -G call or enabling / disabling XDP
> >> core can handle the entire reconfiguration dance.  
> 
> `ethtool -L/-G` and `ip link set {up/down}` pertain to the "general OS"
> queues managed by a driver for generic workloads and networking
> management (e.g., neigh discovery, icmp, etc). The discussions here
> pertains to processes wanting to use their own memory or GPU memory in a
> queue. Processes will come and go and the queue management needs to
> align with that need without affecting all of the other queues managed
> by the driver.

For sure. I'm just saying that the old uAPI can be translated to
the new driver API, and so should the new uAPIs. I focused on the
driver-facing APIs because I think that's the hard part. We have
many drivers; the uAPI is more easily dreamed up, no?

> >> The cfg object needs to contain all queue configuration, including 
> >> the page pool parameters.
> >>
> >> If we have an abstract model of the configuration in the core we can
> >> modify it much more easily, I hope. I mean - the configuration will be
> >> somewhat detached from what's instantiated in the drivers.
> >>
> >> I'd prefer to go as far as we can without introducing a driver callback
> >> to "check if it can support a config change", and try to rely on
> >> (static) capabilities instead. This allows more of the validation to
> >> happen in the core and also lends itself naturally to exporting the
> >> capabilities to the user.
> >>
> >> Checking the use cases:
> >>
> >>  #1 - partitioning for containers - storing the cfg in the core gives
> >>       us a neat ability to allow users to set the configuration on RSS
> >>       context
> >>  #2, #3 - page pools - we can make page_pool_create take cfg and read whatever
> >>       params we want from there, memory provider, descriptor count, recycling
> >>       ring size etc. Also for header-data-split we may want different settings
> >>       per queue so again cfg comes in handy
> >>  #4 - XDP redirect target - we should spawn XDP TX queues independently from
> >>       the XDP configuration
> >>
> >> That's all I have thought up in terms of direction.
> >> Does that make sense? What are the main gaps? Other proposals?  
> > 
> > More on (A) and (B):
> > 
> > I expect most use cases match the containerization that you mention.
> > Where a privileged process handles configuration.
> > 
> > For that, the existing interfaces of ethtool -G/-L/-N/-K/-X suffice.
> > 
> > A more far-out approach could infer the ntuple 5-tuple connection or
> > 3-tuple listener rule from a socket itself, no ethtool required. But
> > let's ignore that for now.
> > 
> > Currently we need to use ethtool -X to restrict the RSS indirection
> > table to a subset of queues. It is not strictly necessary to
> > reconfigure the device on each new container, if pre-allocating a
> > sufficient set of non-RSS queues.  
> 
> This is an interesting approach: This scheme here is along the lines of
> you have N cpus in the server, so N queue sets (or channels). The
> indirection table means M queue sets are used for RSS leaving N-M queues
> for flows with "fancy memory providers". Such a model can work but it is
> quite passive, needs careful orchestration and has a lot of moving,
> disjointed pieces - with some race conditions around setup vs first data
> packet arriving.
> 
> I was thinking about a more generic design where H/W queues are created
> and destroyed - e.g., queues unique to a process which makes the cleanup
> so much easier.

FWIW what Willem describes is what we were telling people to do for
AF_XDP for however many years it existed.

> > Then only ethtool -N is needed to drive data towards one of the
> > non-RSS queues. Or one of the non context 0 RSS contexts if that is
> > used.
> > 
> > The main part that is missing is memory allocation. Memory is stranded
> > on unused queues, and there is no explicit support for special
> > allocators.
> > 
> > A poor man's solution might be to load a ring with minimal sized
> > buffers (assuming devices accept that, say a zero length buffer),
> > attach a memory provider before inserting an ntuple rule, and refill
> > from the memory provider. This requires accepting that a whole ring of
> > packets is lost before refilled slots get filled..
> > 
> > (I'm messing with that with AF_XDP right now: a process that xsk_binds
> >  before filling the FILL queue..)
> > 
> > Ideally, we would have a way to reconfigure a single queue, without
> > having to down/up the entire device..
> > 
> > I don't know if the kernel needs an explicit abstract model, or can
> > leave that to the userspace privileged daemon that presses the ethtool
> > buttons.  
> 
> The kernel has that in the IB verbs S/W APIs. Yes, I realize that
> comment amounts to profanity on this mailing list, but it should not be.
> There are existing APIs for creating, managing and destroying queues -
> open source, GPL'ed, *software* APIs that are open for all to use.
> 
> That said, I have no religion here. If the netdev stack wants new APIs
> to manage queues - including supplying buffers - drivers will have APIs
> that can be adapted to some new ndo to create, configure, and destroy
> queues. The ethtool API can be updated to manage that. Ultimately I
> believe anything short of dynamic queue management will be a band-aid
> approach that will have a lot of problems.

No religion here either, but the APIs we're talking about are not
particularly complex. Having started hacking things together
with page pools, huge pages, RSS, etc. - IMHO the reuse and convergence
would be very superficial.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-18  2:09           ` Jakub Kicinski
@ 2023-08-18  2:21             ` David Ahern
  2023-08-18 21:52             ` Mina Almasry
  1 sibling, 0 replies; 62+ messages in thread
From: David Ahern @ 2023-08-18  2:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Willem de Bruijn, Mina Almasry, netdev, Eric Dumazet,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Magnus Karlsson, sdf, Willem de Bruijn, Kaiyuan Zhang

On 8/17/23 8:09 PM, Jakub Kicinski wrote:
>>
>> Flow steering to TC offloads -- more details on what you were thinking here?
> 
> I think TC flower can do almost everything ethtool -N can.
> So do we continue to develop for both APIs or pick one?

ok, tc flower; that did not come to mind. Don't use it often.

> 
>>>> I don't have a good sense of what a good model for cleanup and
>>>> permissions is (B). All I know is that if we need to tie things to
>>>> processes netlink can do it, and we shouldn't have to create our
>>>> own FS and special file descriptors...  
>>
>> From my perspective the main sticking point that has not been handled is
>> flushing buffers from the RxQ, but that is 100% tied to queue
>> management and a process' ability to effect a flush or queue tear down -
>> and that is the focus of your list below:
> 
> If you're thinking about it from the perspective of "application died
> give me back all the buffers" - the RxQ is just one piece, right?
> As we discovered with page pool - packets may get stuck in stack for
> ever.

Yes, flushing the retransmit queue for TCP is one of those places where
buffer references can get stuck for some amount of time.

>>
>> `ethtool -L/-G` and `ip link set {up/down}` pertain to the "general OS"
>> queues managed by a driver for generic workloads and networking
>> management (e.g., neigh discovery, icmp, etc). The discussions here
>> pertains to processes wanting to use their own memory or GPU memory in a
>> queue. Processes will come and go and the queue management needs to
>> align with that need without affecting all of the other queues managed
>> by the driver.
> 
> For sure, I'm just saying that both the old uAPI can be translated to
> the new driver API, and so should the new uAPIs. I focused on the
> driver facing APIs because I think that it's the hard part. We have
> many drivers, the uAPI is more easily dreamed up, no?

sure.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-18  2:09           ` Jakub Kicinski
  2023-08-18  2:21             ` David Ahern
@ 2023-08-18 21:52             ` Mina Almasry
  2023-08-19  1:34               ` David Ahern
  1 sibling, 1 reply; 62+ messages in thread
From: Mina Almasry @ 2023-08-18 21:52 UTC (permalink / raw)
  To: Jakub Kicinski, Praveen Kaligineedi
  Cc: David Ahern, Willem de Bruijn, netdev, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Magnus Karlsson, sdf,
	Willem de Bruijn, Kaiyuan Zhang

On Thu, Aug 17, 2023 at 7:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 17 Aug 2023 19:33:47 -0600 David Ahern wrote:
> > [ sorry for the delayed response; very busy 2 days ]
>
> Tell me about it :)
>
> > On 8/16/23 10:12 AM, Willem de Bruijn wrote:
> > > Jakub Kicinski wrote:
> > >> Let's start sketching out the design for queue config.
> > >> Without sliding into scope creep, hopefully.
> > >>
> > >> Step one - I think we can decompose the problem into:
> > >>  A) flow steering
> > >>  B) object lifetime and permissions
> > >>  C) queue configuration (incl. potentially creating / destroying queues)
> > >>
> > >> These come together into use scenarios like:
> > >>  #1 - partitioning for containers - when high perf containers share
> > >>       a machine each should get an RSS context on the physical NIC
> > >>       to have predictable traffic<>CPU placement, they may also have
> > >>       different preferences on how the queues are configured, maybe
> > >>       XDP, too?
> >
> > subfunctions are a more effective and simpler solution for containers, no?
>
> Maybe, subfunctions offload a lot, let's not go too far into the weeds
> on production and flexibility considerations but they wouldn't be my
> first choice.
>
> > >>  #2 - fancy page pools within the host (e.g. huge pages)
> > >>  #3 - very fancy page pools not within the host (Mina's work)
> > >>  #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
> > >>       on the target)
> > >>  #5 - busy polling - admittedly a bit theoretical, I don't know of
> > >>       anyone busy polling in real life, but one of the problems today
> > >>       is that setting it up requires scraping random bits of info from
> > >>       sysfs and a lot of hoping.
> > >>
> > >> Flow steering (A) is there today, to a sufficient extent, I think,
> > >> so we can defer on that. Sooner or later we should probably figure
> > >> out if we want to continue down the unruly path of TC offloads or
> > >> just give up and beef up ethtool.
> >
> > Flow steering to TC offloads -- more details on what you were thinking here?
>
> I think TC flower can do almost everything ethtool -N can.
> So do we continue to develop for both APIs or pick one?
>
> > >> I don't have a good sense of what a good model for cleanup and
> > >> permissions is (B). All I know is that if we need to tie things to
> > >> processes netlink can do it, and we shouldn't have to create our
> > >> own FS and special file descriptors...
> >
> > From my perspective the main sticking point that has not been handled is
> > flushing buffers from the RxQ, but that is 100% tied to queue
> > management and a process' ability to effect a flush or queue tear down -
> > and that is the focus of your list below:
>
> If you're thinking about it from the perspective of "application died
> give me back all the buffers" - the RxQ is just one piece, right?
> As we discovered with page pool - packets may get stuck in stack for
> ever.
>
> > >> And then there's (C) which is the main part to talk about.
> > >> The first step IMHO is to straighten out the configuration process.
> > >> Currently we do:
> > >>
> > >>  user -> thin ethtool API --------------------> driver
> > >>                               netdev core <---'
> > >>
> > >> By "straighten" I mean more of a:
> > >>
> > >>  user -> thin ethtool API ---> netdev core ---> driver
> > >>
> > >> flow. This means core maintains the full expected configuration,
> > >> queue count and their parameters and driver creates those queues
> > >> as instructed.
> > >>
> > >> I'd imagine we'd need 4 basic ops:
> > >>  - queue_mem_alloc(dev, cfg) -> queue_mem
> > >>  - queue_mem_free(dev, cfg, queue_mem)
> > >>  - queue_start(dev, queue info, cfg, queue_mem) -> errno
> > >>  - queue_stop(dev, queue info, cfg)
> > >>
> > >> The mem_alloc/mem_free takes care of the commonly missed requirement to
> > >> not take the datapath down until resources are allocated for new config.
> >
> > sounds reasonable.
> >

Thanks for taking the time to review & provide suggestions. I do need
to understand the concrete changes to apply to the next revision. Here is
my understanding so far; please correct me if wrong, and sorry if I
didn't capture everything you want:

The sticking points are:
1. From David: this proposal doesn't give an application the ability
to flush an rx queue, which means that we have to rely on a driver
reset that affects all queues to refill the rx queue buffers.
2. From Jakub: the uAPI and implementation here need to be in line
with his general direction & be extensible to apply to existing use cases
like `ethtool -L/-G`, etc.

AFAIU this is what I need to do in the next version:

1. The uAPI will be changed such that it will either re-configure an
existing queue to bind it to the dma-buf, or allocate a new queue
bound to the dma-buf (not sure which is better at the moment). Either
way, the configuration will take place immediately, and not rely on an
entire driver reset to actuate the change.

2. The uAPI will be changed such that if the netlink socket is closed,
or the process dies, the rx queue will be unbound from the dma-buf or
the rx queue will be freed entirely (again, not sure which is better
at the moment). The configuration will take place immediately without
relying on a driver reset.

3. I will add 4 new net_device_ops that Jakub specified:
queue_mem_alloc/free(), and queue_start/stop().

4. The uAPI mentioned in #1 will use the new net_device_ops to
allocate or reconfigure a queue attached to the provided dma-buf.

Does this sound roughly reasonable here?

AFAICT the only technical difficulty is that I'm not sure it's
feasible for a driver to start or stop 1 rx-queue without triggering a
full driver reset. The (2) drivers I looked at both do a full reset to
change any queue configuration. I'll investigate.
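
For item 3, a minimal sketch of what the four ops could look like - the op
names follow Jakub's list above, but the signatures and the cfg/queue_mem
types are placeholders of mine, not anything agreed in this thread:

/* Sketch only: netdev_queue_cfg and netdev_queue_mem are hypothetical. */
struct net_device;
struct netdev_queue_cfg;	/* full queue config, incl. page pool params */
struct netdev_queue_mem;	/* opaque resources allocated for one queue */

struct netdev_queue_mem *
(*ndo_queue_mem_alloc)(struct net_device *dev,
		       const struct netdev_queue_cfg *cfg);
void (*ndo_queue_mem_free)(struct net_device *dev,
			   const struct netdev_queue_cfg *cfg,
			   struct netdev_queue_mem *mem);
int (*ndo_queue_start)(struct net_device *dev, int queue_idx,
		       const struct netdev_queue_cfg *cfg,
		       struct netdev_queue_mem *mem);
int (*ndo_queue_stop)(struct net_device *dev, int queue_idx,
		      const struct netdev_queue_cfg *cfg);

The idea being that mem_alloc/mem_free run while the old queue is still
live, and start/stop only swap the queue over once the new resources
already exist.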

> > >>
> > >> Core then sets all the queues up after ndo_open, and tears down before
> > >> ndo_stop. In case of an ethtool -L / -G call or enabling / disabling XDP
> > >> core can handle the entire reconfiguration dance.
> >
> > `ethtool -L/-G` and `ip link set {up/down}` pertain to the "general OS"
> > queues managed by a driver for generic workloads and networking
> > management (e.g., neigh discovery, icmp, etc). The discussions here
> > pertains to processes wanting to use their own memory or GPU memory in a
> > queue. Processes will come and go and the queue management needs to
> > align with that need without affecting all of the other queues managed
> > by the driver.
>
> For sure, I'm just saying that both the old uAPI can be translated to
> the new driver API, and so should the new uAPIs. I focused on the
> driver facing APIs because I think that it's the hard part. We have
> many drivers, the uAPI is more easily dreamed up, no?
>
> > >> The cfg object needs to contain all queue configuration, including
> > >> the page pool parameters.
> > >>
> > >> If we have an abstract model of the configuration in the core we can
> > >> modify it much more easily, I hope. I mean - the configuration will be
> > >> somewhat detached from what's instantiated in the drivers.
> > >>
> > >> I'd prefer to go as far as we can without introducing a driver callback
> > >> to "check if it can support a config change", and try to rely on
> > >> (static) capabilities instead. This allows more of the validation to
> > >> happen in the core and also lends itself naturally to exporting the
> > >> capabilities to the user.
> > >>
> > >> Checking the use cases:
> > >>
> > >>  #1 - partitioning for containers - storing the cfg in the core gives
> > >>       us a neat ability to allow users to set the configuration on RSS
> > >>       context
> > >>  #2, #3 - page pools - we can make page_pool_create take cfg and read whatever
> > >>       params we want from there, memory provider, descriptor count, recycling
> > >>       ring size etc. Also for header-data-split we may want different settings
> > >>       per queue so again cfg comes in handy
> > >>  #4 - XDP redirect target - we should spawn XDP TX queues independently from
> > >>       the XDP configuration
> > >>
> > >> That's all I have thought up in terms of direction.
> > >> Does that make sense? What are the main gaps? Other proposals?
> > >
> > > More on (A) and (B):
> > >
> > > I expect most use cases match the containerization that you mention.
> > > Where a privileged process handles configuration.
> > >
> > > For that, the existing interfaces of ethtool -G/-L/-N/-K/-X suffice.
> > >
> > > A more far-out approach could infer the ntuple 5-tuple connection or
> > > 3-tuple listener rule from a socket itself, no ethtool required. But
> > > let's ignore that for now.
> > >
> > > Currently we need to use ethtool -X to restrict the RSS indirection
> > > table to a subset of queues. It is not strictly necessary to
> > > reconfigure the device on each new container, if pre-allocating a
> > > sufficient set of non-RSS queues.
> >
> > This is an interesting approach: This scheme here is along the lines of
> > you have N cpus in the server, so N queue sets (or channels). The
> > indirection table means M queue sets are used for RSS leaving N-M queues
> > for flows with "fancy memory providers". Such a model can work but it is
> > quite passive, needs careful orchestration and has a lot of moving,
> > disjointed pieces - with some race conditions around setup vs first data
> > packet arriving.
> >
> > I was thinking about a more generic design where H/W queues are created
> > and destroyed - e.g., queues unique to a process which makes the cleanup
> > so much easier.
>
> FWIW what Willem describes is what we were telling people to do for
> AF_XDP for however many years it existed.
>
> > > Then only ethtool -N is needed to drive data towards one of the
> > > non-RSS queues. Or one of the non context 0 RSS contexts if that is
> > > used.
> > >
> > > The main part that is missing is memory allocation. Memory is stranded
> > > on unused queues, and there is no explicit support for special
> > > allocators.
> > >
> > > A poor man's solution might be to load a ring with minimal sized
> > > buffers (assuming devices accept that, say a zero length buffer),
> > > attach a memory provider before inserting an ntuple rule, and refill
> > > from the memory provider. This requires accepting that a whole ring of
> > > packets is lost before refilled slots get filled..
> > >
> > > (I'm messing with that with AF_XDP right now: a process that xsk_binds
> > >  before filling the FILL queue..)
> > >
> > > Ideally, we would have a way to reconfigure a single queue, without
> > > having to down/up the entire device..
> > >
> > > I don't know if the kernel needs an explicit abstract model, or can
> > > leave that to the userspace privileged daemon that presses the ethtool
> > > buttons.
> >
> > The kernel has that in the IB verbs S/W APIs. Yes, I realize that
> > comment amounts to profanity on this mailing list, but it should not be.
> > There are existing APIs for creating, managing and destroying queues -
> > open source, GPL'ed, *software* APIs that are open for all to use.
> >
> > That said, I have no religion here. If the netdev stack wants new APIs
> > to manage queues - including supplying buffers - drivers will have APIs
> > that can be adapted to some new ndo to create, configure, and destroy
> > queues. The ethtool API can be updated to manage that. Ultimately I
> > believe anything short of dynamic queue management will be a band-aid
> > approach that will have a lot of problems.
>
> No religion here either, but the APIs we talk about are not
> particularly complex. Having started hacking things together
> with page pools, huge pages, RSS etc - IMHO the reuse and convergence
> would be very superficial.



-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-18 21:52             ` Mina Almasry
@ 2023-08-19  1:34               ` David Ahern
  2023-08-19  2:06                 ` Jakub Kicinski
  0 siblings, 1 reply; 62+ messages in thread
From: David Ahern @ 2023-08-19  1:34 UTC (permalink / raw)
  To: Mina Almasry, Jakub Kicinski, Praveen Kaligineedi
  Cc: Willem de Bruijn, netdev, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Magnus Karlsson, sdf,
	Willem de Bruijn, Kaiyuan Zhang

On 8/18/23 3:52 PM, Mina Almasry wrote:
> The sticking points are:
> 1. From David: this proposal doesn't give an application the ability
> to flush an rx queue, which means that we have to rely on a driver
> reset that affects all queues to refill the rx queue buffers.

Generically, the design needs to be able to flush (or invalidate) all
references to the dma-buf once the process no longer "owns" it.

> 2. From Jakub: the uAPI and implementation here needs to be in line
> with his general direction & extensible to apply to existing use cases
> `ethtool -L/-G`, etc.

I think this is a bit more open-ended given the openness of the netdev
netlink API; i.e., managing a H/W queue (create, delete, stop/flush,
associate a page_pool) could be done through this API.

> 
> AFAIU this is what I need to do in the next version:
> 
> 1. The uAPI will be changed such that it will either re-configure an
> existing queue to bind it to the dma-buf, or allocate a new queue
> bound to the dma-buf (not sure which is better at the moment). Either

1. API to manage a page-pool (create, delete, update).

2. API to add and remove a dma-buf (or host memory buffer) with a
page-pool. Remove may take time to flush references pushed to hardware
so this would be asynchronous.

3. Create a queue or use an existing queue id and associate a page-pool
with it.
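
Purely as a strawman to make the three steps above concrete - none of these
commands exist today, the names are invented here - the netdev netlink
family could grow something along the lines of:

/* Hypothetical uAPI sketch only; the names are made up and simply
 * mirror steps 1-3 above. */
enum {
	NETDEV_CMD_PAGE_POOL_CREATE,		/* step 1: create a pool       */
	NETDEV_CMD_PAGE_POOL_SET,		/* step 1: update its params   */
	NETDEV_CMD_PAGE_POOL_DELETE,		/* step 1: delete it           */
	NETDEV_CMD_PAGE_POOL_ADD_BUF,		/* step 2: attach dma-buf/host */
	NETDEV_CMD_PAGE_POOL_DEL_BUF,		/* step 2: async, flushes HW   */
	NETDEV_CMD_QUEUE_BIND_PAGE_POOL,	/* step 3: bind pool to a qid  */
};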

> way, the configuration will take place immediately, and not rely on an
> entire driver reset to actuate the change.

yes

> 
> 2. The uAPI will be changed such that if the netlink socket is closed,
> or the process dies, the rx queue will be unbound from the dma-buf or
> the rx queue will be freed entirely (again, not sure which is better

I think those are separate actions. But, if the queue was created by and
referenced by a process, then closing an fd means it should be freed.

> at the moment). The configuration will take place immediately without
> relying on a driver reset.

yes on the reset.

> 
> 3. I will add 4 new net_device_ops that Jakub specified:
> queue_mem_alloc/free(), and queue_start/stop().
> 
> 4. The uAPI mentioned in #1 will use the new net_device_ops to
> allocate or reconfigure a queue attached to the provided dma-buf.
> 
> Does this sound roughly reasonable here?
> 
> AFAICT the only technical difficulty is that I'm not sure it's
> feasible for a driver to start or stop 1 rx-queue without triggering a
> full driver reset. The (2) drivers I looked at both do a full reset to
> change any queue configuration. I'll investigate.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19  1:34               ` David Ahern
@ 2023-08-19  2:06                 ` Jakub Kicinski
  2023-08-19  3:30                   ` David Ahern
  0 siblings, 1 reply; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-19  2:06 UTC (permalink / raw)
  To: David Ahern
  Cc: Mina Almasry, Praveen Kaligineedi, Willem de Bruijn, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Fri, 18 Aug 2023 19:34:32 -0600 David Ahern wrote:
> On 8/18/23 3:52 PM, Mina Almasry wrote:
> > The sticking points are:
> > 1. From David: this proposal doesn't give an application the ability
> > to flush an rx queue, which means that we have to rely on a driver
> > reset that affects all queues to refill the rx queue buffers.  
> 
> Generically, the design needs to be able to flush (or invalidate) all
> references to the dma-buf once the process no longer "owns" it.

Are we talking about the ability for the app to flush the queue
when it wants to (no idea what for)? Or auto-flush when the app crashes?

> > 2. From Jakub: the uAPI and implementation here needs to be in line
> > with his general direction & extensible to apply to existing use cases
> > `ethtool -L/-G`, etc.  
> 
> I think this is a bit more open ended given the openness of the netdev
> netlink API. i.e., managing a H/W queue (create, delete, stop / flush,
> associate a page_pool) could be done through this API.
> 
> > 
> > AFAIU this is what I need to do in the next version:
> > 
> > 1. The uAPI will be changed such that it will either re-configure an
> > existing queue to bind it to the dma-buf, or allocate a new queue
> > bound to the dma-buf (not sure which is better at the moment). Either  
> 
> 1. API to manage a page-pool (create, delete, update).

I wasn't anticipating a "create page pool" API.

I was thinking of a scheme where user space sets page pool parameters,
but the driver still creates the pool.

But I guess it is doable. More work, tho. Are there ibverbs which
can do it? lol.
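
Roughly, the split being described - user space sets the parameters, the
driver still creates the pool - could look like the sketch below. struct
netdev_queue_cfg, struct my_rxq and drv_setup_rxq_pool() are made-up names;
struct page_pool_params and page_pool_create() are the existing page pool
API.

struct netdev_queue_cfg {			/* hypothetical, filled by core */
	struct page_pool_params pp_params;	/* pool_size, nid, flags, ...   */
	/* ring sizes, header-data split, etc. would also live here */
};

struct my_rxq {					/* hypothetical per-queue state */
	struct page_pool *page_pool;
};

/* Driver side: still the owner of the pool, but the knobs come from user space. */
static int drv_setup_rxq_pool(struct my_rxq *rxq,
			      const struct netdev_queue_cfg *cfg)
{
	struct page_pool *pool = page_pool_create(&cfg->pp_params);

	if (IS_ERR(pool))
		return PTR_ERR(pool);
	rxq->page_pool = pool;
	return 0;
}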

> 2. API to add and remove a dma-buf (or host memory buffer) with a
> page-pool. Remove may take time to flush references pushed to hardware
> so this would be asynchronous.
> 
> 3. Create a queue or use an existing queue id and associate a page-pool
> with it.
> 
> > way, the configuration will take place immediately, and not rely on an
> > entire driver reset to actuate the change.  
> 
> yes
> 
> > 
> > 2. The uAPI will be changed such that if the netlink socket is closed,
> > or the process dies, the rx queue will be unbound from the dma-buf or
> > the rx queue will be freed entirely (again, not sure which is better  
> 
> I think those are separate actions. But, if the queue was created by and
> referenced by a process, then closing an fd means it should be freed.
> 
> > at the moment). The configuration will take place immediately without
> > relying on a driver reset.  
> 
> yes on the reset.
> 
> > 
> > 3. I will add 4 new net_device_ops that Jakub specified:
> > queue_mem_alloc/free(), and queue_start/stop().
> > 
> > 4. The uAPI mentioned in #1 will use the new net_device_ops to
> > allocate or reconfigure a queue attached to the provided dma-buf.

I'd leave 2, 3, 4 alone for now. Focus on binding a page pool to 
an existing queue.

> > Does this sound roughly reasonable here?
> > 
> > AFAICT the only technical difficulty is that I'm not sure it's
> > feasible for a driver to start or stop 1 rx-queue without triggering a
> > full driver reset. The (2) drivers I looked at both do a full reset to
> > change any queue configuration. I'll investigate.  
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19  2:06                 ` Jakub Kicinski
@ 2023-08-19  3:30                   ` David Ahern
  2023-08-19 14:18                     ` Willem de Bruijn
  0 siblings, 1 reply; 62+ messages in thread
From: David Ahern @ 2023-08-19  3:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Mina Almasry, Praveen Kaligineedi, Willem de Bruijn, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On 8/18/23 8:06 PM, Jakub Kicinski wrote:
> On Fri, 18 Aug 2023 19:34:32 -0600 David Ahern wrote:
>> On 8/18/23 3:52 PM, Mina Almasry wrote:
>>> The sticking points are:
>>> 1. From David: this proposal doesn't give an application the ability
>>> to flush an rx queue, which means that we have to rely on a driver
>>> reset that affects all queues to refill the rx queue buffers.  
>>
>> Generically, the design needs to be able to flush (or invalidate) all
>> references to the dma-buf once the process no longer "owns" it.
> 
> Are we talking about the ability for the app to flush the queue
> when it wants to (do no idea what)? Or auto-flush when app crashes?

If a buffer reference can be invalidated such that a posted buffer is
ignored by H/W, then no flush is needed per se. Either way the key point
is that posted buffers can no longer be filled by H/W once a process no
longer owns the dma-buf reference. I believe the actual mechanism here
will vary by H/W.

> 
>>> 2. From Jakub: the uAPI and implementation here needs to be in line
>>> with his general direction & extensible to apply to existing use cases
>>> `ethtool -L/-G`, etc.  
>>
>> I think this is a bit more open ended given the openness of the netdev
>> netlink API. i.e., managing a H/W queue (create, delete, stop / flush,
>> associate a page_pool) could be done through this API.
>>
>>>
>>> AFAIU this is what I need to do in the next version:
>>>
>>> 1. The uAPI will be changed such that it will either re-configure an
>>> existing queue to bind it to the dma-buf, or allocate a new queue
>>> bound to the dma-buf (not sure which is better at the moment). Either  
>>
>> 1. API to manage a page-pool (create, delete, update).
> 
> I wasn't anticipating a "create page pool" API.
> 
> I was thinking of a scheme where user space sets page pool parameters,
> but the driver still creates the pool.

There needs to be a page pool unique to a process (or process group,
depending on design) due to the lifetime of what is backing it. The driver
or core netdev code can create it; if it is tied to an rx queue and
created by the driver there are design implications. As separate objects,
page pools and rx queues can have their own lifetimes (e.g., multiple rx
queues for a single pp); generically, a H/W queue and the supply of
buffers used to land packets are independent.
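
A minimal sketch of that separation, with made-up driver types (struct
my_rxq, rxqs_attach_pool()); only page_pool_create()/page_pool_destroy()
are real API calls, and actually sharing one pool across live queues would
of course need the usual NAPI/locking care:

struct my_rxq {				/* hypothetical per-queue state */
	struct page_pool *page_pool;
};

static void rxqs_attach_pool(struct my_rxq *rxqs, int nr,
			     struct page_pool *pool)
{
	int i;

	for (i = 0; i < nr; i++)	/* e.g. multiple rx queues, one pp */
		rxqs[i].page_pool = pool;
}

static void rxqs_detach_pool(struct my_rxq *rxqs, int nr)
{
	int i;

	for (i = 0; i < nr; i++)	/* queue teardown != pool teardown */
		rxqs[i].page_pool = NULL;
	/* page_pool_destroy() happens separately, when the owner
	 * (e.g. the process that set it up) goes away. */
}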

> 
> But I guess it is doable. More work, tho. Are there ibverbs which
> can do it? lol.

Well, generically yes - think intent, not necessarily a 1:1 mapping.

> 
>> 2. API to add and remove a dma-buf (or host memory buffer) with a
>> page-pool. Remove may take time to flush references pushed to hardware
>> so this would be asynchronous.
>>
>> 3. Create a queue or use an existing queue id and associate a page-pool
>> with it.
>>
>>> way, the configuration will take place immediately, and not rely on an
>>> entire driver reset to actuate the change.  
>>
>> yes
>>
>>>
>>> 2. The uAPI will be changed such that if the netlink socket is closed,
>>> or the process dies, the rx queue will be unbound from the dma-buf or
>>> the rx queue will be freed entirely (again, not sure which is better  
>>
>> I think those are separate actions. But, if the queue was created by and
>> referenced by a process, then closing an fd means it should be freed.
>>
>>> at the moment). The configuration will take place immediately without
>>> relying on a driver reset.  
>>
>> yes on the reset.
>>
>>>
>>> 3. I will add 4 new net_device_ops that Jakub specified:
>>> queue_mem_alloc/free(), and queue_start/stop().
>>>
>>> 4. The uAPI mentioned in #1 will use the new net_device_ops to
>>> allocate or reconfigure a queue attached to the provided dma-buf.
> 
> I'd leave 2, 3, 4 alone for now. Focus on binding a page pool to 
> an existing queue.
> 
>>> Does this sound roughly reasonable here?
>>>
>>> AFAICT the only technical difficulty is that I'm not sure it's
>>> feasible for a driver to start or stop 1 rx-queue without triggering a
>>> full driver reset. The (2) drivers I looked at both do a full reset to
>>> change any queue configuration. I'll investigate.  
>>
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-10  1:57 ` [RFC PATCH v2 06/11] page-pool: add device memory support Mina Almasry
@ 2023-08-19  9:51   ` Jesper Dangaard Brouer
  2023-08-19 14:08     ` Willem de Bruijn
  2023-08-22  6:05     ` Mina Almasry
  0 siblings, 2 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-19  9:51 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: brouer, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf



On 10/08/2023 03.57, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 

I don't like this approach.
This is adding code to the PP (page_pool) fast-path in multiple places.

I've not had time to run my usual benchmarks, which are here:
 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

But I'm sure it will affect performance.

Regardless of performance, this approach uses pointer LSB bits to hide
that the page pointers are not really struct pages, which feels like
force-feeding a solution just to use the page_pool APIs.


> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> The memory providers producing page_pool_iov can set the LSB on the
> struct page* returned to the page pool.
> 
> Note that instead of overloading the LSB of page pointers, we can
> instead define a new union between struct page & struct page_pool_iov and
> compact it in a new type. However, we'd need to implement the code churn
> to modify the page_pool & drivers to use this new type. For this POC
> that is not implemented (feedback welcome).
> 

I've said before that I prefer multiplexing on page->pp_magic.
For your page_pool_iov the layout would have to match the offset of
pp_magic to do this. (And if insisting on using PP infra, the refcnt
would also need to align.)

On the allocation side, all drivers already use a driver helper,
page_pool_dev_alloc_pages(); we could add another (better named)
helper to multiplex between other types of allocators, e.g. a devmem
allocator.

On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
could multiplex on pp_magic and call another API.  The API could be an
extension to the PP helpers, but it could also be a devmem allocator helper.

IMHO forcing/piggy-backing everything into page_pool is not the right
solution.  I really think the netstack needs to support different allocator
types. The page pool has been leading the way, yes, but perhaps it is
time to add an API layer that could e.g. be named netmem, which gives us
the multiplexing between allocators.  In that process some of the page_pool
APIs would be lifted out as common blocks and others would remain.
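
For illustration, the kind of demultiplexing meant here could look roughly
like the following; DEVMEM_SIGNATURE and devmem_return_skb_page() are
made-up names, while pp_magic, PP_SIGNATURE and page_pool_return_skb_page()
are the existing ones:

/* Sketch: pick the allocator from page->pp_magic instead of from pointer
 * LSBs. A devmem object would have to place its own "pp_magic" field at
 * the same offset as struct page's for this to work. */
static inline bool netmem_return_skb_page(struct page *page, bool napi_safe)
{
	unsigned long magic = page->pp_magic & ~0x3UL;

	if (magic == PP_SIGNATURE)
		return page_pool_return_skb_page(page, napi_safe);
	if (magic == DEVMEM_SIGNATURE)				/* hypothetical */
		return devmem_return_skb_page(page, napi_safe);	/* hypothetical */

	return false;	/* not allocator-backed; fall back to the put_page() path */
}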

--Jesper

> I have a sample implementation of adding a new page_pool_token type
> in the page_pool to give a general idea here:
> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> 
> Full branch here:
> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> 
> (In the branches above, page_pool_iov is called devmem_slice).
> 
> Could also add static_branch to speed up the checks in page_pool_iov
> memory providers are being used.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>   2 files changed, 131 insertions(+), 28 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 537eb36115ed..f08ca230d68e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>   	return NULL;
>   }
> 
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +	return page_ref_count(page);
> +}
> +
> +static inline void page_pool_page_get_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_get_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	return page_ref_add(page, count);
> +}
> +
> +static inline void page_pool_page_put_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_put_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	if (count > 1)
> +		page_ref_sub(page, count - 1);
> +
> +	put_page(page);
> +}
> +
> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return false;
> +
> +	return page_is_pfmemalloc(page);
> +}
> +
> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> +{
> +	/* Assume page_pool_iov are on the preferred node without actually
> +	 * checking...
> +	 *
> +	 * This check is only used to check for recycling memory in the page
> +	 * pool's fast paths. Currently the only implementation of page_pool_iov
> +	 * is dmabuf device memory. It's a deliberate decision by the user to
> +	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> +	 * would not be able to reallocate memory from another dmabuf that
> +	 * exists on the preferred node, so, this check doesn't make much sense
> +	 * in this case. Assume all page_pool_iovs can be recycled for now.
> +	 */
> +	if (page_is_page_pool_iov(page))
> +		return true;
> +
> +	return page_to_nid(page) == pref_nid;
> +}
> +
>   struct page_pool {
>   	struct page_pool_params p;
> 
> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>   {
>   	long ret;
> 
> +	if (page_is_page_pool_iov(page))
> +		return -EINVAL;
> +
>   	/* If nr == pp_frag_count then we have cleared all remaining
>   	 * references to the page. No need to actually overwrite it, instead
>   	 * we can leave this to be overwritten by the calling function.
> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> 
>   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>   {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret;
> +
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> +
> +	ret = page->dma_addr;
> 
>   	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>   		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> 
>   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>   {
> +	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}
> +
>   	page->dma_addr = addr;
>   	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>   		page->dma_addr_upper = upper_32_bits(addr);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 0a7c08d748b8..20c1f74fd844 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>   		if (unlikely(!page))
>   			break;
> 
> -		if (likely(page_to_nid(page) == pref_nid)) {
> +		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>   			pool->alloc.cache[pool->alloc.count++] = page;
>   		} else {
>   			/* NUMA mismatch;
> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>   					  struct page *page,
>   					  unsigned int dma_sync_size)
>   {
> -	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +	dma_addr_t dma_addr;
> +
> +	/* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}
> +
> +	dma_addr = page_pool_get_dma_addr(page);
> 
>   	dma_sync_size = min(dma_sync_size, pool->p.max_len);
>   	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>   {
>   	dma_addr_t dma;
> 
> +	if (page_is_page_pool_iov(page)) {
> +		/* page_pool_iovs are already mapped */
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return true;
> +	}
> +
>   	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>   	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
>   	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>   static void page_pool_set_pp_info(struct page_pool *pool,
>   				  struct page *page)
>   {
> -	page->pp = pool;
> -	page->pp_magic |= PP_SIGNATURE;
> +	if (!page_is_page_pool_iov(page)) {
> +		page->pp = pool;
> +		page->pp_magic |= PP_SIGNATURE;
> +	} else {
> +		page_to_page_pool_iov(page)->pp = pool;
> +	}
> +
>   	if (pool->p.init_callback)
>   		pool->p.init_callback(page, pool->p.init_arg);
>   }
> 
>   static void page_pool_clear_pp_info(struct page *page)
>   {
> +	if (page_is_page_pool_iov(page)) {
> +		page_to_page_pool_iov(page)->pp = NULL;
> +		return;
> +	}
> +
>   	page->pp_magic = 0;
>   	page->pp = NULL;
>   }
> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>   		return false;
>   	}
> 
> -	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
> +	/* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>   	pool->alloc.cache[pool->alloc.count++] = page;
>   	recycle_stat_inc(pool, cached);
>   	return true;
> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>   	 * refcnt == 1 means page_pool owns page, and can recycle it.
>   	 *
>   	 * page is NOT reusable when allocated when system is under
> -	 * some pressure. (page_is_pfmemalloc)
> +	 * some pressure. (page_pool_page_is_pfmemalloc)
>   	 */
> -	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> +	if (likely(page_pool_page_ref_count(page) == 1 &&
> +		   !page_pool_page_is_pfmemalloc(page))) {
>   		/* Read barrier done in page_ref_count / READ_ONCE */
> 
>   		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>   	if (likely(page_pool_defrag_page(page, drain_count)))
>   		return NULL;
> 
> -	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> +	if (page_pool_page_ref_count(page) == 1 &&
> +	    !page_pool_page_is_pfmemalloc(page)) {
>   		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>   			page_pool_dma_sync_for_device(pool, page, -1);
> 
> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>   	/* Empty recycle ring */
>   	while ((page = ptr_ring_consume_bh(&pool->ring))) {
>   		/* Verify the refcnt invariant of cached pages */
> -		if (!(page_ref_count(page) == 1))
> +		if (!(page_pool_page_ref_count(page) == 1))
>   			pr_crit("%s() page_pool refcnt %d violation\n",
> -				__func__, page_ref_count(page));
> +				__func__, page_pool_page_ref_count(page));
> 
>   		page_pool_return_page(pool, page);
>   	}
> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>   	struct page_pool *pp;
>   	bool allow_direct;
> 
> -	page = compound_head(page);
> +	if (!page_is_page_pool_iov(page)) {
> +		page = compound_head(page);
> 
> -	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> -	 * in order to preserve any existing bits, such as bit 0 for the
> -	 * head page of compound page and bit 1 for pfmemalloc page, so
> -	 * mask those bits for freeing side when doing below checking,
> -	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> -	 * to avoid recycling the pfmemalloc page.
> -	 */
> -	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> -		return false;
> +		/* page->pp_magic is OR'ed with PP_SIGNATURE after the
> +		 * allocation in order to preserve any existing bits, such as
> +		 * bit 0 for the head page of compound page and bit 1 for
> +		 * pfmemalloc page, so mask those bits for freeing side when
> +		 * doing below checking, and page_pool_page_is_pfmemalloc() is
> +		 * checked in __page_pool_put_page() to avoid recycling the
> +		 * pfmemalloc page.
> +		 */
> +		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> +			return false;
> 
> -	pp = page->pp;
> +		pp = page->pp;
> +	} else {
> +		pp = page_to_page_pool_iov(page)->pp;
> +	}
> 
>   	/* Allow direct recycle if we have reasons to believe that we are
>   	 * in the same context as the consumer would run, so there's
> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> 
>   	for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>   		page = hu->page[idx] + j;
> -		if (page_ref_count(page) != 1) {
> +		if (page_pool_page_ref_count(page) != 1) {
>   			pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> -				page_ref_count(page), idx, j);
> +				page_pool_page_ref_count(page), idx, j);
>   			return true;
>   		}
>   	}
> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>   			continue;
> 
>   		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> -		    page_ref_count(page) != 1) {
> +		    page_pool_page_ref_count(page) != 1) {
>   			atomic_inc(&mp_huge_ins_b);
>   			continue;
>   		}
> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>   	free = true;
>   	for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>   		page = hu->page + i;
> -		if (page_ref_count(page) != 1) {
> +		if (page_pool_page_ref_count(page) != 1) {
>   			pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> -				page_ref_count(page), i);
> +				page_pool_page_ref_count(page), i);
>   			free = false;
>   			break;
>   		}
> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>   		page = hu->page + page_i;
> 
>   		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> -		    page_ref_count(page) != 1) {
> +		    page_pool_page_ref_count(page) != 1) {
>   			atomic_inc(&mp_huge_ins_b);
>   			continue;
>   		}
> --
> 2.41.0.640.ga95def55d0-goog
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19  9:51   ` Jesper Dangaard Brouer
@ 2023-08-19 14:08     ` Willem de Bruijn
  2023-08-19 15:22       ` Jesper Dangaard Brouer
  2023-08-22  6:05     ` Mina Almasry
  1 sibling, 1 reply; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-19 14:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mina Almasry, netdev, linux-media, dri-devel, brouer,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>
> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).

Perhaps I misunderstand, but this suggests continuing to use
struct page to demultiplex the memory type?

I think the feedback has been strong against multiplexing yet another
memory type, one that is not a real page, into that struct. Which is why
we went in this direction. This latest series limits the impact largely
to networking structures and code.

One way or another, there will be a branch and multiplexing. Whether
that is in struct page, the page pool or a new netdev mem type as you
propose.

Any regression in page pool can be avoided in the common case that
does not use device mem by placing that behind a static_branch. Would
that address your performance concerns?
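
For what it's worth, a minimal sketch of that static_branch guard - the key
name and the wrapper are assumptions, page_is_page_pool_iov() is the helper
from this series:

/* Enabled only once the first devmem/page_pool_iov provider is installed,
 * so the common (no devmem) case compiles down to a patched nop. */
DEFINE_STATIC_KEY_FALSE(page_pool_devmem_used);		/* hypothetical key */

static inline bool page_maybe_page_pool_iov(struct page *page)
{
	if (!static_branch_unlikely(&page_pool_devmem_used))
		return false;		/* devmem never configured on this system */

	return page_is_page_pool_iov(page);
}

/* A devmem provider would static_branch_inc() the key when it is attached
 * to a pool and static_branch_dec() it on teardown. */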

>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types.

To me this is lifting page_pool into such a netstack allocator pool.

Not sure adding another explicit layer of indirection would be cleaner
or faster (potentially more indirect calls).

As for the LSB trick: it avoided adding a lot of boilerplate churn
with a new type and helper functions.
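
For readers following along, the LSB trick amounts to roughly this - the
helper names match the series, but the bodies here are a guess for
illustration, not the actual patch:

/* Bit 0 is always clear in a real struct page pointer (alignment), so it
 * can be used to tag "this is really a page_pool_iov". */
#define PP_IOV_BIT	0x1UL			/* name assumed */

static inline bool page_is_page_pool_iov(struct page *page)
{
	return (unsigned long)page & PP_IOV_BIT;
}

static inline struct page *page_pool_iov_to_page(struct page_pool_iov *iov)
{
	return (struct page *)((unsigned long)iov | PP_IOV_BIT);
}

static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
{
	if (page_is_page_pool_iov(page))
		return (struct page_pool_iov *)((unsigned long)page & ~PP_IOV_BIT);

	return NULL;
}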



> The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >       return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +     return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     if (count > 1)
> > +             page_ref_sub(page, count - 1);
> > +
> > +     put_page(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return false;
> > +
> > +     return page_is_pfmemalloc(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > +{
> > +     /* Assume page_pool_iov are on the preferred node without actually
> > +      * checking...
> > +      *
> > +      * This check is only used to check for recycling memory in the page
> > +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > +      * is dmabuf device memory. It's a deliberate decision by the user to
> > +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > +      * would not be able to reallocate memory from another dmabuf that
> > +      * exists on the preferred node, so, this check doesn't make much sense
> > +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > +      */
> > +     if (page_is_page_pool_iov(page))
> > +             return true;
> > +
> > +     return page_to_nid(page) == pref_nid;
> > +}
> > +
> >   struct page_pool {
> >       struct page_pool_params p;
> >
> > @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >   {
> >       long ret;
> >
> > +     if (page_is_page_pool_iov(page))
> > +             return -EINVAL;
> > +
> >       /* If nr == pp_frag_count then we have cleared all remaining
> >        * references to the page. No need to actually overwrite it, instead
> >        * we can leave this to be overwritten by the calling function.
> > @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >   {
> > -     dma_addr_t ret = page->dma_addr;
> > +     dma_addr_t ret;
> > +
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > +
> > +     ret = page->dma_addr;
> >
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >
> >   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >   {
> > +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> >       page->dma_addr = addr;
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               page->dma_addr_upper = upper_32_bits(addr);
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 0a7c08d748b8..20c1f74fd844 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >               if (unlikely(!page))
> >                       break;
> >
> > -             if (likely(page_to_nid(page) == pref_nid)) {
> > +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >                       pool->alloc.cache[pool->alloc.count++] = page;
> >               } else {
> >                       /* NUMA mismatch;
> > @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >                                         struct page *page,
> >                                         unsigned int dma_sync_size)
> >   {
> > -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > +     dma_addr_t dma_addr;
> > +
> > +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> > +     dma_addr = page_pool_get_dma_addr(page);
> >
> >       dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   {
> >       dma_addr_t dma;
> >
> > +     if (page_is_page_pool_iov(page)) {
> > +             /* page_pool_iovs are already mapped */
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return true;
> > +     }
> > +
> >       /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >        * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >        * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   static void page_pool_set_pp_info(struct page_pool *pool,
> >                                 struct page *page)
> >   {
> > -     page->pp = pool;
> > -     page->pp_magic |= PP_SIGNATURE;
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page->pp = pool;
> > +             page->pp_magic |= PP_SIGNATURE;
> > +     } else {
> > +             page_to_page_pool_iov(page)->pp = pool;
> > +     }
> > +
> >       if (pool->p.init_callback)
> >               pool->p.init_callback(page, pool->p.init_arg);
> >   }
> >
> >   static void page_pool_clear_pp_info(struct page *page)
> >   {
> > +     if (page_is_page_pool_iov(page)) {
> > +             page_to_page_pool_iov(page)->pp = NULL;
> > +             return;
> > +     }
> > +
> >       page->pp_magic = 0;
> >       page->pp = NULL;
> >   }
> > @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >               return false;
> >       }
> >
> > -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >       pool->alloc.cache[pool->alloc.count++] = page;
> >       recycle_stat_inc(pool, cached);
> >       return true;
> > @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >        * refcnt == 1 means page_pool owns page, and can recycle it.
> >        *
> >        * page is NOT reusable when allocated when system is under
> > -      * some pressure. (page_is_pfmemalloc)
> > +      * some pressure. (page_pool_page_is_pfmemalloc)
> >        */
> > -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > +                !page_pool_page_is_pfmemalloc(page))) {
> >               /* Read barrier done in page_ref_count / READ_ONCE */
> >
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >       if (likely(page_pool_defrag_page(page, drain_count)))
> >               return NULL;
> >
> > -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > +     if (page_pool_page_ref_count(page) == 1 &&
> > +         !page_pool_page_is_pfmemalloc(page)) {
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >                       page_pool_dma_sync_for_device(pool, page, -1);
> >
> > @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >       /* Empty recycle ring */
> >       while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >               /* Verify the refcnt invariant of cached pages */
> > -             if (!(page_ref_count(page) == 1))
> > +             if (!(page_pool_page_ref_count(page) == 1))
> >                       pr_crit("%s() page_pool refcnt %d violation\n",
> > -                             __func__, page_ref_count(page));
> > +                             __func__, page_pool_page_ref_count(page));
> >
> >               page_pool_return_page(pool, page);
> >       }
> > @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >       struct page_pool *pp;
> >       bool allow_direct;
> >
> > -     page = compound_head(page);
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page = compound_head(page);
> >
> > -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > -      * in order to preserve any existing bits, such as bit 0 for the
> > -      * head page of compound page and bit 1 for pfmemalloc page, so
> > -      * mask those bits for freeing side when doing below checking,
> > -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > -      * to avoid recycling the pfmemalloc page.
> > -      */
> > -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > -             return false;
> > +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > +              * allocation in order to preserve any existing bits, such as
> > +              * bit 0 for the head page of compound page and bit 1 for
> > +              * pfmemalloc page, so mask those bits for freeing side when
> > +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > +              * checked in __page_pool_put_page() to avoid recycling the
> > +              * pfmemalloc page.
> > +              */
> > +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > +                     return false;
> >
> > -     pp = page->pp;
> > +             pp = page->pp;
> > +     } else {
> > +             pp = page_to_page_pool_iov(page)->pp;
> > +     }
> >
> >       /* Allow direct recycle if we have reasons to believe that we are
> >        * in the same context as the consumer would run, so there's
> > @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >
> >       for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >               page = hu->page[idx] + j;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), idx, j);
> > +                             page_pool_page_ref_count(page), idx, j);
> >                       return true;
> >               }
> >       }
> > @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >                       continue;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >       free = true;
> >       for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >               page = hu->page + i;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), i);
> > +                             page_pool_page_ref_count(page), i);
> >                       free = false;
> >                       break;
> >               }
> > @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >               page = hu->page + page_i;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > --
> > 2.41.0.640.ga95def55d0-goog
> >
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19  3:30                   ` David Ahern
@ 2023-08-19 14:18                     ` Willem de Bruijn
  2023-08-19 17:59                       ` Mina Almasry
                                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-19 14:18 UTC (permalink / raw)
  To: David Ahern
  Cc: Jakub Kicinski, Mina Almasry, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Fri, Aug 18, 2023 at 11:30 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/18/23 8:06 PM, Jakub Kicinski wrote:
> > On Fri, 18 Aug 2023 19:34:32 -0600 David Ahern wrote:
> >> On 8/18/23 3:52 PM, Mina Almasry wrote:
> >>> The sticking points are:
> >>> 1. From David: this proposal doesn't give an application the ability
> >>> to flush an rx queue, which means that we have to rely on a driver
> >>> reset that affects all queues to refill the rx queue buffers.
> >>
> >> Generically, the design needs to be able to flush (or invalidate) all
> >> references to the dma-buf once the process no longer "owns" it.
> >
> > Are we talking about the ability for the app to flush the queue
> > when it wants to (do no idea what)? Or auto-flush when app crashes?
>
> If a buffer reference can be invalidated such that a posted buffer is
> ignored by H/W, then no flush is needed per se. Either way the key point
> is that posted buffers can no longer be filled by H/W once a process no
> longer owns the dma-buf reference. I believe the actual mechanism here
> will vary by H/W.

Right. Many devices only allow bringing all queues down at the same time.

Once a descriptor is posted and the ring head is written, there is no
way to retract that. Since waiting for the device to catch up is not
acceptable, the only option is to bring down the queue, right? Which
will imply bringing down the entire device on many devices. Not ideal,
but acceptable short term, imho.

That may be an incentive for vendors to support per-queue
start/stop/alloc/free. Maybe the ones that support RDMA already do?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 14:08     ` Willem de Bruijn
@ 2023-08-19 15:22       ` Jesper Dangaard Brouer
  2023-08-19 15:49         ` David Ahern
                           ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-19 15:22 UTC (permalink / raw)
  To: Willem de Bruijn, Jesper Dangaard Brouer
  Cc: brouer, Mina Almasry, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf



On 19/08/2023 16.08, Willem de Bruijn wrote:
> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 10/08/2023 03.57, Mina Almasry wrote:
>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>
>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>> these helpers instead.
>>>
>>
>> I don't like of this approach.
>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>
>> I've not had time to run my usual benchmarks, which are here:
>>
>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>
>> But I'm sure it will affect performance.
>>
>> Regardless of performance, this approach is using ptr-LSB-bits, to hide
>> that page-pointer are not really struct-pages, feels like force feeding
>> a solution just to use the page_pool APIs.
>>
>>
>>> In areas where struct page* is dereferenced, add a check for special
>>> handling of page_pool_iov.
>>>
>>> The memory providers producing page_pool_iov can set the LSB on the
>>> struct page* returned to the page pool.
>>>
>>> Note that instead of overloading the LSB of page pointers, we can
>>> instead define a new union between struct page & struct page_pool_iov and
>>> compact it in a new type. However, we'd need to implement the code churn
>>> to modify the page_pool & drivers to use this new type. For this POC
>>> that is not implemented (feedback welcome).
>>>
>>
>> I've said before, that I prefer multiplexing on page->pp_magic.
>> For your page_pool_iov the layout would have to match the offset of
>> pp_magic, to do this. (And if insisting on using PP infra the refcnt
>> would also need to align).
> 
> Perhaps I misunderstand, but this suggests continuing to using
> struct page to demultiplex memory type?
> 

(Perhaps we are misunderstanding each other and my use of the words 
multiplexing and demultiplex is wrong, I'm sorry, as English isn't my 
native language.)

I do see the problem of depending on having a struct page, as the 
page_pool_iov isn't related to struct page.  Having "page" in the name 
of "page_pool_iov" is also confusing (the hardest problem in CS is 
naming, as we all know).

To support more allocator types, perhaps the skb->pp_recycle bit needs to 
grow another bit (and be renamed skb->recycle), so we can tell allocator 
types apart: those that are page based and those that are not.
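
To make that concrete, a rough sketch (the field name, width and values 
below are purely illustrative, not a concrete layout proposal):

enum skb_recycle_type {
	SKB_RECYCLE_NONE = 0,	/* no recycling, ordinary kernel pages */
	SKB_RECYCLE_PP,		/* page_pool backed, page based        */
	SKB_RECYCLE_DEVMEM,	/* non-page allocator, e.g. dma-buf    */
};

/* in struct sk_buff, widening the current 1-bit pp_recycle flag: */
/*	__u8			recycle:2;                           */

static inline bool skb_recycle_is_page_based(const struct sk_buff *skb)
{
	return skb->recycle == SKB_RECYCLE_PP;
}

The free/recycle path could then pick the right allocator helper based 
on that field instead of guessing from the pointer.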


> I think the feedback has been strong to not multiplex yet another
> memory type into that struct, that is not a real page. Which is why
> we went into this direction. This latest series limits the impact largely
> to networking structures and code.
> 

Somewhat related to what I'm objecting to: the "page_pool_iov" is not a 
real page, but it is getting recycled into something called "page_pool", 
which, funnily enough, deals with struct-pages internally and depends on 
the struct-page-refcnt.

Given the approach has moved away from using struct page, I also don't 
see the connection with the page_pool. Sorry.

> One way or another, there will be a branch and multiplexing. Whether
> that is in struct page, the page pool or a new netdev mem type as you
> propose.
> 

I'm asking to have this branch/multiplexing done at the call sites.

(IMHO not changing the drivers is a pipe-dream.)

> Any regression in page pool can be avoided in the common case that
> does not use device mem by placing that behind a static_branch. Would
> that address your performance concerns?
> 

No. This will not help.

The problem is that everywhere in the page_pool code it is getting 
polluted with:

   if (page_is_page_pool_iov(page))
     call-some-iov-func-instead()

Like: the very central piece of getting the refcnt:

+static inline int page_pool_page_ref_count(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+	return page_ref_count(page);
+}


The fast-path of the PP is used for XDP_DROP scenarios, and is currently 
around 14 cycles (tsc). Thus, any extra code in this code path will 
change the fast-path.


>>
>> On the allocation side, all drivers already use a driver helper
>> page_pool_dev_alloc_pages() or we could add another (better named)
>> helper to multiplex between other types of allocators, e.g. a devmem
>> allocator.
>>
>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>> could multiplex on pp_magic and call another API.  The API could be an
>> extension to PP helpers, but it could also be a devmap allocator helper.
>>
>> IMHO forcing/piggy-bagging everything into page_pool is not the right
>> solution.  I really think netstack need to support different allocator
>> types.
> 
> To me this is lifting page_pool into such a netstack alloctator pool.
> 

This should be renamed, as it is no longer dealing with pages.

> Not sure adding another explicit layer of indirection would be cleaner
> or faster (potentially more indirect calls).
> 

It seems we are talking past each other.  The layer of indirection I'm 
talking about is likely a simple header file (e.g. named netmem.h) that 
will get inlined at compile time, so there is no overhead. It will be 
used by drivers, such that we can avoid touching drivers again when 
introducing new memory allocator types.
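
Roughly something like this, just to illustrate the shape of it (every 
name below is illustrative, this is not an existing header):

/* include/net/netmem.h -- sketch only */
enum netmem_type {
	NETMEM_TYPE_PAGE,	/* ordinary page_pool page  */
	NETMEM_TYPE_DEVMEM,	/* dma-buf backed allocator */
};

struct netmem {
	enum netmem_type type;
	union {
		struct page		*page;
		struct page_pool_iov	*iov;
	};
};

static inline dma_addr_t netmem_dma_addr(const struct netmem *nmem)
{
	if (nmem->type == NETMEM_TYPE_PAGE)
		return page_pool_get_dma_addr(nmem->page);
	return page_pool_iov_dma_addr(nmem->iov);
}

The branch lives here, in the driver-facing helpers, and the page_pool 
internals stay page-only.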


> As for the LSB trick: that avoided adding a lot of boilerplate churn
> with new type and helper functions.
> 

Says the lazy programmer :-P ... sorry could not resist ;-)

> 
> 
>> The page pool have been leading the way, yes, but perhaps it is
>> time to add an API layer that e.g. could be named netmem, that gives us
>> the multiplexing between allocators.  In that process some of page_pool
>> APIs would be lifted out as common blocks and others remain.
>>
>> --Jesper
>>
>>> I have a sample implementation of adding a new page_pool_token type
>>> in the page_pool to give a general idea here:
>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>
>>> Full branch here:
>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>
>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>
>>> Could also add static_branch to speed up the checks in page_pool_iov
>>> memory providers are being used.
>>>
>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>> ---
>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>> index 537eb36115ed..f08ca230d68e 100644
>>> --- a/include/net/page_pool.h
>>> +++ b/include/net/page_pool.h
>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>        return NULL;
>>>    }
>>>
>>> +static inline int page_pool_page_ref_count(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>> +
>>> +     return page_ref_count(page);
>>> +}
>>> +
>>> +static inline void page_pool_page_get_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     return page_ref_add(page, count);
>>> +}
>>> +
>>> +static inline void page_pool_page_put_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     if (count > 1)
>>> +             page_ref_sub(page, count - 1);
>>> +
>>> +     put_page(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return false;
>>> +
>>> +     return page_is_pfmemalloc(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>> +{
>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>> +      * checking...
>>> +      *
>>> +      * This check is only used to check for recycling memory in the page
>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>> +      * would not be able to reallocate memory from another dmabuf that
>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>> +      */
>>> +     if (page_is_page_pool_iov(page))
>>> +             return true;
>>> +
>>> +     return page_to_nid(page) == pref_nid;
>>> +}
>>> +
>>>    struct page_pool {
>>>        struct page_pool_params p;
>>>
>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>    {
>>>        long ret;
>>>
>>> +     if (page_is_page_pool_iov(page))
>>> +             return -EINVAL;
>>> +
>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>         * references to the page. No need to actually overwrite it, instead
>>>         * we can leave this to be overwritten by the calling function.
>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>
>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>    {
>>> -     dma_addr_t ret = page->dma_addr;
>>> +     dma_addr_t ret;
>>> +
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>> +
>>> +     ret = page->dma_addr;
>>>
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>
>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>    {
>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>>        page->dma_addr = addr;
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                page->dma_addr_upper = upper_32_bits(addr);
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 0a7c08d748b8..20c1f74fd844 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>                if (unlikely(!page))
>>>                        break;
>>>
>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>                } else {
>>>                        /* NUMA mismatch;
>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>                                          struct page *page,
>>>                                          unsigned int dma_sync_size)
>>>    {
>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>> +     dma_addr_t dma_addr;
>>> +
>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>
>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    {
>>>        dma_addr_t dma;
>>>
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             /* page_pool_iovs are already mapped */
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return true;
>>> +     }
>>> +
>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>                                  struct page *page)
>>>    {
>>> -     page->pp = pool;
>>> -     page->pp_magic |= PP_SIGNATURE;
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page->pp = pool;
>>> +             page->pp_magic |= PP_SIGNATURE;
>>> +     } else {
>>> +             page_to_page_pool_iov(page)->pp = pool;
>>> +     }
>>> +
>>>        if (pool->p.init_callback)
>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>    }
>>>
>>>    static void page_pool_clear_pp_info(struct page *page)
>>>    {
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>> +             return;
>>> +     }
>>> +
>>>        page->pp_magic = 0;
>>>        page->pp = NULL;
>>>    }
>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>                return false;
>>>        }
>>>
>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>        recycle_stat_inc(pool, cached);
>>>        return true;
>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>         *
>>>         * page is NOT reusable when allocated when system is under
>>> -      * some pressure. (page_is_pfmemalloc)
>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>         */
>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>                return NULL;
>>>
>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>
>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>        /* Empty recycle ring */
>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>                /* Verify the refcnt invariant of cached pages */
>>> -             if (!(page_ref_count(page) == 1))
>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>> -                             __func__, page_ref_count(page));
>>> +                             __func__, page_pool_page_ref_count(page));
>>>
>>>                page_pool_return_page(pool, page);
>>>        }
>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>        struct page_pool *pp;
>>>        bool allow_direct;
>>>
>>> -     page = compound_head(page);
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page = compound_head(page);
>>>
>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>> -      * mask those bits for freeing side when doing below checking,
>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>> -      * to avoid recycling the pfmemalloc page.
>>> -      */
>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> -             return false;
>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>> +              * allocation in order to preserve any existing bits, such as
>>> +              * bit 0 for the head page of compound page and bit 1 for
>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>> +              * pfmemalloc page.
>>> +              */
>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> +                     return false;
>>>
>>> -     pp = page->pp;
>>> +             pp = page->pp;
>>> +     } else {
>>> +             pp = page_to_page_pool_iov(page)->pp;
>>> +     }
>>>
>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>         * in the same context as the consumer would run, so there's
>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>
>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>                page = hu->page[idx] + j;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), idx, j);
>>> +                             page_pool_page_ref_count(page), idx, j);
>>>                        return true;
>>>                }
>>>        }
>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                        continue;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>        free = true;
>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>                page = hu->page + i;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), i);
>>> +                             page_pool_page_ref_count(page), i);
>>>                        free = false;
>>>                        break;
>>>                }
>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                page = hu->page + page_i;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> --
>>> 2.41.0.640.ga95def55d0-goog
>>>
>>
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 15:22       ` Jesper Dangaard Brouer
@ 2023-08-19 15:49         ` David Ahern
  2023-08-19 16:12           ` Willem de Bruijn
  2023-08-19 16:11         ` Willem de Bruijn
  2023-08-19 20:24         ` Mina Almasry
  2 siblings, 1 reply; 62+ messages in thread
From: David Ahern @ 2023-08-19 15:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Willem de Bruijn
  Cc: brouer, Mina Almasry, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> 
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
> 
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
> 
> 
>> I think the feedback has been strong to not multiplex yet another
>> memory type into that struct, that is not a real page. Which is why
>> we went into this direction. This latest series limits the impact largely
>> to networking structures and code.
>>
> 
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
> 
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.

I do not care for the page_pool_iov name either; I presumed it was the
least change needed to prove an idea, and that the name and details would evolve.

How about something like buffer_pool or netdev_buf_pool that can operate
with either pages, dma addresses, or something else in the future?

> 
>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>> with new type and helper functions.
>>
> 
> Says the lazy programmer :-P ... sorry could not resist ;-)

Use of the LSB (or bits depending on alignment expectations) is a common
trick and already done in quite a few places in the networking stack.
This trick is essential to any realistic change here to incorporate gpu
memory; way too much code will have unnecessary churn without it.

I do prefer my earlier suggestion though, where the skb_frag_t has a
union of relevant types. Instead of `struct page *page` it could
be `void *addr` with the helpers indicating page, iov, or other.
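
Roughly (purely illustrative, not the current skb_frag_t layout, and the
helper below is only a guess at how the tagging could work):

typedef struct skb_frag {
	union {
		struct page		*page;	/* host memory             */
		struct page_pool_iov	*iov;	/* device / dma-buf memory */
		void			*addr;	/* generic view            */
	};
	unsigned int	len;
	unsigned int	offset;
} skb_frag_t;

static inline bool skb_frag_is_page(const skb_frag_t *frag)
{
	/* could reuse the LSB tag from this series, or a separate bit */
	return !((unsigned long)frag->addr & 1UL);
}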


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 15:22       ` Jesper Dangaard Brouer
  2023-08-19 15:49         ` David Ahern
@ 2023-08-19 16:11         ` Willem de Bruijn
  2023-08-19 20:24         ` Mina Almasry
  2 siblings, 0 replies; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-19 16:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, Mina Almasry, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

> > Any regression in page pool can be avoided in the common case that
> > does not use device mem by placing that behind a static_branch. Would
> > that address your performance concerns?
> >
>
> No. This will not help.
>
> The problem is that every where in the page_pool code it is getting
> polluted with:
>
>    if (page_is_page_pool_iov(page))
>      call-some-iov-func-instead()
>
> Like: the very central piece of getting the refcnt:
>
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +       if (page_is_page_pool_iov(page))
> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +       return page_ref_count(page);
> +}
>
>
> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> around 14 cycles (tsc). Thus, any extra code in this code patch will
> change the fast-path.

With static_branch disabled, it would only insert a NOP?
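
E.g. something like this, with a made-up key name; when the key is
disabled the check compiles to a patched-out jump, so the page-only fast
path stays as it is today:

DEFINE_STATIC_KEY_FALSE(page_pool_devmem_key);

static inline int page_pool_page_ref_count(struct page *page)
{
	if (static_branch_unlikely(&page_pool_devmem_key) &&
	    page_is_page_pool_iov(page))
		return page_pool_iov_refcount(page_to_page_pool_iov(page));

	return page_ref_count(page);
}

The key would only be flipped on (static_branch_inc/dec) while at least
one devmem binding exists.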

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 15:49         ` David Ahern
@ 2023-08-19 16:12           ` Willem de Bruijn
  2023-08-21 21:31             ` Jakub Kicinski
  0 siblings, 1 reply; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-19 16:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Jesper Dangaard Brouer, brouer, Mina Almasry, netdev,
	linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Sat, Aug 19, 2023 at 11:49 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> >> I think the feedback has been strong to not multiplex yet another
> >> memory type into that struct, that is not a real page. Which is why
> >> we went into this direction. This latest series limits the impact largely
> >> to networking structures and code.
> >>
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
>
> I do not care for the page_pool_iov name either; I presumed it was least
> change to prove an idea and the name and details would evolve.
>
> How about something like buffer_pool or netdev_buf_pool that can operate
> with either pages, dma addresses, or something else in the future?

Sounds good. I suggested this name, but I see how using page in the
name is not very clear.

> >
> >> As for the LSB trick: that avoided adding a lot of boilerplate churn
> >> with new type and helper functions.
> >>
> >
> > Says the lazy programmer :-P ... sorry could not resist ;-)

:-) For the record, there is a prior version that added a separate type.

I did not like the churn it brought and asked for this.

>
> Use of the LSB (or bits depending on alignment expectations) is a common
> trick and already done in quite a few places in the networking stack.
> This trick is essential to any realistic change here to incorporate gpu
> memory; way too much code will have unnecessary churn without it.
>
> I do prefer my earlier suggestion though where the skb_frag_t has a
> union of relevant types though. Instead of `struct page *page` it could
> be `void *addr` with the helpers indicating page, iov, or other.

Okay. I think that is how we did it previously.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19 14:18                     ` Willem de Bruijn
@ 2023-08-19 17:59                       ` Mina Almasry
  2023-08-21 21:16                       ` Jakub Kicinski
  2023-08-22  3:19                       ` David Ahern
  2 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-19 17:59 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Ahern, Jakub Kicinski, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Sat, Aug 19, 2023 at 7:19 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Fri, Aug 18, 2023 at 11:30 PM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 8/18/23 8:06 PM, Jakub Kicinski wrote:
> > > On Fri, 18 Aug 2023 19:34:32 -0600 David Ahern wrote:
> > >> On 8/18/23 3:52 PM, Mina Almasry wrote:
> > >>> The sticking points are:
> > >>> 1. From David: this proposal doesn't give an application the ability
> > >>> to flush an rx queue, which means that we have to rely on a driver
> > >>> reset that affects all queues to refill the rx queue buffers.
> > >>
> > >> Generically, the design needs to be able to flush (or invalidate) all
> > >> references to the dma-buf once the process no longer "owns" it.
> > >
> > > Are we talking about the ability for the app to flush the queue
> > > when it wants to (do no idea what)? Or auto-flush when app crashes?
> >
> > If a buffer reference can be invalidated such that a posted buffer is
> > ignored by H/W, then no flush is needed per se. Either way the key point
> > is that posted buffers can no longer be filled by H/W once a process no
> > longer owns the dma-buf reference. I believe the actual mechanism here
> > will vary by H/W.
>
> Right. Many devices only allow bringing all queues down at the same time.
>

FWIW, I spoke with Praveen (the GVE maintainer) about this. The suspicion
is that bringing individual queues up/down _should_ work with GVE for
the most part, but that's pending me trying it and confirming.

I think if a driver can't support bringing individual queues up/down,
then none of Jakub's direction for per-queue configs (queue_mem_alloc,
queue_mem_free, queue_start, queue_stop) can be implemented on that
driver, and addressing David's concern about the dma-buf being
auto-detached if the application crashes/exits also cannot be done.
Such a driver will not be able to support device memory TCP unless there
is an option to make it work with a full driver reset.
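
(A rough sketch of how I read those per-queue ops; the struct and
signatures below are my guesses, not an existing interface:)

struct netdev_queue_mgmt_ops {
	int	(*queue_mem_alloc)(struct net_device *dev, int idx, void **ctx);
	void	(*queue_mem_free)(struct net_device *dev, void *ctx);
	int	(*queue_start)(struct net_device *dev, int idx, void *ctx);
	int	(*queue_stop)(struct net_device *dev, int idx, void **ctx);
};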

> Once a descriptor is posted and the ring head is written, there is no
> way to retract that. Since waiting for the device to catch up is not
> acceptable, the only option is to bring down the queue, right? Which
> will imply bringing down the entire device on many devices. Not ideal,
> but acceptable short term, imho.
>

I also wonder if it may be acceptable to have both modes supported.
I.e. (roughly):

1. Add APIs that create an rx-queue bound to a dma-buf.
2. Add APIs that bind an rx-queue to a dma-buf.

Drivers that support per-queue allocation/freeing can implement #1 and
work the way David would like. Drivers that cannot allocate or bring
up individual queues can only support #2, and trigger a driver-reset
to refill or release the dma-buf references.

This patch series already implements API #2.

> That may be an incentive for vendors to support per-queue
> start/stop/alloc/free. Maybe the ones that support RDMA already do?



-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 15:22       ` Jesper Dangaard Brouer
  2023-08-19 15:49         ` David Ahern
  2023-08-19 16:11         ` Willem de Bruijn
@ 2023-08-19 20:24         ` Mina Almasry
  2023-08-19 20:27           ` Mina Almasry
  2023-09-08  2:32           ` David Wei
  2 siblings, 2 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-19 20:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Willem de Bruijn, brouer, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 19/08/2023 16.08, Willem de Bruijn wrote:
> > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>

Thank you for linking that; I'll try to run these against the next submission.

> >> But I'm sure it will affect performance.
> >>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >
> > Perhaps I misunderstand, but this suggests continuing to using
> > struct page to demultiplex memory type?
> >
>
> (Perhaps we are misunderstanding each-other and my use of the words
> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> native language.)
>
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
>
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
>
>
> > I think the feedback has been strong to not multiplex yet another
> > memory type into that struct, that is not a real page. Which is why
> > we went into this direction. This latest series limits the impact largely
> > to networking structures and code.
> >
>
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
>
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.
>
> > One way or another, there will be a branch and multiplexing. Whether
> > that is in struct page, the page pool or a new netdev mem type as you
> > propose.
> >
>
> I'm asking to have this branch/multiplexing done a the call sites.
>
> (IMHO not changing the drivers is a pipe-dream.)
>

I think I understand what Jesper is saying. I think Jesper wants the
page_pool to remain unchanged, and another layer on top of it to do
the multiplexing, i.e.:

driver ---> new_layer (does multiplexing) ---> page_pool ----> mm layer
                                          \--> devmem_pool --> dma-buf layer

But Jakub, I think, wants the page_pool to be the front end, with the
multiplexing happening inside the page pool (Jakub did not write this in
an email, but this is how I interpret his comments from [1] and his
memory provider RFC). So I think Jakub wants:

driver --> page_pool ---> memory_provider (does multiplexing) ---> basic_provider ----> mm layer
                                                              \--> devmem_provider --> dma-buf layer

That is the approach in this RFC.
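
(For reference, roughly the shape of the hook I have in mind; the names
below are illustrative, the real details are in the memory provider RFC:)

struct pp_memory_provider_ops {
	int		(*init)(struct page_pool *pool);
	void		(*destroy)(struct page_pool *pool);
	struct page	*(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
	bool		(*release_page)(struct page_pool *pool,
					struct page *page);
};

The basic provider would wrap the existing page allocation path; the
devmem provider would hand out LSB-tagged page_pool_iov pointers.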

I think proper naming can be figured out and is not a huge issue. In
both cases we can probably minimize the changes to the drivers. In the
first approach the driver will need to use the APIs created by the new
layer. In the second approach, the driver continues to use the
page_pool APIs.

I think we need to converge on one of the 2 approaches (or maybe come up
with a 3rd). For me the pros/cons of each approach are (please add):

multiplexing in new_layer:
- Pro: maybe better for performance? Not sure if static_branch can
achieve the same perf. I can verify with Jesper's perf tests.
- Pro: doesn't add complexity in the page_pool (but adds complexity in
terms of adding new pools like devmem_pool)
- Con: the devmem_pool & page_pool will end up being duplicated code,
I suspect, because they largely do similar things (both need to
recycle memory for example).

multiplexing via memory_provider:
- Pro: no code duplication.
- Pro: fewer changes to the drivers, I think. The drivers can continue
to use the page_pool API; no need to introduce calls to 'new_layer'.
- Con: adds complexity to the page_pool (needs to handle devmem).
- Con: probably careful handling via static_branch needs to be done to
achieve performance.

[1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/

> > Any regression in page pool can be avoided in the common case that
> > does not use device mem by placing that behind a static_branch. Would
> > that address your performance concerns?
> >
>
> No. This will not help.
>
> The problem is that every where in the page_pool code it is getting
> polluted with:
>
>    if (page_is_page_pool_iov(page))
>      call-some-iov-func-instead()
>
> Like: the very central piece of getting the refcnt:
>
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +       if (page_is_page_pool_iov(page))
> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +       return page_ref_count(page);
> +}
>
>
> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> around 14 cycles (tsc). Thus, any extra code in this code patch will
> change the fast-path.
>
>
> >>
> >> On the allocation side, all drivers already use a driver helper
> >> page_pool_dev_alloc_pages() or we could add another (better named)
> >> helper to multiplex between other types of allocators, e.g. a devmem
> >> allocator.
> >>
> >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> >> could multiplex on pp_magic and call another API.  The API could be an
> >> extension to PP helpers, but it could also be a devmap allocator helper.
> >>
> >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> >> solution.  I really think netstack need to support different allocator
> >> types.
> >
> > To me this is lifting page_pool into such a netstack alloctator pool.
> >
>
> This is should be renamed as it is not longer dealing with pages.
>
> > Not sure adding another explicit layer of indirection would be cleaner
> > or faster (potentially more indirect calls).
> >
>
> It seems we are talking past each-other.  The layer of indirection I'm
> talking about is likely a simple header file (e.g. named netmem.h) that
> will get inline compiled so there is no overhead. It will be used by
> driver, such that we can avoid touching driver again when introducing
> new memory allocator types.
>
>
> > As for the LSB trick: that avoided adding a lot of boilerplate churn
> > with new type and helper functions.
> >
>
> Says the lazy programmer :-P ... sorry could not resist ;-)
>
> >
> >
> >> The page pool have been leading the way, yes, but perhaps it is
> >> time to add an API layer that e.g. could be named netmem, that gives us
> >> the multiplexing between allocators.  In that process some of page_pool
> >> APIs would be lifted out as common blocks and others remain.
> >>
> >> --Jesper
> >>
> >>> I have a sample implementation of adding a new page_pool_token type
> >>> in the page_pool to give a general idea here:
> >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >>>
> >>> Full branch here:
> >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >>>
> >>> (In the branches above, page_pool_iov is called devmem_slice).
> >>>
> >>> Could also add static_branch to speed up the checks in page_pool_iov
> >>> memory providers are being used.
> >>>
> >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>> ---
> >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >>> index 537eb36115ed..f08ca230d68e 100644
> >>> --- a/include/net/page_pool.h
> >>> +++ b/include/net/page_pool.h
> >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >>>        return NULL;
> >>>    }
> >>>
> >>> +static inline int page_pool_page_ref_count(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> >>> +
> >>> +     return page_ref_count(page);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_get_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     return page_ref_add(page, count);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_put_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     if (count > 1)
> >>> +             page_ref_sub(page, count - 1);
> >>> +
> >>> +     put_page(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return false;
> >>> +
> >>> +     return page_is_pfmemalloc(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> >>> +{
> >>> +     /* Assume page_pool_iov are on the preferred node without actually
> >>> +      * checking...
> >>> +      *
> >>> +      * This check is only used to check for recycling memory in the page
> >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> >>> +      * would not be able to reallocate memory from another dmabuf that
> >>> +      * exists on the preferred node, so, this check doesn't make much sense
> >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> >>> +      */
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return true;
> >>> +
> >>> +     return page_to_nid(page) == pref_nid;
> >>> +}
> >>> +
> >>>    struct page_pool {
> >>>        struct page_pool_params p;
> >>>
> >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >>>    {
> >>>        long ret;
> >>>
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return -EINVAL;
> >>> +
> >>>        /* If nr == pp_frag_count then we have cleared all remaining
> >>>         * references to the page. No need to actually overwrite it, instead
> >>>         * we can leave this to be overwritten by the calling function.
> >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >>>
> >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>    {
> >>> -     dma_addr_t ret = page->dma_addr;
> >>> +     dma_addr_t ret;
> >>> +
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> >>> +
> >>> +     ret = page->dma_addr;
> >>>
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>
> >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >>>    {
> >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->dma_addr = addr;
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                page->dma_addr_upper = upper_32_bits(addr);
> >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>> index 0a7c08d748b8..20c1f74fd844 100644
> >>> --- a/net/core/page_pool.c
> >>> +++ b/net/core/page_pool.c
> >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >>>                if (unlikely(!page))
> >>>                        break;
> >>>
> >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> >>>                } else {
> >>>                        /* NUMA mismatch;
> >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >>>                                          struct page *page,
> >>>                                          unsigned int dma_sync_size)
> >>>    {
> >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> >>> +     dma_addr_t dma_addr;
> >>> +
> >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     dma_addr = page_pool_get_dma_addr(page);
> >>>
> >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    {
> >>>        dma_addr_t dma;
> >>>
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             /* page_pool_iovs are already mapped */
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return true;
> >>> +     }
> >>> +
> >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> >>>                                  struct page *page)
> >>>    {
> >>> -     page->pp = pool;
> >>> -     page->pp_magic |= PP_SIGNATURE;
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page->pp = pool;
> >>> +             page->pp_magic |= PP_SIGNATURE;
> >>> +     } else {
> >>> +             page_to_page_pool_iov(page)->pp = pool;
> >>> +     }
> >>> +
> >>>        if (pool->p.init_callback)
> >>>                pool->p.init_callback(page, pool->p.init_arg);
> >>>    }
> >>>
> >>>    static void page_pool_clear_pp_info(struct page *page)
> >>>    {
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             page_to_page_pool_iov(page)->pp = NULL;
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->pp_magic = 0;
> >>>        page->pp = NULL;
> >>>    }
> >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >>>                return false;
> >>>        }
> >>>
> >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >>>        pool->alloc.cache[pool->alloc.count++] = page;
> >>>        recycle_stat_inc(pool, cached);
> >>>        return true;
> >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> >>>         *
> >>>         * page is NOT reusable when allocated when system is under
> >>> -      * some pressure. (page_is_pfmemalloc)
> >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> >>>         */
> >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> >>> +                !page_pool_page_is_pfmemalloc(page))) {
> >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> >>>
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> >>>                return NULL;
> >>>
> >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> >>> +     if (page_pool_page_ref_count(page) == 1 &&
> >>> +         !page_pool_page_is_pfmemalloc(page)) {
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> >>>
> >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >>>        /* Empty recycle ring */
> >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >>>                /* Verify the refcnt invariant of cached pages */
> >>> -             if (!(page_ref_count(page) == 1))
> >>> +             if (!(page_pool_page_ref_count(page) == 1))
> >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> >>> -                             __func__, page_ref_count(page));
> >>> +                             __func__, page_pool_page_ref_count(page));
> >>>
> >>>                page_pool_return_page(pool, page);
> >>>        }
> >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >>>        struct page_pool *pp;
> >>>        bool allow_direct;
> >>>
> >>> -     page = compound_head(page);
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page = compound_head(page);
> >>>
> >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >>> -      * in order to preserve any existing bits, such as bit 0 for the
> >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> >>> -      * mask those bits for freeing side when doing below checking,
> >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >>> -      * to avoid recycling the pfmemalloc page.
> >>> -      */
> >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> -             return false;
> >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> >>> +              * allocation in order to preserve any existing bits, such as
> >>> +              * bit 0 for the head page of compound page and bit 1 for
> >>> +              * pfmemalloc page, so mask those bits for freeing side when
> >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> >>> +              * checked in __page_pool_put_page() to avoid recycling the
> >>> +              * pfmemalloc page.
> >>> +              */
> >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> +                     return false;
> >>>
> >>> -     pp = page->pp;
> >>> +             pp = page->pp;
> >>> +     } else {
> >>> +             pp = page_to_page_pool_iov(page)->pp;
> >>> +     }
> >>>
> >>>        /* Allow direct recycle if we have reasons to believe that we are
> >>>         * in the same context as the consumer would run, so there's
> >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >>>
> >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >>>                page = hu->page[idx] + j;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), idx, j);
> >>> +                             page_pool_page_ref_count(page), idx, j);
> >>>                        return true;
> >>>                }
> >>>        }
> >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                        continue;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >>>        free = true;
> >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >>>                page = hu->page + i;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), i);
> >>> +                             page_pool_page_ref_count(page), i);
> >>>                        free = false;
> >>>                        break;
> >>>                }
> >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                page = hu->page + page_i;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> --
> >>> 2.41.0.640.ga95def55d0-goog
> >>>
> >>
> >
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 20:24         ` Mina Almasry
@ 2023-08-19 20:27           ` Mina Almasry
  2023-09-08  2:32           ` David Wei
  1 sibling, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-19 20:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Willem de Bruijn, brouer, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On Sat, Aug 19, 2023 at 1:24 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> >
> >
> >
> > On 19/08/2023 16.08, Willem de Bruijn wrote:
> > > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> > > <jbrouer@redhat.com> wrote:
> > >>
> > >>
> > >>
> > >> On 10/08/2023 03.57, Mina Almasry wrote:
> > >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> > >>>
> > >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > >>> handling on those helpers. Modify callers of these mm APIs with calls to
> > >>> these helpers instead.
> > >>>
> > >>
> > >> I don't like of this approach.
> > >> This is adding code to the PP (page_pool) fast-path in multiple places.
> > >>
> > >> I've not had time to run my usual benchmarks, which are here:
> > >>
> > >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > >>
>
> Thank you for linking that, I'll try to run these against the next submission.
>
> > >> But I'm sure it will affect performance.
> > >>
> > >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> > >> that page-pointer are not really struct-pages, feels like force feeding
> > >> a solution just to use the page_pool APIs.
> > >>
> > >>
> > >>> In areas where struct page* is dereferenced, add a check for special
> > >>> handling of page_pool_iov.
> > >>>
> > >>> The memory providers producing page_pool_iov can set the LSB on the
> > >>> struct page* returned to the page pool.
> > >>>
> > >>> Note that instead of overloading the LSB of page pointers, we can
> > >>> instead define a new union between struct page & struct page_pool_iov and
> > >>> compact it in a new type. However, we'd need to implement the code churn
> > >>> to modify the page_pool & drivers to use this new type. For this POC
> > >>> that is not implemented (feedback welcome).
> > >>>
> > >>
> > >> I've said before, that I prefer multiplexing on page->pp_magic.
> > >> For your page_pool_iov the layout would have to match the offset of
> > >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> > >> would also need to align).
> > >
> > > Perhaps I misunderstand, but this suggests continuing to using
> > > struct page to demultiplex memory type?
> > >
> >
> > (Perhaps we are misunderstanding each-other and my use of the words
> > multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> > native language.)
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> > > I think the feedback has been strong to not multiplex yet another
> > > memory type into that struct, that is not a real page. Which is why
> > > we went into this direction. This latest series limits the impact largely
> > > to networking structures and code.
> > >
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
> >
> > > One way or another, there will be a branch and multiplexing. Whether
> > > that is in struct page, the page pool or a new netdev mem type as you
> > > propose.
> > >
> >
> > I'm asking to have this branch/multiplexing done a the call sites.
> >
> > (IMHO not changing the drivers is a pipe-dream.)
> >
>
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
>
> driver ---> new_layer (does multiplexing) ---> page_pool -------> mm layer
>                                 \------------------------------>
> devmem_pool --> dma-buf layer
>

Gmail mangled this :/ Let me try again. This should read:

driver -> new_layer -> page_pool ------> mm layer
                      \--------->devmem_pool -> dma-buf
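
In code, I imagine that first shape ends up looking very roughly like the
below. Everything here is invented purely for illustration (the only real
call is page_pool_dev_alloc_pages()); it is not an actual proposal:

struct devmem_pool;                      /* hypothetical devmem allocator */
struct netmem;                           /* opaque handle to either kind  */

struct netmem_pool {
	enum { NETMEM_PAGES, NETMEM_DEVMEM } type;
	struct page_pool *page_pool;
	struct devmem_pool *devmem_pool;
};

/* hypothetical allocator entry point for the devmem side */
struct netmem *devmem_pool_alloc(struct devmem_pool *pool);

/* Thin, inline multiplexing layer the driver would call instead of the
 * page_pool directly, so new allocator types don't touch the driver.
 */
static inline struct netmem *netmem_pool_alloc(struct netmem_pool *nmp)
{
	if (nmp->type == NETMEM_PAGES)
		return (struct netmem *)page_pool_dev_alloc_pages(nmp->page_pool);

	return devmem_pool_alloc(nmp->devmem_pool);
}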

> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
>
> driver --> page_pool ---> memory_provider (does multiplexing) --->
> basic_provider -------> mm layer
>
> \----------------------------------------> devmem_provider --> dma-buf
> layer
>

This should read:
driver -> pp -> memory provider -> basic provider -> mm
                                   \--------------> devmem provider -> dma-buf
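
And a rough sketch of that second shape, to make the comparison concrete.
The ops table, the mp_ops field, and all names below are made up for
illustration; this is not the actual API from Jakub's memory provider RFC:

struct pp_memory_provider_ops {
	struct page *(*alloc_page)(struct page_pool *pool, gfp_t gfp);
	bool (*release_page)(struct page_pool *pool, struct page *page);
};

/* Drivers keep calling the existing page_pool API; the pool itself picks
 * between the default mm-backed path and whatever provider is bound.
 */
static struct page *page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
{
	if (!pool->mp_ops)      /* hypothetical field: no provider bound */
		return __page_pool_alloc_pages_slow(pool, gfp);

	return pool->mp_ops->alloc_page(pool, gfp);
}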

Sorry for the spam!

> That is the approach in this RFC.
>
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
>
> I think we need to converge on a path between the 2 approaches (or
> maybe 3rd approach to do). For me the pros/cons of each approach
> (please add):
>
> multiplexing in new_layer:
> - Pro: maybe better for performance? Not sure if static_branch can
> achieve the same perf. I can verify with Jesper's perf tests.
> - Pro: doesn't add complexity in the page_pool (but adds complexity in
> terms of adding new pools like devmem_pool)
> - Con: the devmem_pool & page_pool will end up being duplicated code,
> I suspect, because they largely do similar things (both need to
> recycle memory for example).
>
> multiplexing via memory_provider:
> - Pro: no code duplication.
> - Pro: less changes to the drivers, I think. The drivers can continue
> to use the page_pool API, no need to introduce calls to 'new_layer'.
> - Con: adds complexity to the page_pool (needs to handle devmem).
> - Con: probably careful handling via static_branch needs to be done to
> achieve performance.
>
> [1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/
>
> > > Any regression in page pool can be avoided in the common case that
> > > does not use device mem by placing that behind a static_branch. Would
> > > that address your performance concerns?
> > >
> >
> > No. This will not help.
> >
> > The problem is that every where in the page_pool code it is getting
> > polluted with:
> >
> >    if (page_is_page_pool_iov(page))
> >      call-some-iov-func-instead()
> >
> > Like: the very central piece of getting the refcnt:
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +       if (page_is_page_pool_iov(page))
> > +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +       return page_ref_count(page);
> > +}
> >
> >
> > The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> > around 14 cycles (tsc). Thus, any extra code in this code patch will
> > change the fast-path.
> >
> >
> > >>
> > >> On the allocation side, all drivers already use a driver helper
> > >> page_pool_dev_alloc_pages() or we could add another (better named)
> > >> helper to multiplex between other types of allocators, e.g. a devmem
> > >> allocator.
> > >>
> > >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> > >> could multiplex on pp_magic and call another API.  The API could be an
> > >> extension to PP helpers, but it could also be a devmap allocator helper.
> > >>
> > >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> > >> solution.  I really think netstack need to support different allocator
> > >> types.
> > >
> > > To me this is lifting page_pool into such a netstack alloctator pool.
> > >
> >
> > This is should be renamed as it is not longer dealing with pages.
> >
> > > Not sure adding another explicit layer of indirection would be cleaner
> > > or faster (potentially more indirect calls).
> > >
> >
> > It seems we are talking past each-other.  The layer of indirection I'm
> > talking about is likely a simple header file (e.g. named netmem.h) that
> > will get inline compiled so there is no overhead. It will be used by
> > driver, such that we can avoid touching driver again when introducing
> > new memory allocator types.
> >
> >
> > > As for the LSB trick: that avoided adding a lot of boilerplate churn
> > > with new type and helper functions.
> > >
> >
> > Says the lazy programmer :-P ... sorry could not resist ;-)
> >
> > >
> > >
> > >> The page pool have been leading the way, yes, but perhaps it is
> > >> time to add an API layer that e.g. could be named netmem, that gives us
> > >> the multiplexing between allocators.  In that process some of page_pool
> > >> APIs would be lifted out as common blocks and others remain.
> > >>
> > >> --Jesper
> > >>
> > >>> I have a sample implementation of adding a new page_pool_token type
> > >>> in the page_pool to give a general idea here:
> > >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> > >>>
> > >>> Full branch here:
> > >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> > >>>
> > >>> (In the branches above, page_pool_iov is called devmem_slice).
> > >>>
> > >>> Could also add static_branch to speed up the checks in page_pool_iov
> > >>> memory providers are being used.
> > >>>
> > >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >>> ---
> > >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> > >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> > >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> > >>>
> > >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > >>> index 537eb36115ed..f08ca230d68e 100644
> > >>> --- a/include/net/page_pool.h
> > >>> +++ b/include/net/page_pool.h
> > >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> > >>>        return NULL;
> > >>>    }
> > >>>
> > >>> +static inline int page_pool_page_ref_count(struct page *page)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > >>> +
> > >>> +     return page_ref_count(page);
> > >>> +}
> > >>> +
> > >>> +static inline void page_pool_page_get_many(struct page *page,
> > >>> +                                        unsigned int count)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > >>> +                                           count);
> > >>> +
> > >>> +     return page_ref_add(page, count);
> > >>> +}
> > >>> +
> > >>> +static inline void page_pool_page_put_many(struct page *page,
> > >>> +                                        unsigned int count)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > >>> +                                           count);
> > >>> +
> > >>> +     if (count > 1)
> > >>> +             page_ref_sub(page, count - 1);
> > >>> +
> > >>> +     put_page(page);
> > >>> +}
> > >>> +
> > >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return false;
> > >>> +
> > >>> +     return page_is_pfmemalloc(page);
> > >>> +}
> > >>> +
> > >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > >>> +{
> > >>> +     /* Assume page_pool_iov are on the preferred node without actually
> > >>> +      * checking...
> > >>> +      *
> > >>> +      * This check is only used to check for recycling memory in the page
> > >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> > >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > >>> +      * would not be able to reallocate memory from another dmabuf that
> > >>> +      * exists on the preferred node, so, this check doesn't make much sense
> > >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > >>> +      */
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return true;
> > >>> +
> > >>> +     return page_to_nid(page) == pref_nid;
> > >>> +}
> > >>> +
> > >>>    struct page_pool {
> > >>>        struct page_pool_params p;
> > >>>
> > >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> > >>>    {
> > >>>        long ret;
> > >>>
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return -EINVAL;
> > >>> +
> > >>>        /* If nr == pp_frag_count then we have cleared all remaining
> > >>>         * references to the page. No need to actually overwrite it, instead
> > >>>         * we can leave this to be overwritten by the calling function.
> > >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> > >>>
> > >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> > >>>    {
> > >>> -     dma_addr_t ret = page->dma_addr;
> > >>> +     dma_addr_t ret;
> > >>> +
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > >>> +
> > >>> +     ret = page->dma_addr;
> > >>>
> > >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> > >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> > >>>
> > >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> > >>>    {
> > >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>>        page->dma_addr = addr;
> > >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> > >>>                page->dma_addr_upper = upper_32_bits(addr);
> > >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > >>> index 0a7c08d748b8..20c1f74fd844 100644
> > >>> --- a/net/core/page_pool.c
> > >>> +++ b/net/core/page_pool.c
> > >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> > >>>                if (unlikely(!page))
> > >>>                        break;
> > >>>
> > >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> > >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> > >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> > >>>                } else {
> > >>>                        /* NUMA mismatch;
> > >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> > >>>                                          struct page *page,
> > >>>                                          unsigned int dma_sync_size)
> > >>>    {
> > >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > >>> +     dma_addr_t dma_addr;
> > >>> +
> > >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>> +     dma_addr = page_pool_get_dma_addr(page);
> > >>>
> > >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> > >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> > >>>    {
> > >>>        dma_addr_t dma;
> > >>>
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             /* page_pool_iovs are already mapped */
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return true;
> > >>> +     }
> > >>> +
> > >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> > >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> > >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> > >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> > >>>                                  struct page *page)
> > >>>    {
> > >>> -     page->pp = pool;
> > >>> -     page->pp_magic |= PP_SIGNATURE;
> > >>> +     if (!page_is_page_pool_iov(page)) {
> > >>> +             page->pp = pool;
> > >>> +             page->pp_magic |= PP_SIGNATURE;
> > >>> +     } else {
> > >>> +             page_to_page_pool_iov(page)->pp = pool;
> > >>> +     }
> > >>> +
> > >>>        if (pool->p.init_callback)
> > >>>                pool->p.init_callback(page, pool->p.init_arg);
> > >>>    }
> > >>>
> > >>>    static void page_pool_clear_pp_info(struct page *page)
> > >>>    {
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             page_to_page_pool_iov(page)->pp = NULL;
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>>        page->pp_magic = 0;
> > >>>        page->pp = NULL;
> > >>>    }
> > >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> > >>>                return false;
> > >>>        }
> > >>>
> > >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> > >>>        pool->alloc.cache[pool->alloc.count++] = page;
> > >>>        recycle_stat_inc(pool, cached);
> > >>>        return true;
> > >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> > >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> > >>>         *
> > >>>         * page is NOT reusable when allocated when system is under
> > >>> -      * some pressure. (page_is_pfmemalloc)
> > >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> > >>>         */
> > >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > >>> +                !page_pool_page_is_pfmemalloc(page))) {
> > >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> > >>>
> > >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> > >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> > >>>                return NULL;
> > >>>
> > >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > >>> +     if (page_pool_page_ref_count(page) == 1 &&
> > >>> +         !page_pool_page_is_pfmemalloc(page)) {
> > >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> > >>>
> > >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> > >>>        /* Empty recycle ring */
> > >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> > >>>                /* Verify the refcnt invariant of cached pages */
> > >>> -             if (!(page_ref_count(page) == 1))
> > >>> +             if (!(page_pool_page_ref_count(page) == 1))
> > >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> > >>> -                             __func__, page_ref_count(page));
> > >>> +                             __func__, page_pool_page_ref_count(page));
> > >>>
> > >>>                page_pool_return_page(pool, page);
> > >>>        }
> > >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> > >>>        struct page_pool *pp;
> > >>>        bool allow_direct;
> > >>>
> > >>> -     page = compound_head(page);
> > >>> +     if (!page_is_page_pool_iov(page)) {
> > >>> +             page = compound_head(page);
> > >>>
> > >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > >>> -      * in order to preserve any existing bits, such as bit 0 for the
> > >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> > >>> -      * mask those bits for freeing side when doing below checking,
> > >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > >>> -      * to avoid recycling the pfmemalloc page.
> > >>> -      */
> > >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > >>> -             return false;
> > >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > >>> +              * allocation in order to preserve any existing bits, such as
> > >>> +              * bit 0 for the head page of compound page and bit 1 for
> > >>> +              * pfmemalloc page, so mask those bits for freeing side when
> > >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > >>> +              * checked in __page_pool_put_page() to avoid recycling the
> > >>> +              * pfmemalloc page.
> > >>> +              */
> > >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > >>> +                     return false;
> > >>>
> > >>> -     pp = page->pp;
> > >>> +             pp = page->pp;
> > >>> +     } else {
> > >>> +             pp = page_to_page_pool_iov(page)->pp;
> > >>> +     }
> > >>>
> > >>>        /* Allow direct recycle if we have reasons to believe that we are
> > >>>         * in the same context as the consumer would run, so there's
> > >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> > >>>
> > >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> > >>>                page = hu->page[idx] + j;
> > >>> -             if (page_ref_count(page) != 1) {
> > >>> +             if (page_pool_page_ref_count(page) != 1) {
> > >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > >>> -                             page_ref_count(page), idx, j);
> > >>> +                             page_pool_page_ref_count(page), idx, j);
> > >>>                        return true;
> > >>>                }
> > >>>        }
> > >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> > >>>                        continue;
> > >>>
> > >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > >>> -                 page_ref_count(page) != 1) {
> > >>> +                 page_pool_page_ref_count(page) != 1) {
> > >>>                        atomic_inc(&mp_huge_ins_b);
> > >>>                        continue;
> > >>>                }
> > >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> > >>>        free = true;
> > >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> > >>>                page = hu->page + i;
> > >>> -             if (page_ref_count(page) != 1) {
> > >>> +             if (page_pool_page_ref_count(page) != 1) {
> > >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > >>> -                             page_ref_count(page), i);
> > >>> +                             page_pool_page_ref_count(page), i);
> > >>>                        free = false;
> > >>>                        break;
> > >>>                }
> > >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> > >>>                page = hu->page + page_i;
> > >>>
> > >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > >>> -                 page_ref_count(page) != 1) {
> > >>> +                 page_pool_page_ref_count(page) != 1) {
> > >>>                        atomic_inc(&mp_huge_ins_b);
> > >>>                        continue;
> > >>>                }
> > >>> --
> > >>> 2.41.0.640.ga95def55d0-goog
> > >>>
> > >>
> > >
> >
>
>
> --
> Thanks,
> Mina



-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19 14:18                     ` Willem de Bruijn
  2023-08-19 17:59                       ` Mina Almasry
@ 2023-08-21 21:16                       ` Jakub Kicinski
  2023-08-22  0:38                         ` Willem de Bruijn
  2023-08-22  3:19                       ` David Ahern
  2 siblings, 1 reply; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-21 21:16 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Ahern, Mina Almasry, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Sat, 19 Aug 2023 10:18:57 -0400 Willem de Bruijn wrote:
> Right. Many devices only allow bringing all queues down at the same time.
> 
> Once a descriptor is posted and the ring head is written, there is no
> way to retract that. Since waiting for the device to catch up is not
> acceptable, the only option is to bring down the queue, right? Which
> will imply bringing down the entire device on many devices. Not ideal,
> but acceptable short term, imho.
> 
> That may be an incentive for vendors to support per-queue
> start/stop/alloc/free. Maybe the ones that support RDMA already do?

Are you talking about HW devices, or virt? I thought most HW made 
in the last 10 years should be able to take down individual queues :o

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 16:12           ` Willem de Bruijn
@ 2023-08-21 21:31             ` Jakub Kicinski
  2023-08-22  0:58               ` Willem de Bruijn
  0 siblings, 1 reply; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-21 21:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Ahern, Jesper Dangaard Brouer, brouer, Mina Almasry,
	netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> :-) For the record, there is a prior version that added a separate type.
> 
> I did not like the churn it brought and asked for this.

It does end up looking cleaner than I had personally expected, FWIW.

> > Use of the LSB (or bits depending on alignment expectations) is a common
> > trick and already done in quite a few places in the networking stack.
> > This trick is essential to any realistic change here to incorporate gpu
> > memory; way too much code will have unnecessary churn without it.

We'll end up needing the LSB trick either way, right? The only question
is whether the "if" is part of page pool or the caller of page pool.

Having seen zctap, I'm afraid that if we push this out of the pp, every
provider will end up re-implementing the page pool's recycling/caching
functionality :(

Maybe we need to "fork" the API? The device memory "ifs" are only needed
for data pages, which means we could retain a faster, "if-less" API
for headers and XDP. Or is that too much duplication?
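
To spell out the trick for anyone following along, a minimal, generic
sketch of LSB pointer tagging (these are not the exact helpers from the
series, just the underlying idea):

/* struct page pointers are at least 4-byte aligned, so bit 0 is always
 * clear for a real page pointer and can be borrowed as a "this is not
 * actually a page" marker.
 */
#define NOT_A_PAGE_BIT	0x1UL

static inline bool ptr_is_tagged(const void *ptr)
{
	return (unsigned long)ptr & NOT_A_PAGE_BIT;
}

static inline void *ptr_tag(void *ptr)
{
	return (void *)((unsigned long)ptr | NOT_A_PAGE_BIT);
}

static inline void *ptr_untag(void *ptr)
{
	return (void *)((unsigned long)ptr & ~NOT_A_PAGE_BIT);
}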

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-21 21:16                       ` Jakub Kicinski
@ 2023-08-22  0:38                         ` Willem de Bruijn
  2023-08-22  1:51                           ` Jakub Kicinski
  0 siblings, 1 reply; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-22  0:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Mina Almasry, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Mon, Aug 21, 2023 at 5:17 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 19 Aug 2023 10:18:57 -0400 Willem de Bruijn wrote:
> > Right. Many devices only allow bringing all queues down at the same time.
> >
> > Once a descriptor is posted and the ring head is written, there is no
> > way to retract that. Since waiting for the device to catch up is not
> > acceptable, the only option is to bring down the queue, right? Which
> > will imply bringing down the entire device on many devices. Not ideal,
> > but acceptable short term, imho.
> >
> > That may be an incentive for vendors to support per-queue
> > start/stop/alloc/free. Maybe the ones that support RDMA already do?
>
> Are you talking about HW devices, or virt? I thought most HW made
> in the last 10 years should be able to take down individual queues :o

That's great. This is currently mostly encapsulated device-wide behind
ndo_close, with code looping over all rx rings, say.

Taking a look at one driver, bnxt, it indeed has a per-ring
communication exchange with the device, in hwrm_ring_free_send_msg
("/* Flush rings and disable interrupts */"), which is called before
the other normal steps: napi disable, dma unmap, posted mem free,
irq_release, napi delete and ring mem free.

Is this what you meant? The issue I was unsure of was quiescing the
device immediately, i.e., that hwrm_ring_free_send_msg call.

I guess this means that this could all be structured on a per-queue
basis rather than from ndo_close. Would be a significant change to
many drivers, I'd imagine.
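
Purely as a strawman for discussion, the per-queue interface might end up
shaped roughly like this; nothing like it exists in the tree today and the
names are invented:

/* hypothetical only -- no such ndo ops exist */
struct netdev_queue_mgmt_ops {
	int	(*ndo_queue_stop)(struct net_device *dev, int rxq_idx);
	void	(*ndo_queue_mem_free)(struct net_device *dev, int rxq_idx);
	int	(*ndo_queue_mem_alloc)(struct net_device *dev, int rxq_idx);
	int	(*ndo_queue_start)(struct net_device *dev, int rxq_idx);
};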

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-21 21:31             ` Jakub Kicinski
@ 2023-08-22  0:58               ` Willem de Bruijn
  0 siblings, 0 replies; 62+ messages in thread
From: Willem de Bruijn @ 2023-08-22  0:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Jesper Dangaard Brouer, brouer, Mina Almasry,
	netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

On Mon, Aug 21, 2023 at 5:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> > :-) For the record, there is a prior version that added a separate type.
> >
> > I did not like the churn it brought and asked for this.
>
> It does end up looking cleaner that I personally expected, FWIW.
>
> > > Use of the LSB (or bits depending on alignment expectations) is a common
> > > trick and already done in quite a few places in the networking stack.
> > > This trick is essential to any realistic change here to incorporate gpu
> > > memory; way too much code will have unnecessary churn without it.
>
> We'll end up needing the LSB trick either way, right? The only question
> is whether the "if" is part of page pool or the caller of page pool.

Indeed. Adding layering does not remove this.

> Having seen zctap I'm afraid if we push this out of pp every provider
> will end up re-implementing page pool's recycling/caching functionality
> :(
>
> Maybe we need to "fork" the API? The device memory "ifs" are only needed
> for data pages. Which means that we can retain a faster, "if-less" API
> for headers and XDP. Or is that too much duplication?

I don't think that would be faster. Just a different structuring of
the code. We still need to take one of multiple paths for, say, page
allocation (page_pool_alloc_pages).

If we had both a struct page_pool and a struct mem_pool, there would still
be a type-based branch, but now either in every caller (yech) or in some
thin shim layer. Why not just have it behind the existing API? That is
what your memory provider does. The only difference now is that one of
the providers really does not deal with pages.

I think this multiplexing does not have to introduce performance
regressions with a well-placed static_branch. Benchmarks and a
side-by-side comparison of the generated assembly will have to verify that.

Indirect function calls need to be avoided, of course, in favor of a
type based switch. And unless the non_default_providers static branch
is enabled, the default path is taken unconditionally.
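
A sketch of what I have in mind, reusing this series' helpers and the key
name mentioned above (the exact name and placement are placeholders):

DEFINE_STATIC_KEY_FALSE(non_default_providers);

static inline int page_pool_page_ref_count(struct page *page)
{
	/* With no non-default provider bound anywhere in the system the
	 * branch below is patched to a no-op, and this is effectively
	 * just page_ref_count().
	 */
	if (static_branch_unlikely(&non_default_providers) &&
	    page_is_page_pool_iov(page))
		return page_pool_iov_refcount(page_to_page_pool_iov(page));

	return page_ref_count(page);
}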

If it no longer is a pure page pool, maybe it can be renamed. I
personally don't care.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-22  0:38                         ` Willem de Bruijn
@ 2023-08-22  1:51                           ` Jakub Kicinski
  0 siblings, 0 replies; 62+ messages in thread
From: Jakub Kicinski @ 2023-08-22  1:51 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Ahern, Mina Almasry, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On Mon, 21 Aug 2023 20:38:09 -0400 Willem de Bruijn wrote:
> > Are you talking about HW devices, or virt? I thought most HW made
> > in the last 10 years should be able to take down individual queues :o  
> 
> That's great. This is currently mostly encapsulated device-wide behind
> ndo_close, with code looping over all rx rings, say.
> 
> Taking a look at one driver, bnxt, it indeed has a per-ring
> communication exchange with the device, in hwrm_ring_free_send_msg
> ("/* Flush rings and disable interrupts */"), which is called before
> the other normal steps: napi disable, dma unmap, posted mem free,
> irq_release, napi delete and ring mem free.
> 
> This is what you meant? The issue I was unsure of was quiescing the
> device immediately, i.e., that hwrm_ring_free_send_msg.

Yes, and I recall we had similar APIs at Netronome for the NFP.
I haven't seen it in MS specs myself, but I wouldn't be surprised if
they required it.

There's a bit of an unknown in how well all of this actually works,
as the FW/HW paths were not exercised outside of RDMA and potentially
other proprietary stuff.

> I guess this means that this could all be structured on a per-queue
> basis rather than from ndo_close. Would be a significant change to
> many drivers, I'd imagine.

Yes, it definitely is. The question is how much of it we require
from Mina before merging the mem provider work. I'd really like to
avoid the "delegate all the work to the driver" approach that AF_XDP
has taken, which is what I'm afraid we'll end up with if we push too
hard for a full set of APIs from the start.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-19 14:18                     ` Willem de Bruijn
  2023-08-19 17:59                       ` Mina Almasry
  2023-08-21 21:16                       ` Jakub Kicinski
@ 2023-08-22  3:19                       ` David Ahern
  2 siblings, 0 replies; 62+ messages in thread
From: David Ahern @ 2023-08-22  3:19 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jakub Kicinski, Mina Almasry, Praveen Kaligineedi, netdev,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Magnus Karlsson, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On 8/19/23 8:18 AM, Willem de Bruijn wrote:
> That may be an incentive for vendors to support per-queue
> start/stop/alloc/free. Maybe the ones that support RDMA already do?

I looked at most of the H/W RDMA in-tree kernel drivers today, and all
of them hand off create_qp to firmware. I am really surprised by that
given that many of those drivers work with netdev. I am fairly certain
mlx5 and ConnectX 5-6, for example, can be used to achieve the
architecture we proposed at netdev [1], so I was expecting a general
queue management interface.

Between that and a skim of the providers (userspace side of it), the IB
interface may be a dead-end from the perspective of the goals here.

[1]
https://netdevconf.info/0x16/slides/34/netdev-0x16-Merging-the-Networking-Worlds-slides.pdf,
slide 14.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19  9:51   ` Jesper Dangaard Brouer
  2023-08-19 14:08     ` Willem de Bruijn
@ 2023-08-22  6:05     ` Mina Almasry
  2023-08-22 12:24       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 62+ messages in thread
From: Mina Almasry @ 2023-08-22  6:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-media, dri-devel, brouer, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>

I ported this benchmark over to my tree and ran it; my results:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ Jakub's memory-provider APIs:
https://pastebin.com/raw/StMBhetn

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

+ intentional regression just to make sure the benchmark is working:
https://pastebin.com/raw/wqWhcJdG

I don't seem to be able to detect a regression with this series as-is,
but I'm not that familiar with the test and may be doing something
wrong or misinterpreting the results. Does this look ok to you?

> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).
>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types. The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >       return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +     return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     if (count > 1)
> > +             page_ref_sub(page, count - 1);
> > +
> > +     put_page(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return false;
> > +
> > +     return page_is_pfmemalloc(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > +{
> > +     /* Assume page_pool_iov are on the preferred node without actually
> > +      * checking...
> > +      *
> > +      * This check is only used to check for recycling memory in the page
> > +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > +      * is dmabuf device memory. It's a deliberate decision by the user to
> > +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > +      * would not be able to reallocate memory from another dmabuf that
> > +      * exists on the preferred node, so, this check doesn't make much sense
> > +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > +      */
> > +     if (page_is_page_pool_iov(page))
> > +             return true;
> > +
> > +     return page_to_nid(page) == pref_nid;
> > +}
> > +
> >   struct page_pool {
> >       struct page_pool_params p;
> >
> > @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >   {
> >       long ret;
> >
> > +     if (page_is_page_pool_iov(page))
> > +             return -EINVAL;
> > +
> >       /* If nr == pp_frag_count then we have cleared all remaining
> >        * references to the page. No need to actually overwrite it, instead
> >        * we can leave this to be overwritten by the calling function.
> > @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >   {
> > -     dma_addr_t ret = page->dma_addr;
> > +     dma_addr_t ret;
> > +
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > +
> > +     ret = page->dma_addr;
> >
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >
> >   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >   {
> > +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> >       page->dma_addr = addr;
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               page->dma_addr_upper = upper_32_bits(addr);
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 0a7c08d748b8..20c1f74fd844 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >               if (unlikely(!page))
> >                       break;
> >
> > -             if (likely(page_to_nid(page) == pref_nid)) {
> > +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >                       pool->alloc.cache[pool->alloc.count++] = page;
> >               } else {
> >                       /* NUMA mismatch;
> > @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >                                         struct page *page,
> >                                         unsigned int dma_sync_size)
> >   {
> > -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > +     dma_addr_t dma_addr;
> > +
> > +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> > +     dma_addr = page_pool_get_dma_addr(page);
> >
> >       dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   {
> >       dma_addr_t dma;
> >
> > +     if (page_is_page_pool_iov(page)) {
> > +             /* page_pool_iovs are already mapped */
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return true;
> > +     }
> > +
> >       /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >        * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >        * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   static void page_pool_set_pp_info(struct page_pool *pool,
> >                                 struct page *page)
> >   {
> > -     page->pp = pool;
> > -     page->pp_magic |= PP_SIGNATURE;
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page->pp = pool;
> > +             page->pp_magic |= PP_SIGNATURE;
> > +     } else {
> > +             page_to_page_pool_iov(page)->pp = pool;
> > +     }
> > +
> >       if (pool->p.init_callback)
> >               pool->p.init_callback(page, pool->p.init_arg);
> >   }
> >
> >   static void page_pool_clear_pp_info(struct page *page)
> >   {
> > +     if (page_is_page_pool_iov(page)) {
> > +             page_to_page_pool_iov(page)->pp = NULL;
> > +             return;
> > +     }
> > +
> >       page->pp_magic = 0;
> >       page->pp = NULL;
> >   }
> > @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >               return false;
> >       }
> >
> > -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >       pool->alloc.cache[pool->alloc.count++] = page;
> >       recycle_stat_inc(pool, cached);
> >       return true;
> > @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >        * refcnt == 1 means page_pool owns page, and can recycle it.
> >        *
> >        * page is NOT reusable when allocated when system is under
> > -      * some pressure. (page_is_pfmemalloc)
> > +      * some pressure. (page_pool_page_is_pfmemalloc)
> >        */
> > -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > +                !page_pool_page_is_pfmemalloc(page))) {
> >               /* Read barrier done in page_ref_count / READ_ONCE */
> >
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >       if (likely(page_pool_defrag_page(page, drain_count)))
> >               return NULL;
> >
> > -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > +     if (page_pool_page_ref_count(page) == 1 &&
> > +         !page_pool_page_is_pfmemalloc(page)) {
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >                       page_pool_dma_sync_for_device(pool, page, -1);
> >
> > @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >       /* Empty recycle ring */
> >       while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >               /* Verify the refcnt invariant of cached pages */
> > -             if (!(page_ref_count(page) == 1))
> > +             if (!(page_pool_page_ref_count(page) == 1))
> >                       pr_crit("%s() page_pool refcnt %d violation\n",
> > -                             __func__, page_ref_count(page));
> > +                             __func__, page_pool_page_ref_count(page));
> >
> >               page_pool_return_page(pool, page);
> >       }
> > @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >       struct page_pool *pp;
> >       bool allow_direct;
> >
> > -     page = compound_head(page);
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page = compound_head(page);
> >
> > -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > -      * in order to preserve any existing bits, such as bit 0 for the
> > -      * head page of compound page and bit 1 for pfmemalloc page, so
> > -      * mask those bits for freeing side when doing below checking,
> > -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > -      * to avoid recycling the pfmemalloc page.
> > -      */
> > -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > -             return false;
> > +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > +              * allocation in order to preserve any existing bits, such as
> > +              * bit 0 for the head page of compound page and bit 1 for
> > +              * pfmemalloc page, so mask those bits for freeing side when
> > +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > +              * checked in __page_pool_put_page() to avoid recycling the
> > +              * pfmemalloc page.
> > +              */
> > +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > +                     return false;
> >
> > -     pp = page->pp;
> > +             pp = page->pp;
> > +     } else {
> > +             pp = page_to_page_pool_iov(page)->pp;
> > +     }
> >
> >       /* Allow direct recycle if we have reasons to believe that we are
> >        * in the same context as the consumer would run, so there's
> > @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >
> >       for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >               page = hu->page[idx] + j;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), idx, j);
> > +                             page_pool_page_ref_count(page), idx, j);
> >                       return true;
> >               }
> >       }
> > @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >                       continue;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >       free = true;
> >       for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >               page = hu->page + i;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), i);
> > +                             page_pool_page_ref_count(page), i);
> >                       free = false;
> >                       break;
> >               }
> > @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >               page = hu->page + page_i;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > --
> > 2.41.0.640.ga95def55d0-goog
> >
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-22  6:05     ` Mina Almasry
@ 2023-08-22 12:24       ` Jesper Dangaard Brouer
  2023-08-22 23:33         ` Mina Almasry
  0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-22 12:24 UTC (permalink / raw)
  To: Mina Almasry, Jesper Dangaard Brouer
  Cc: brouer, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf



On 22/08/2023 08.05, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 10/08/2023 03.57, Mina Almasry wrote:
>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>
>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>> these helpers instead.
>>>
>>
>> I don't like of this approach.
>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>
>> I've not had time to run my usual benchmarks, which are here:
>>
>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>
> 
> I ported over this benchmark to my tree and ran it, my results:
> 

What CPU is this, and at what GHz? (I guess 2.6 GHz based on the results).

(It looks like this CPU is more efficient, instructions per cycle, than 
my E5-1650 v4 @ 3.60GHz).

> net-next @ b44693495af8
> https://pastebin.com/raw/JuU7UQXe
> 
> + Jakub's memory-provider APIs:
> https://pastebin.com/raw/StMBhetn
> 
> + devmem TCP changes:
> https://pastebin.com/raw/mY1L6U4r
> 

Only a single cycle slowdown for "page_pool01_fast_path", from 10
cycles to 11 cycles.

> + intentional regression just to make sure the benchmark is working:
> https://pastebin.com/raw/wqWhcJdG
> 
> I don't seem to be able to detect a regression with this series as-is,
> but I'm not that familiar with the test and may be doing something
> wrong or misinterpreting the results. Does this look ok to you?
> 

The performance results are better than I expected.  The small
regression from 10 cycles to 11 cycles is actually 10%, but I expect
with some likely/unlikely instrumentation we can "likely" remove this again.
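(To put that in time units: assuming the ~2.6 GHz guessed above, 10
cycles is roughly 3.8 ns and 11 cycles roughly 4.2 ns per call, so the
added cost is on the order of 0.4 ns.)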

So, this change actually looks acceptable from a performance PoV.
I still think this page_pool_iov is very invasive to page_pool, but
maybe it is better to hide this "ugliness" inside page_pool.

The test primarily exercises the fast-path, and you also add "if"
statements to all the DMA operations, which are not part of this
benchmark.  Perhaps we can add unlikely() annotations, or inspect
(objdump) the ASM to check that the code prioritizes the original
page-based "provider".
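
To illustrate what I mean by unlikely() annotations (an untested
sketch only, reusing the helpers this patch adds), e.g. in
page_pool_dma_map() the check could become:

	if (unlikely(page_is_page_pool_iov(page))) {
		/* page_pool_iovs are already mapped */
		DEBUG_NET_WARN_ON_ONCE(true);
		return true;
	}

and the same kind of hint in page_pool_dma_sync_for_device(), so the
compiler keeps the original page-based case as the straight-line code.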

>> But I'm sure it will affect performance.
>>

Guess, I was wrong ;-)

--Jesper


>> Regardless of performance, this approach is using ptr-LSB-bits, to hide
>> that page-pointer are not really struct-pages, feels like force feeding
>> a solution just to use the page_pool APIs.
>>
>>
>>> In areas where struct page* is dereferenced, add a check for special
>>> handling of page_pool_iov.
>>>
>>> The memory providers producing page_pool_iov can set the LSB on the
>>> struct page* returned to the page pool.
>>>
>>> Note that instead of overloading the LSB of page pointers, we can
>>> instead define a new union between struct page & struct page_pool_iov and
>>> compact it in a new type. However, we'd need to implement the code churn
>>> to modify the page_pool & drivers to use this new type. For this POC
>>> that is not implemented (feedback welcome).
>>>
>>
>> I've said before, that I prefer multiplexing on page->pp_magic.
>> For your page_pool_iov the layout would have to match the offset of
>> pp_magic, to do this. (And if insisting on using PP infra the refcnt
>> would also need to align).
>>
>> On the allocation side, all drivers already use a driver helper
>> page_pool_dev_alloc_pages() or we could add another (better named)
>> helper to multiplex between other types of allocators, e.g. a devmem
>> allocator.
>>
>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>> could multiplex on pp_magic and call another API.  The API could be an
>> extension to PP helpers, but it could also be a devmap allocator helper.
>>
>> IMHO forcing/piggy-bagging everything into page_pool is not the right
>> solution.  I really think netstack need to support different allocator
>> types. The page pool have been leading the way, yes, but perhaps it is
>> time to add an API layer that e.g. could be named netmem, that gives us
>> the multiplexing between allocators.  In that process some of page_pool
>> APIs would be lifted out as common blocks and others remain.
>>
>> --Jesper
>>
>>> I have a sample implementation of adding a new page_pool_token type
>>> in the page_pool to give a general idea here:
>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>
>>> Full branch here:
>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>
>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>
>>> Could also add static_branch to speed up the checks in page_pool_iov
>>> memory providers are being used.
>>>
>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>> ---
>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>> index 537eb36115ed..f08ca230d68e 100644
>>> --- a/include/net/page_pool.h
>>> +++ b/include/net/page_pool.h
>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>        return NULL;
>>>    }
>>>
>>> +static inline int page_pool_page_ref_count(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>> +
>>> +     return page_ref_count(page);
>>> +}
>>> +
>>> +static inline void page_pool_page_get_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     return page_ref_add(page, count);
>>> +}
>>> +
>>> +static inline void page_pool_page_put_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     if (count > 1)
>>> +             page_ref_sub(page, count - 1);
>>> +
>>> +     put_page(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return false;
>>> +
>>> +     return page_is_pfmemalloc(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>> +{
>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>> +      * checking...
>>> +      *
>>> +      * This check is only used to check for recycling memory in the page
>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>> +      * would not be able to reallocate memory from another dmabuf that
>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>> +      */
>>> +     if (page_is_page_pool_iov(page))
>>> +             return true;
>>> +
>>> +     return page_to_nid(page) == pref_nid;
>>> +}
>>> +
>>>    struct page_pool {
>>>        struct page_pool_params p;
>>>
>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>    {
>>>        long ret;
>>>
>>> +     if (page_is_page_pool_iov(page))
>>> +             return -EINVAL;
>>> +
>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>         * references to the page. No need to actually overwrite it, instead
>>>         * we can leave this to be overwritten by the calling function.
>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>
>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>    {
>>> -     dma_addr_t ret = page->dma_addr;
>>> +     dma_addr_t ret;
>>> +
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>> +
>>> +     ret = page->dma_addr;
>>>
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>
>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>    {
>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>>        page->dma_addr = addr;
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                page->dma_addr_upper = upper_32_bits(addr);
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 0a7c08d748b8..20c1f74fd844 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>                if (unlikely(!page))
>>>                        break;
>>>
>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>                } else {
>>>                        /* NUMA mismatch;
>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>                                          struct page *page,
>>>                                          unsigned int dma_sync_size)
>>>    {
>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>> +     dma_addr_t dma_addr;
>>> +
>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>
>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    {
>>>        dma_addr_t dma;
>>>
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             /* page_pool_iovs are already mapped */
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return true;
>>> +     }
>>> +
>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>                                  struct page *page)
>>>    {
>>> -     page->pp = pool;
>>> -     page->pp_magic |= PP_SIGNATURE;
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page->pp = pool;
>>> +             page->pp_magic |= PP_SIGNATURE;
>>> +     } else {
>>> +             page_to_page_pool_iov(page)->pp = pool;
>>> +     }
>>> +
>>>        if (pool->p.init_callback)
>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>    }
>>>
>>>    static void page_pool_clear_pp_info(struct page *page)
>>>    {
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>> +             return;
>>> +     }
>>> +
>>>        page->pp_magic = 0;
>>>        page->pp = NULL;
>>>    }
>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>                return false;
>>>        }
>>>
>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>        recycle_stat_inc(pool, cached);
>>>        return true;
>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>         *
>>>         * page is NOT reusable when allocated when system is under
>>> -      * some pressure. (page_is_pfmemalloc)
>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>         */
>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>                return NULL;
>>>
>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>
>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>        /* Empty recycle ring */
>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>                /* Verify the refcnt invariant of cached pages */
>>> -             if (!(page_ref_count(page) == 1))
>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>> -                             __func__, page_ref_count(page));
>>> +                             __func__, page_pool_page_ref_count(page));
>>>
>>>                page_pool_return_page(pool, page);
>>>        }
>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>        struct page_pool *pp;
>>>        bool allow_direct;
>>>
>>> -     page = compound_head(page);
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page = compound_head(page);
>>>
>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>> -      * mask those bits for freeing side when doing below checking,
>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>> -      * to avoid recycling the pfmemalloc page.
>>> -      */
>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> -             return false;
>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>> +              * allocation in order to preserve any existing bits, such as
>>> +              * bit 0 for the head page of compound page and bit 1 for
>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>> +              * pfmemalloc page.
>>> +              */
>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> +                     return false;
>>>
>>> -     pp = page->pp;
>>> +             pp = page->pp;
>>> +     } else {
>>> +             pp = page_to_page_pool_iov(page)->pp;
>>> +     }
>>>
>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>         * in the same context as the consumer would run, so there's
>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>
>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>                page = hu->page[idx] + j;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), idx, j);
>>> +                             page_pool_page_ref_count(page), idx, j);
>>>                        return true;
>>>                }
>>>        }
>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                        continue;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>        free = true;
>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>                page = hu->page + i;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), i);
>>> +                             page_pool_page_ref_count(page), i);
>>>                        free = false;
>>>                        break;
>>>                }
>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                page = hu->page + page_i;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> --
>>> 2.41.0.640.ga95def55d0-goog
>>>
>>
> 
> 
> --
> Thanks,
> Mina
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-22 12:24       ` Jesper Dangaard Brouer
@ 2023-08-22 23:33         ` Mina Almasry
  0 siblings, 0 replies; 62+ messages in thread
From: Mina Almasry @ 2023-08-22 23:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On Tue, Aug 22, 2023 at 5:24 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 22/08/2023 08.05, Mina Almasry wrote:
> > On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>
> >
> > I ported over this benchmark to my tree and ran it, my results:
> >
>
> What CPU is this, and at what GHz? (I guess 2.6 GHz based on the results).
>
> (It looks like this CPU is more efficient, instructions per cycle, than
> my E5-1650 v4 @ 3.60GHz).
>

cat /proc/cpuinfo
...
vendor_id       : GenuineIntel
cpu family      : 6
model           : 143
model name      : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping        : 8
microcode       : 0xffffffff
cpu MHz         : 2699.998

This is a vCPU on a Google Cloud A3 VM.

> > net-next @ b44693495af8
> > https://pastebin.com/raw/JuU7UQXe
> >
> > + Jakub's memory-provider APIs:
> > https://pastebin.com/raw/StMBhetn
> >
> > + devmem TCP changes:
> > https://pastebin.com/raw/mY1L6U4r
> >
>
> Only a single cycle slowdown for "page_pool01_fast_path", from 10
> cycles to 11 cycles.
>
> > + intentional regression just to make sure the benchmark is working:
> > https://pastebin.com/raw/wqWhcJdG
> >
> > I don't seem to be able to detect a regression with this series as-is,
> > but I'm not that familiar with the test and may be doing something
> > wrong or misinterpreting the results. Does this look ok to you?
> >
>
> The performance results are better than I expected.  The small
> regression from 10 cycles to 11 cycles is actually 10%, but I expect
> with some likely/unlikely instrumentation we can "likely" remove this again.
>

So the patch is already optimized carefully (I hope) to put all the
devmem processing in the default unlikely path. Willem showed me that:

if (page_pool_iov())
   return handle_page_pool_iov();

return handle_page();

With that pattern, handle_page() ends up on the 'likely' fall-through
path by default, which removes the need for explicit likely/unlikely
annotations. I'm not sure we can get better perf with explicit
likely/unlikely, but I can try.

> So, this change actually looks acceptable from a performance PoV.
> I still think this page_pool_iov is very invasive to page_pool, but
> maybe it is better to hide this "ugliness" inside page_pool.
>
> The test primarily exercises the fast-path, and you also add "if"
> statements to all the DMA operations, which are not part of this
> benchmark.  Perhaps we can add unlikely() annotations, or inspect
> (objdump) the ASM to check that the code prioritizes the original
> page-based "provider".
>
> >> But I'm sure it will affect performance.
> >>
>
> Guess, I was wrong ;-)
>
> --Jesper
>
>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >>
> >> On the allocation side, all drivers already use a driver helper
> >> page_pool_dev_alloc_pages() or we could add another (better named)
> >> helper to multiplex between other types of allocators, e.g. a devmem
> >> allocator.
> >>
> >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> >> could multiplex on pp_magic and call another API.  The API could be an
> >> extension to PP helpers, but it could also be a devmap allocator helper.
> >>
> >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> >> solution.  I really think netstack need to support different allocator
> >> types. The page pool have been leading the way, yes, but perhaps it is
> >> time to add an API layer that e.g. could be named netmem, that gives us
> >> the multiplexing between allocators.  In that process some of page_pool
> >> APIs would be lifted out as common blocks and others remain.
> >>
> >> --Jesper
> >>
> >>> I have a sample implementation of adding a new page_pool_token type
> >>> in the page_pool to give a general idea here:
> >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >>>
> >>> Full branch here:
> >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >>>
> >>> (In the branches above, page_pool_iov is called devmem_slice).
> >>>
> >>> Could also add static_branch to speed up the checks in page_pool_iov
> >>> memory providers are being used.
> >>>
> >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>> ---
> >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >>> index 537eb36115ed..f08ca230d68e 100644
> >>> --- a/include/net/page_pool.h
> >>> +++ b/include/net/page_pool.h
> >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >>>        return NULL;
> >>>    }
> >>>
> >>> +static inline int page_pool_page_ref_count(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> >>> +
> >>> +     return page_ref_count(page);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_get_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     return page_ref_add(page, count);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_put_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     if (count > 1)
> >>> +             page_ref_sub(page, count - 1);
> >>> +
> >>> +     put_page(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return false;
> >>> +
> >>> +     return page_is_pfmemalloc(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> >>> +{
> >>> +     /* Assume page_pool_iov are on the preferred node without actually
> >>> +      * checking...
> >>> +      *
> >>> +      * This check is only used to check for recycling memory in the page
> >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> >>> +      * would not be able to reallocate memory from another dmabuf that
> >>> +      * exists on the preferred node, so, this check doesn't make much sense
> >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> >>> +      */
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return true;
> >>> +
> >>> +     return page_to_nid(page) == pref_nid;
> >>> +}
> >>> +
> >>>    struct page_pool {
> >>>        struct page_pool_params p;
> >>>
> >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >>>    {
> >>>        long ret;
> >>>
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return -EINVAL;
> >>> +
> >>>        /* If nr == pp_frag_count then we have cleared all remaining
> >>>         * references to the page. No need to actually overwrite it, instead
> >>>         * we can leave this to be overwritten by the calling function.
> >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >>>
> >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>    {
> >>> -     dma_addr_t ret = page->dma_addr;
> >>> +     dma_addr_t ret;
> >>> +
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> >>> +
> >>> +     ret = page->dma_addr;
> >>>
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>
> >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >>>    {
> >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->dma_addr = addr;
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                page->dma_addr_upper = upper_32_bits(addr);
> >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>> index 0a7c08d748b8..20c1f74fd844 100644
> >>> --- a/net/core/page_pool.c
> >>> +++ b/net/core/page_pool.c
> >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >>>                if (unlikely(!page))
> >>>                        break;
> >>>
> >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> >>>                } else {
> >>>                        /* NUMA mismatch;
> >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >>>                                          struct page *page,
> >>>                                          unsigned int dma_sync_size)
> >>>    {
> >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> >>> +     dma_addr_t dma_addr;
> >>> +
> >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     dma_addr = page_pool_get_dma_addr(page);
> >>>
> >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    {
> >>>        dma_addr_t dma;
> >>>
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             /* page_pool_iovs are already mapped */
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return true;
> >>> +     }
> >>> +
> >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> >>>                                  struct page *page)
> >>>    {
> >>> -     page->pp = pool;
> >>> -     page->pp_magic |= PP_SIGNATURE;
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page->pp = pool;
> >>> +             page->pp_magic |= PP_SIGNATURE;
> >>> +     } else {
> >>> +             page_to_page_pool_iov(page)->pp = pool;
> >>> +     }
> >>> +
> >>>        if (pool->p.init_callback)
> >>>                pool->p.init_callback(page, pool->p.init_arg);
> >>>    }
> >>>
> >>>    static void page_pool_clear_pp_info(struct page *page)
> >>>    {
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             page_to_page_pool_iov(page)->pp = NULL;
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->pp_magic = 0;
> >>>        page->pp = NULL;
> >>>    }
> >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >>>                return false;
> >>>        }
> >>>
> >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >>>        pool->alloc.cache[pool->alloc.count++] = page;
> >>>        recycle_stat_inc(pool, cached);
> >>>        return true;
> >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> >>>         *
> >>>         * page is NOT reusable when allocated when system is under
> >>> -      * some pressure. (page_is_pfmemalloc)
> >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> >>>         */
> >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> >>> +                !page_pool_page_is_pfmemalloc(page))) {
> >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> >>>
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> >>>                return NULL;
> >>>
> >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> >>> +     if (page_pool_page_ref_count(page) == 1 &&
> >>> +         !page_pool_page_is_pfmemalloc(page)) {
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> >>>
> >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >>>        /* Empty recycle ring */
> >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >>>                /* Verify the refcnt invariant of cached pages */
> >>> -             if (!(page_ref_count(page) == 1))
> >>> +             if (!(page_pool_page_ref_count(page) == 1))
> >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> >>> -                             __func__, page_ref_count(page));
> >>> +                             __func__, page_pool_page_ref_count(page));
> >>>
> >>>                page_pool_return_page(pool, page);
> >>>        }
> >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >>>        struct page_pool *pp;
> >>>        bool allow_direct;
> >>>
> >>> -     page = compound_head(page);
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page = compound_head(page);
> >>>
> >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >>> -      * in order to preserve any existing bits, such as bit 0 for the
> >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> >>> -      * mask those bits for freeing side when doing below checking,
> >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >>> -      * to avoid recycling the pfmemalloc page.
> >>> -      */
> >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> -             return false;
> >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> >>> +              * allocation in order to preserve any existing bits, such as
> >>> +              * bit 0 for the head page of compound page and bit 1 for
> >>> +              * pfmemalloc page, so mask those bits for freeing side when
> >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> >>> +              * checked in __page_pool_put_page() to avoid recycling the
> >>> +              * pfmemalloc page.
> >>> +              */
> >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> +                     return false;
> >>>
> >>> -     pp = page->pp;
> >>> +             pp = page->pp;
> >>> +     } else {
> >>> +             pp = page_to_page_pool_iov(page)->pp;
> >>> +     }
> >>>
> >>>        /* Allow direct recycle if we have reasons to believe that we are
> >>>         * in the same context as the consumer would run, so there's
> >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >>>
> >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >>>                page = hu->page[idx] + j;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), idx, j);
> >>> +                             page_pool_page_ref_count(page), idx, j);
> >>>                        return true;
> >>>                }
> >>>        }
> >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                        continue;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >>>        free = true;
> >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >>>                page = hu->page + i;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), i);
> >>> +                             page_pool_page_ref_count(page), i);
> >>>                        free = false;
> >>>                        break;
> >>>                }
> >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                page = hu->page + page_i;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> --
> >>> 2.41.0.640.ga95def55d0-goog
> >>>
> >>
> >
> >
> > --
> > Thanks,
> > Mina
> >
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-17 22:18     ` Mina Almasry
@ 2023-08-23 22:52       ` David Wei
  2023-08-24  3:35         ` David Ahern
  0 siblings, 1 reply; 62+ messages in thread
From: David Wei @ 2023-08-23 22:52 UTC (permalink / raw)
  To: Mina Almasry, Pavel Begunkov
  Cc: David Ahern, netdev, linux-media, dri-devel, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Sumit Semwal, Christian König,
	Jason Gunthorpe, Hari Ramakrishnan, Dan Williams,
	Andy Lutomirski, stephen, sdf

On 17/08/2023 15:18, Mina Almasry wrote:
> On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 8/14/23 02:12, David Ahern wrote:
>>> On 8/9/23 7:57 PM, Mina Almasry wrote:
>>>> Changes in RFC v2:
>>>> ------------------
>> ...
>>>> ** Test Setup
>>>>
>>>> Kernel: net-next with this RFC and memory provider API cherry-picked
>>>> locally.
>>>>
>>>> Hardware: Google Cloud A3 VMs.
>>>>
>>>> NIC: GVE with header split & RSS & flow steering support.
>>>
>>> This set seems to depend on Jakub's memory provider patches and a netdev
>>> driver change which is not included. For the testing mentioned here, you
>>> must have a tree + branch with all of the patches. Is it publicly available?
>>>
>>> It would be interesting to see how well (easy) this integrates with
>>> io_uring. Besides avoiding all of the syscalls for receiving the iov and
>>> releasing the buffers back to the pool, io_uring also brings in the
>>> ability to seed a page_pool with registered buffers which provides a
>>> means to get simpler Rx ZC for host memory.
>>
>> The patchset sounds pretty interesting. I've been working with David Wei
>> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old
>> similar approaches based on allocating an rx queue. It targets host
>> memory and device memory as an extra feature, uapi is different, lifetimes
>> are managed/bound to io_uring. Completions/buffers are returned to user via
>> a separate queue instead of cmsg, and pushed back granularly to the kernel
>> via another queue. I'll leave it to David to elaborate
>>
>> It sounds like we have space for collaboration here, if not merging then
>> reusing internals as much as we can, but we'd need to look into the
>> details deeper.
>>
> 
> I'm happy to look at your implementation and collaborate on something
> that works for both use cases. Feel free to share unpolished prototype
> so I can start having a general idea if possible.

Hi, I'm David and I am working with Pavel on this. We will have something to
share with you on the mailing list before the end of the week.

I'm also preparing a submission for NetDev conf. Do you and others at
Google plan to present there as well? If so, we may want to coordinate our
submissions and talks (if accepted).

Please let me know this week, thanks!

> 
>>> Overall I like the intent and possibilities for extensions, but a lot of
>>> details are missing - perhaps some are answered by seeing an end-to-end
>>> implementation.
>>
>> --
>> Pavel Begunkov
> 
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 00/11] Device Memory TCP
  2023-08-23 22:52       ` David Wei
@ 2023-08-24  3:35         ` David Ahern
  0 siblings, 0 replies; 62+ messages in thread
From: David Ahern @ 2023-08-24  3:35 UTC (permalink / raw)
  To: David Wei, Mina Almasry, Pavel Begunkov
  Cc: netdev, linux-media, dri-devel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf

On 8/23/23 3:52 PM, David Wei wrote:
> I'm also preparing a submission for NetDev conf. I wonder if you and others at
> Google plan to present there as well? If so, then we may want to coordinate our
> submissions and talks (if accepted).

Personally, I see them as related but separate topics: Mina's proposal
is infra that io_uring builds on. Both are interesting and needed
discussions.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
  2023-08-13 11:26   ` Leon Romanovsky
  2023-08-14  1:10   ` David Ahern
@ 2023-08-30 12:38   ` Yunsheng Lin
  2023-09-08  0:47   ` David Wei
  3 siblings, 0 replies; 62+ messages in thread
From: Yunsheng Lin @ 2023-08-30 12:38 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On 2023/8/10 9:57, Mina Almasry wrote:

...

> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> +				u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct netdev_rx_queue *rxq;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	u32 xa_idx;
> +	int err;
> +
> +	rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> +	if (rxq->binding)
> +		return -EEXIST;
> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);

I guess the 'struct page_pool_iov' is the metadata for each chunk, just
like the 'struct page' is for each page?
Do we want to make it cache-line aligned as 'struct page' is, so that
there is no cache-line bouncing between different ppiovs when freeing
and allocating?

And we may be able to get rid of the gen_pool if we add more fields
to 'struct page_pool_iov' to store the dma address, the binding, etc.,
as devmem is not usable by subsystems other than the net stack, and we
could use a big page_pool->ring as the backing for all the devmem
chunks to replace the gen_pool; at least for the current
implementation, dmabuf:page_pool/queue is 1:1. If we want to have the
same dmabuf shared by different page_pools, it seems better to
implement that in the page_pool core instead of a specific provider,
so that other providers or the native page pool can make use of it
too.
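
For the ppiov itself, maybe something like the below, also making it
cache-line aligned as asked above (a rough sketch only; the field
names are illustrative, based on what this patch and patch 2 add, not
a concrete layout):

struct page_pool_iov {
	struct page_pool *pp;
	refcount_t refcount;

	/* stored directly instead of deriving them through the
	 * gen_pool chunk owner
	 */
	dma_addr_t dma_addr;
	struct netdev_dmabuf_binding *binding;
} ____cacheline_aligned_in_smp;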

As far as this goes, I am not sure whether it is possible and
reasonable to reuse the 'struct page' layout used by normal memory as
much as possible, and add some devmem-specific union fields the way
the page pool does, so that we can have more common handling between
devmem and normal memory.

I think we may need to split the metadata currently used by the page
pool out of 'struct page', something like what the netmem patch set
does, as willy was trying to avoid more users accessing 'struct page'
directly rather than adding more direct users to it:

https://lore.kernel.org/linux-mm/20230111042214.907030-1-willy@infradead.org/

> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;
> +		virtual += len;
> +	}
> +
> +	/* TODO: need to be able to bind to multiple rx queues on the same
> +	 * netdevice. The code should already be able to handle that with
> +	 * minimal changes, but the netlink API currently allows for 1 rx
> +	 * queue.
> +	 */
> +	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		       GFP_KERNEL);
> +	if (err)
> +		goto err_free_chunks;
> +
> +	rxq->binding = binding;
> +	*out = binding;
> +
> +	return 0;
> +
> +err_free_chunks:
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +	gen_pool_destroy(binding->chunk_pool);
> +err_unmap:
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +err_detach:
> +	dma_buf_detach(dmabuf, binding->attachment);
> +err_free_binding:
> +	kfree(binding);
> +err_put_dmabuf:
> +	dma_buf_put(dmabuf);
> +	return err;
> +}
> +


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
  2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
                     ` (2 preceding siblings ...)
  2023-08-30 12:38   ` Yunsheng Lin
@ 2023-09-08  0:47   ` David Wei
  3 siblings, 0 replies; 62+ messages in thread
From: David Wei @ 2023-09-08  0:47 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Sumit Semwal,
	Christian König, Jason Gunthorpe, Hari Ramakrishnan,
	Dan Williams, Andy Lutomirski, stephen, sdf, Willem de Bruijn,
	Kaiyuan Zhang

On 09/08/2023 18:57, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> an rx queue on the netdevice. At bind time, dma_buf_attach
> & dma_buf_map_attachment are called. The entries in the sg_table
> returned from the mapping are inserted into a genpool to make them
> ready for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_genpool_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We set up the page_pool_iov allocation size in the
> genpool as PAGE_SIZE for simplicity, to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leave an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool. This issue/weirdness is highlighted in the memory provider
> proposal[1], and I'm hoping that some generic solution for all
> memory providers will be discussed; this patch doesn't address that
> weirdness again.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> [1] https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>  include/linux/netdevice.h |  57 ++++++++++++
>  include/net/page_pool.h   |  27 ++++++
>  net/core/dev.c            | 178 ++++++++++++++++++++++++++++++++++++++
>  net/core/netdev-genl.c    | 101 ++++++++++++++++++++-
>  4 files changed, 361 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3800d0479698..1b7c5966d2ca 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -53,6 +53,8 @@
>  #include <net/net_trackers.h>
>  #include <net/net_debug.h>
>  #include <net/dropreason-core.h>
> +#include <linux/xarray.h>
> +#include <linux/refcount.h>
> 
>  struct netpoll_info;
>  struct device;
> @@ -782,6 +784,55 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
>  #endif
>  #endif /* CONFIG_RPS */
> 
> +struct netdev_dmabuf_binding {
> +	struct dma_buf *dmabuf;
> +	struct dma_buf_attachment *attachment;
> +	struct sg_table *sgt;
> +	struct net_device *dev;
> +	struct gen_pool *chunk_pool;
> +
> +	/* The user holds a ref (via the netlink API) for as long as they want
> +	 * the binding to remain alive. Each page pool using this binding holds
> +	 * a ref to keep the binding alive. Each allocated page_pool_iov holds a
> +	 * ref.
> +	 *
> +	 * The binding undos itself and unmaps the underlying dmabuf once all
> +	 * those refs are dropped and the binding is no longer desired or in
> +	 * use.
> +	 */
> +	refcount_t ref;
> +
> +	/* The portid of the user that owns this binding. Used for netlink to
> +	 * notify us of the user dropping the bind.
> +	 */
> +	u32 owner_nlportid;
> +
> +	/* The list of bindings currently active. Used for netlink to notify us
> +	 * of the user dropping the bind.
> +	 */
> +	struct list_head list;
> +
> +	/* rxq's this binding is active on. */
> +	struct xarray bound_rxq_list;
> +};
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> +	refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> +	if (!refcount_dec_and_test(&binding->ref))
> +		return;
> +
> +	__netdev_devmem_binding_free(binding);
> +}
> +
>  /* This structure contains an instance of an RX queue. */
>  struct netdev_rx_queue {
>  	struct xdp_rxq_info		xdp_rxq;
> @@ -796,6 +847,7 @@ struct netdev_rx_queue {
>  #ifdef CONFIG_XDP_SOCKETS
>  	struct xsk_buff_pool            *pool;
>  #endif
> +	struct netdev_dmabuf_binding	*binding;
>  } ____cacheline_aligned_in_smp;
> 
>  /*
> @@ -5026,6 +5078,11 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs);
>  void netif_inherit_tso_max(struct net_device *to,
>  			   const struct net_device *from);
> 
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> +				u32 rxq_idx,
> +				struct netdev_dmabuf_binding **out);
> +
>  static inline bool netif_is_macsec(const struct net_device *dev)
>  {
>  	return dev->priv_flags & IFF_MACSEC;
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 364fe6924258..61b2066d32b5 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -170,6 +170,33 @@ extern const struct pp_memory_provider_ops hugesp_ops;
>  extern const struct pp_memory_provider_ops huge_ops;
>  extern const struct pp_memory_provider_ops huge_1g_ops;
> 
> +/* page_pool_iov support */
> +
> +/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
> + * entry from the dmabuf is inserted into the genpool as a chunk, and needs
> + * this owner struct to keep track of some metadata necessary to create
> + * allocations from this chunk.
> + */
> +struct dmabuf_genpool_chunk_owner {
> +	/* Offset into the dma-buf where this chunk starts.  */
> +	unsigned long base_virtual;
> +
> +	/* dma_addr of the start of the chunk.  */
> +	dma_addr_t base_dma_addr;
> +
> +	/* Array of page_pool_iovs for this chunk. */
> +	struct page_pool_iov *ppiovs;
> +	size_t num_ppiovs;
> +
> +	struct netdev_dmabuf_binding *binding;
> +};
> +
> +struct page_pool_iov {
> +	struct dmabuf_genpool_chunk_owner *owner;
> +
> +	refcount_t refcount;
> +};
> +

Hi Mina, we're working on ZC RX to host memory [1] and have a similar need for
a page pool memory provider (backed by userspace host memory instead of GPU
memory) that hands out generic page_pool_iov buffers. The main difference at
the moment is that we hold a page pointer and its dma_addr_t directly, as we
are backed by real pages; we still need a refcount for lifetime management. See struct
io_zc_rx_buf in [2].

It would be great to align on page_pool_iov such that it will work for both of
us. Would a union work here?

[1] https://lore.kernel.org/io-uring/20230826011954.1801099-1-dw@davidwei.uk/
[2] https://lore.kernel.org/io-uring/20230826011954.1801099-6-dw@davidwei.uk/
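
To make the question concrete, the below is the kind of union I have in mind.
It is purely illustrative, not a proposed final layout: the devmem side reuses
the chunk owner from this patch, while the host-memory side mirrors the fields
of io_zc_rx_buf in [2].

struct page_pool_iov {
	union {
		/* dmabuf devmem (this series): the chunk owner carries the
		 * base dma address, the offset and the netdev binding.
		 */
		struct dmabuf_genpool_chunk_owner *owner;

		/* host-memory ZC RX (io_uring series): backed by a real,
		 * premapped page.
		 */
		struct {
			struct page *page;
			dma_addr_t dma_addr;
		};
	};

	refcount_t refcount;
};

Which union member is live could be implied by the memory provider that owns
the page_pool, so the fast path would not need an extra tag field.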

>  struct page_pool {
>  	struct page_pool_params p;
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8e7d0cb540cd..02a25ccf771a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -151,6 +151,8 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/prandom.h>
>  #include <linux/once_lite.h>
> +#include <linux/genalloc.h>
> +#include <linux/dma-buf.h>
> 
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
>  	return call_netdevice_notifiers_info(val, &info.info);
>  }
> 
> +/* Device memory support */
> +
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +					   struct gen_pool_chunk *chunk,
> +					   void *not_used)
> +{
> +	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> +	kvfree(owner->ppiovs);
> +	kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +	size_t size, avail;
> +
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +
> +	size = gen_pool_size(binding->chunk_pool);
> +	avail = gen_pool_avail(binding->chunk_pool);
> +
> +	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +		  size, avail))
> +		gen_pool_destroy(binding->chunk_pool);
> +
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +	dma_buf_detach(binding->dmabuf, binding->attachment);
> +	dma_buf_put(binding->dmabuf);
> +	kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *rxq;
> +	unsigned long xa_idx;
> +
> +	list_del_rcu(&binding->list);
> +
> +	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> +		if (rxq->binding == binding)
> +			rxq->binding = NULL;
> +
> +	netdev_devmem_binding_put(binding);
> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> +				u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct netdev_rx_queue *rxq;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	u32 xa_idx;
> +	int err;
> +
> +	rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> +	if (rxq->binding)
> +		return -EEXIST;
> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);
> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;
> +		virtual += len;
> +	}
> +
> +	/* TODO: need to be able to bind to multiple rx queues on the same
> +	 * netdevice. The code should already be able to handle that with
> +	 * minimal changes, but the netlink API currently allows for 1 rx
> +	 * queue.
> +	 */
> +	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		       GFP_KERNEL);
> +	if (err)
> +		goto err_free_chunks;
> +
> +	rxq->binding = binding;
> +	*out = binding;
> +
> +	return 0;
> +
> +err_free_chunks:
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +	gen_pool_destroy(binding->chunk_pool);
> +err_unmap:
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +err_detach:
> +	dma_buf_detach(dmabuf, binding->attachment);
> +err_free_binding:
> +	kfree(binding);
> +err_put_dmabuf:
> +	dma_buf_put(dmabuf);
> +	return err;
> +}
> +
>  #ifdef CONFIG_NET_INGRESS
>  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
> 
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index bf7324dd6c36..288ed0112995 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -141,10 +141,74 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>  	return skb->len;
>  }
> 
> -/* Stub */
> +static LIST_HEAD(netdev_rbinding_list);
> +
>  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
>  {
> -	return 0;
> +	struct netdev_dmabuf_binding *out_binding;
> +	u32 ifindex, dmabuf_fd, rxq_idx;
> +	struct net_device *netdev;
> +	struct sk_buff *rsp;
> +	int err = 0;
> +	void *hdr;
> +
> +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUE_IDX))
> +		return -EINVAL;
> +
> +	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> +	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
> +	rxq_idx = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_QUEUE_IDX]);
> +
> +	rtnl_lock();
> +
> +	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
> +	if (!netdev) {
> +		err = -ENODEV;
> +		goto err_unlock;
> +	}
> +
> +	if (rxq_idx >= netdev->num_rx_queues) {
> +		err = -ERANGE;
> +		goto err_unlock;
> +	}
> +
> +	if (netdev_bind_dmabuf_to_queue(netdev, dmabuf_fd, rxq_idx,
> +					&out_binding)) {
> +		err = -EINVAL;
> +		goto err_unlock;
> +	}
> +
> +	out_binding->owner_nlportid = info->snd_portid;
> +	list_add_rcu(&out_binding->list, &netdev_rbinding_list);
> +
> +	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!rsp) {
> +		err = -ENOMEM;
> +		goto err_unbind;
> +	}
> +
> +	hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,
> +			  &netdev_nl_family, 0, info->genlhdr->cmd);
> +	if (!hdr) {
> +		err = -EMSGSIZE;
> +		goto err_genlmsg_free;
> +	}
> +
> +	genlmsg_end(rsp, hdr);
> +
> +	rtnl_unlock();
> +
> +	return genlmsg_reply(rsp, info);
> +
> +err_genlmsg_free:
> +	nlmsg_free(rsp);
> +err_unbind:
> +	netdev_unbind_dmabuf_to_queue(out_binding);
> +err_unlock:
> +	rtnl_unlock();
> +	return err;
>  }
> 
>  static int netdev_genl_netdevice_event(struct notifier_block *nb,
> @@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
> 
> +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> +				 void *_notify)
> +{
> +	struct netlink_notify *notify = _notify;
> +	struct netdev_dmabuf_binding *rbinding;
> +
> +	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> +		return NOTIFY_DONE;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> +		if (rbinding->owner_nlportid == notify->portid) {
> +			netdev_unbind_dmabuf_to_queue(rbinding);
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return NOTIFY_OK;
> +}
> +
>  static struct notifier_block netdev_genl_nb = {
>  	.notifier_call	= netdev_genl_netdevice_event,
>  };
> 
> +static struct notifier_block netdev_netlink_notifier = {
> +	.notifier_call = netdev_netlink_notify,
> +};
> +
>  static int __init netdev_genl_init(void)
>  {
>  	int err;
> @@ -183,8 +274,14 @@ static int __init netdev_genl_init(void)
>  	if (err)
>  		goto err_unreg_ntf;
> 
> +	err = netlink_register_notifier(&netdev_netlink_notifier);
> +	if (err)
> +		goto err_unreg_family;
> +
>  	return 0;
> 
> +err_unreg_family:
> +	genl_unregister_family(&netdev_nl_family);
>  err_unreg_ntf:
>  	unregister_netdevice_notifier(&netdev_genl_nb);
>  	return err;
> --
> 2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/11] page-pool: add device memory support
  2023-08-19 20:24         ` Mina Almasry
  2023-08-19 20:27           ` Mina Almasry
@ 2023-09-08  2:32           ` David Wei
  1 sibling, 0 replies; 62+ messages in thread
From: David Wei @ 2023-09-08  2:32 UTC (permalink / raw)
  To: Mina Almasry, Jesper Dangaard Brouer
  Cc: Willem de Bruijn, brouer, netdev, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Sumit Semwal, Christian König, Jason Gunthorpe,
	Hari Ramakrishnan, Dan Williams, Andy Lutomirski, stephen, sdf

On 19/08/2023 13:24, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 19/08/2023 16.08, Willem de Bruijn wrote:
>>> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
>>> <jbrouer@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 10/08/2023 03.57, Mina Almasry wrote:
>>>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>>>
>>>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>>>> these helpers instead.
>>>>>
>>>>
>>>> I don't like this approach.
>>>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>>>
>>>> I've not had time to run my usual benchmarks, which are here:
>>>>
>>>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>>>
> 
> Thank you for linking that, I'll try to run these against the next submission.
> 
>>>> But I'm sure it will affect performance.
>>>>
>>>> Regardless of performance, this approach uses pointer LSB bits to hide
>>>> that the page pointers are not really struct pages, which feels like
>>>> force-feeding a solution just to reuse the page_pool APIs.
>>>>
>>>>
>>>>> In areas where struct page* is dereferenced, add a check for special
>>>>> handling of page_pool_iov.
>>>>>
>>>>> The memory providers producing page_pool_iov can set the LSB on the
>>>>> struct page* returned to the page pool.
>>>>>
>>>>> Note that instead of overloading the LSB of page pointers, we can
>>>>> instead define a new union between struct page & struct page_pool_iov and
>>>>> compact it in a new type. However, we'd need to implement the code churn
>>>>> to modify the page_pool & drivers to use this new type. For this POC
>>>>> that is not implemented (feedback welcome).
>>>>>
>>>>
>>>> I've said before that I prefer multiplexing on page->pp_magic.
>>>> For your page_pool_iov, the layout would have to match the offset of
>>>> pp_magic to do this. (And if insisting on using PP infra, the refcnt
>>>> would also need to align.)
>>>
>>> Perhaps I misunderstand, but this suggests continuing to use
>>> struct page to demultiplex the memory type?
>>>
>>
>> (Perhaps we are misunderstanding each other and my use of the words
>> multiplexing and demultiplexing is wrong; I'm sorry, as English isn't my
>> native language.)
>>
>> I do see the problem of depending on having a struct page, as the
>> page_pool_iov isn't related to struct page.  Having "page" in the name
>> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
>> as we all know).
>>
>> To support more allocator types, perhaps the skb->pp_recycle bit needs to
>> grow another bit (and be renamed skb->recycle), so we can tell allocator
>> types apart: those that are page based and those that are not.
>>
>>
>>> I think the feedback has been strong to not multiplex yet another
>>> memory type into that struct that is not a real page, which is why
>>> we went in this direction. This latest series limits the impact largely
>>> to networking structures and code.
>>>
>>
>> Somewhat related to what I'm objecting to: the "page_pool_iov" is not a
>> real page, but it is getting recycled into something called "page_pool",
>> which, funnily enough, deals with struct pages internally and depends on
>> the struct-page refcnt.
>>
>> Given that the approach moved away from using struct page, I also don't
>> see the connection with the page_pool. Sorry.
>>
>>> One way or another, there will be a branch and multiplexing. Whether
>>> that is in struct page, the page pool or a new netdev mem type as you
>>> propose.
>>>
>>
>> I'm asking to have this branching/multiplexing done at the call sites.
>>
>> (IMHO not changing the drivers is a pipe dream.)
>>
> 
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
> 
> driver ---> new_layer (does multiplexing) ---> page_pool ---> mm layer
>                                           \---> devmem_pool ---> dma-buf layer
> 
> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
> 
> driver --> page_pool ---> memory_provider (does multiplexing) ---> basic_provider ---> mm layer
>                                                             \---> devmem_provider ---> dma-buf layer
> 
> That is the approach in this RFC.
> 
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
> 
> I think we need to converge on a path between the 2 approaches (or
> maybe a 3rd approach). For me the pros/cons of each approach are
> (please add):
> 
> multiplexing in new_layer:
> - Pro: maybe better for performance? Not sure if static_branch can
> achieve the same perf. I can verify with Jesper's perf tests.
> - Pro: doesn't add complexity in the page_pool (but adds complexity in
> terms of adding new pools like devmem_pool)
> - Con: the devmem_pool & page_pool will end up being duplicated code,
> I suspect, because they largely do similar things (both need to
> recycle memory for example).

Hi all, for our ZC RX into userspace host memory implementation, we opted for
the 1st approach [1] that was initially suggested by Kuba. The multiplexing was
added in a thin shim layer that we named "data_pool", which is functionally
identical to devmem_pool.

Seeing Mina's work in this patchset, I would also prefer the 2nd approach, i.e.
multiplexing via a memory_provider within the page_pool. In addition to the
pros/cons Mina mentioned, in our case the pages are actual pages, so it makes
sense to me to keep them inside the page_pool.

[1]: https://lore.kernel.org/io-uring/20230826011954.1801099-11-dw@davidwei.uk/
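
To make the 2nd approach a bit more concrete, the below is a rough sketch of
how the new checks could be hidden behind a static_branch, as raised earlier
in the thread. The key name and the provider init hook are made up here for
illustration; the page_pool_iov helpers are the ones added in this patch:

/* Defined once in net/core/page_pool.c in a real implementation. Enabled
 * only while a non-page memory provider is attached to some pool, so the
 * existing page-only fast path is unchanged otherwise.
 */
DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);

static inline int page_pool_page_ref_count(struct page *page)
{
	if (static_branch_unlikely(&page_pool_mem_providers) &&
	    page_is_page_pool_iov(page))
		return page_pool_iov_refcount(page_to_page_pool_iov(page));

	return page_ref_count(page);
}

/* Flipped on from the devmem provider's init path, e.g.: */
static int mp_dmabuf_devmem_init(struct page_pool *pool)
{
	static_branch_inc(&page_pool_mem_providers);
	return 0;
}

With something like this, the XDP_DROP fast path that Jesper measures would
only pay for the extra branch while a devmem (or other iov) provider is
actually in use.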

> 
> multiplexing via memory_provider:
> - Pro: no code duplication.
> - Pro: fewer changes to the drivers, I think. The drivers can continue
> to use the page_pool API, no need to introduce calls to 'new_layer'.
> - Con: adds complexity to the page_pool (needs to handle devmem).
> - Con: probably careful handling via static_branch needs to be done to
> achieve performance.
> 
> [1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/
> 
>>> Any regression in page pool can be avoided in the common case that
>>> does not use device mem by placing that behind a static_branch. Would
>>> that address your performance concerns?
>>>
>>
>> No. This will not help.
>>
>> The problem is that everywhere in the page_pool code it is getting
>> polluted with:
>>
>>    if (page_is_page_pool_iov(page))
>>      call-some-iov-func-instead()
>>
>> Like: the very central piece of getting the refcnt:
>>
>> +static inline int page_pool_page_ref_count(struct page *page)
>> +{
>> +       if (page_is_page_pool_iov(page))
>> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
>> +
>> +       return page_ref_count(page);
>> +}
>>
>>
>> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
>> around 14 cycles (tsc). Thus, any extra code in this code path will
>> change the fast-path.
>>
>>
>>>>
>>>> On the allocation side, all drivers already use the driver helper
>>>> page_pool_dev_alloc_pages(), or we could add another (better named)
>>>> helper to multiplex between other types of allocators, e.g. a devmem
>>>> allocator.
>>>>
>>>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>>>> could multiplex on pp_magic and call another API.  The API could be an
>>>> extension to the PP helpers, but it could also be a devmem allocator helper.
>>>>
>>>> IMHO forcing/piggy-backing everything into page_pool is not the right
>>>> solution.  I really think the netstack needs to support different allocator
>>>> types.
>>>
>>> To me this is lifting page_pool into such a netstack allocator pool.
>>>
>>
>> This should be renamed, as it is no longer dealing with pages.
>>
>>> Not sure adding another explicit layer of indirection would be cleaner
>>> or faster (potentially more indirect calls).
>>>
>>
>> It seems we are talking past each other.  The layer of indirection I'm
>> talking about is likely a simple header file (e.g. named netmem.h) that
>> will get inline-compiled so there is no overhead. It will be used by the
>> drivers, such that we can avoid touching the drivers again when introducing
>> new memory allocator types.
>>
>>
>>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>>> with a new type and helper functions.
>>>
>>
>> Says the lazy programmer :-P ... sorry could not resist ;-)
>>
>>>
>>>
>>>> The page pool has been leading the way, yes, but perhaps it is
>>>> time to add an API layer, e.g. named netmem, that gives us
>>>> the multiplexing between allocators.  In that process some of the page_pool
>>>> APIs would be lifted out as common blocks and others would remain.
>>>>
>>>> --Jesper
>>>>
>>>>> I have a sample implementation of adding a new page_pool_token type
>>>>> in the page_pool to give a general idea here:
>>>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>>>
>>>>> Full branch here:
>>>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>>>
>>>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>>>
>>>>> Could also add a static_branch to speed up the checks in case page_pool_iov
>>>>> memory providers are being used.
>>>>>
>>>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>>>> ---
>>>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>>>> index 537eb36115ed..f08ca230d68e 100644
>>>>> --- a/include/net/page_pool.h
>>>>> +++ b/include/net/page_pool.h
>>>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>>>        return NULL;
>>>>>    }
>>>>>
>>>>> +static inline int page_pool_page_ref_count(struct page *page)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>>>> +
>>>>> +     return page_ref_count(page);
>>>>> +}
>>>>> +
>>>>> +static inline void page_pool_page_get_many(struct page *page,
>>>>> +                                        unsigned int count)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>>>> +                                           count);
>>>>> +
>>>>> +     return page_ref_add(page, count);
>>>>> +}
>>>>> +
>>>>> +static inline void page_pool_page_put_many(struct page *page,
>>>>> +                                        unsigned int count)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>>>> +                                           count);
>>>>> +
>>>>> +     if (count > 1)
>>>>> +             page_ref_sub(page, count - 1);
>>>>> +
>>>>> +     put_page(page);
>>>>> +}
>>>>> +
>>>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return false;
>>>>> +
>>>>> +     return page_is_pfmemalloc(page);
>>>>> +}
>>>>> +
>>>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>>>> +{
>>>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>>>> +      * checking...
>>>>> +      *
>>>>> +      * This check is only used to check for recycling memory in the page
>>>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>>>> +      * would not be able to reallocate memory from another dmabuf that
>>>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>>>> +      */
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return true;
>>>>> +
>>>>> +     return page_to_nid(page) == pref_nid;
>>>>> +}
>>>>> +
>>>>>    struct page_pool {
>>>>>        struct page_pool_params p;
>>>>>
>>>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>>>    {
>>>>>        long ret;
>>>>>
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return -EINVAL;
>>>>> +
>>>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>>>         * references to the page. No need to actually overwrite it, instead
>>>>>         * we can leave this to be overwritten by the calling function.
>>>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>>>
>>>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>>>    {
>>>>> -     dma_addr_t ret = page->dma_addr;
>>>>> +     dma_addr_t ret;
>>>>> +
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>>>> +
>>>>> +     ret = page->dma_addr;
>>>>>
>>>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>>>
>>>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>>>    {
>>>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>>        page->dma_addr = addr;
>>>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>>>                page->dma_addr_upper = upper_32_bits(addr);
>>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>>> index 0a7c08d748b8..20c1f74fd844 100644
>>>>> --- a/net/core/page_pool.c
>>>>> +++ b/net/core/page_pool.c
>>>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>>>                if (unlikely(!page))
>>>>>                        break;
>>>>>
>>>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>>>                } else {
>>>>>                        /* NUMA mismatch;
>>>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>>>                                          struct page *page,
>>>>>                                          unsigned int dma_sync_size)
>>>>>    {
>>>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>>>> +     dma_addr_t dma_addr;
>>>>> +
>>>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>>>
>>>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>>    {
>>>>>        dma_addr_t dma;
>>>>>
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             /* page_pool_iovs are already mapped */
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return true;
>>>>> +     }
>>>>> +
>>>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>>>                                  struct page *page)
>>>>>    {
>>>>> -     page->pp = pool;
>>>>> -     page->pp_magic |= PP_SIGNATURE;
>>>>> +     if (!page_is_page_pool_iov(page)) {
>>>>> +             page->pp = pool;
>>>>> +             page->pp_magic |= PP_SIGNATURE;
>>>>> +     } else {
>>>>> +             page_to_page_pool_iov(page)->pp = pool;
>>>>> +     }
>>>>> +
>>>>>        if (pool->p.init_callback)
>>>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>>>    }
>>>>>
>>>>>    static void page_pool_clear_pp_info(struct page *page)
>>>>>    {
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>>        page->pp_magic = 0;
>>>>>        page->pp = NULL;
>>>>>    }
>>>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>>>                return false;
>>>>>        }
>>>>>
>>>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>>>        recycle_stat_inc(pool, cached);
>>>>>        return true;
>>>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>>>         *
>>>>>         * page is NOT reusable when allocated when system is under
>>>>> -      * some pressure. (page_is_pfmemalloc)
>>>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>>>         */
>>>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>>>
>>>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>>>                return NULL;
>>>>>
>>>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>>>
>>>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>>>        /* Empty recycle ring */
>>>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>>>                /* Verify the refcnt invariant of cached pages */
>>>>> -             if (!(page_ref_count(page) == 1))
>>>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>>>> -                             __func__, page_ref_count(page));
>>>>> +                             __func__, page_pool_page_ref_count(page));
>>>>>
>>>>>                page_pool_return_page(pool, page);
>>>>>        }
>>>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>>>        struct page_pool *pp;
>>>>>        bool allow_direct;
>>>>>
>>>>> -     page = compound_head(page);
>>>>> +     if (!page_is_page_pool_iov(page)) {
>>>>> +             page = compound_head(page);
>>>>>
>>>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>>>> -      * mask those bits for freeing side when doing below checking,
>>>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>>> -      * to avoid recycling the pfmemalloc page.
>>>>> -      */
>>>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>>>> -             return false;
>>>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>>>> +              * allocation in order to preserve any existing bits, such as
>>>>> +              * bit 0 for the head page of compound page and bit 1 for
>>>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>>>> +              * pfmemalloc page.
>>>>> +              */
>>>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>>>> +                     return false;
>>>>>
>>>>> -     pp = page->pp;
>>>>> +             pp = page->pp;
>>>>> +     } else {
>>>>> +             pp = page_to_page_pool_iov(page)->pp;
>>>>> +     }
>>>>>
>>>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>>>         * in the same context as the consumer would run, so there's
>>>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>>>
>>>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>>>                page = hu->page[idx] + j;
>>>>> -             if (page_ref_count(page) != 1) {
>>>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>>>> -                             page_ref_count(page), idx, j);
>>>>> +                             page_pool_page_ref_count(page), idx, j);
>>>>>                        return true;
>>>>>                }
>>>>>        }
>>>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>>>                        continue;
>>>>>
>>>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>>>> -                 page_ref_count(page) != 1) {
>>>>> +                 page_pool_page_ref_count(page) != 1) {
>>>>>                        atomic_inc(&mp_huge_ins_b);
>>>>>                        continue;
>>>>>                }
>>>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>>>        free = true;
>>>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>>>                page = hu->page + i;
>>>>> -             if (page_ref_count(page) != 1) {
>>>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>>>> -                             page_ref_count(page), i);
>>>>> +                             page_pool_page_ref_count(page), i);
>>>>>                        free = false;
>>>>>                        break;
>>>>>                }
>>>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>>>                page = hu->page + page_i;
>>>>>
>>>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>>>> -                 page_ref_count(page) != 1) {
>>>>> +                 page_pool_page_ref_count(page) != 1) {
>>>>>                        atomic_inc(&mp_huge_ins_b);
>>>>>                        continue;
>>>>>                }
>>>>> --
>>>>> 2.41.0.640.ga95def55d0-goog
>>>>>
>>>>
>>>
>>
> 
> 
> --
> Thanks,
> Mina


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2023-09-08  2:32 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device Mina Almasry
2023-08-10 16:04   ` Samudrala, Sridhar
2023-08-11  2:19     ` Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
2023-08-13 11:26   ` Leon Romanovsky
2023-08-14  1:10   ` David Ahern
2023-08-14  3:15     ` Mina Almasry
2023-08-16  0:16     ` Jakub Kicinski
2023-08-16 16:12       ` Willem de Bruijn
2023-08-18  1:33         ` David Ahern
2023-08-18  2:09           ` Jakub Kicinski
2023-08-18  2:21             ` David Ahern
2023-08-18 21:52             ` Mina Almasry
2023-08-19  1:34               ` David Ahern
2023-08-19  2:06                 ` Jakub Kicinski
2023-08-19  3:30                   ` David Ahern
2023-08-19 14:18                     ` Willem de Bruijn
2023-08-19 17:59                       ` Mina Almasry
2023-08-21 21:16                       ` Jakub Kicinski
2023-08-22  0:38                         ` Willem de Bruijn
2023-08-22  1:51                           ` Jakub Kicinski
2023-08-22  3:19                       ` David Ahern
2023-08-30 12:38   ` Yunsheng Lin
2023-09-08  0:47   ` David Wei
2023-08-10  1:57 ` [RFC PATCH v2 03/11] netdev: implement netdevice devmem allocator Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 04/11] memory-provider: updates to core provider API for devmem TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 05/11] memory-provider: implement dmabuf devmem memory provider Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 06/11] page-pool: add device memory support Mina Almasry
2023-08-19  9:51   ` Jesper Dangaard Brouer
2023-08-19 14:08     ` Willem de Bruijn
2023-08-19 15:22       ` Jesper Dangaard Brouer
2023-08-19 15:49         ` David Ahern
2023-08-19 16:12           ` Willem de Bruijn
2023-08-21 21:31             ` Jakub Kicinski
2023-08-22  0:58               ` Willem de Bruijn
2023-08-19 16:11         ` Willem de Bruijn
2023-08-19 20:24         ` Mina Almasry
2023-08-19 20:27           ` Mina Almasry
2023-09-08  2:32           ` David Wei
2023-08-22  6:05     ` Mina Almasry
2023-08-22 12:24       ` Jesper Dangaard Brouer
2023-08-22 23:33         ` Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 07/11] net: support non paged skb frags Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP Mina Almasry
2023-08-10 10:29 ` [RFC PATCH v2 00/11] Device Memory TCP Christian König
2023-08-10 16:06   ` Jason Gunthorpe
2023-08-10 18:44   ` Mina Almasry
2023-08-10 18:58     ` Jason Gunthorpe
2023-08-11  1:56       ` Mina Almasry
2023-08-11 11:02     ` Christian König
2023-08-14  1:12 ` David Ahern
2023-08-14  2:11   ` Mina Almasry
2023-08-17 18:00   ` Pavel Begunkov
2023-08-17 22:18     ` Mina Almasry
2023-08-23 22:52       ` David Wei
2023-08-24  3:35         ` David Ahern
2023-08-15 13:38 ` David Laight
2023-08-15 14:41   ` Willem de Bruijn
