netdev.vger.kernel.org archive mirror
* [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU.
@ 2020-06-18 16:09 Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 01/21] mm: add {add|release}_memory_pages Jonathan Lemon
                   ` (20 more replies)
  0 siblings, 21 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This series is a working RFC proof-of-concept that implements DMA
zero-copy between the NIC and a GPU device for the data path, while
keeping the protocol processing on the host CPU.

This also works for zero-copy send/recv to host (CPU) memory.

Current limitations:
  - mlx5 only, header splitting is at a fixed offset.
  - currently only TCP protocol delivery is performed.
  - not optimized (hey, it works!)
  - TX completion notification is planned, but not in this patchset.
  - one socket per device
  - not compatible with xsk (reuses the same data structures)
  - not compatible with bpf payload inspection
  - x86 !iommu only; liberties are taken with PA addresses.

The next section provides a brief overview of how things work for this
phase 0 proof of concept.

A transport context is created on a device, which sets up the datapath
and the device queues.  Only specialized RX queues are needed; the
standard TX queues are used for packet transmission.

Memory areas which participate in zero-copy transmission are registered
with the context.  These areas can be used as either RX packet buffers
or TX data areas (or both).  The memory can come from either malloc/mmap
or cudaMalloc().  The latter call provides a handle to the userspace
application, but the memory region is only accessible to the GPU.

A socket is created and registered with the context; registration sets
SOCK_ZEROCOPY, and the socket is bound to the device with SO_BINDTODEVICE.
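
As a rough sketch of the userspace setup: netgpu_open(), netgpu_register_region()
and the SO_REGISTER_DMA argument layout are placeholders here; the real entry
points are defined by the netgpu UAPI and the userspace library linked below.

    /* create a netgpu context on the device/queue (placeholder helper) */
    ctx = netgpu_open(ifindex, queue_id);

    /* memory can come from malloc/mmap or cudaMalloc() */
    buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    netgpu_register_region(ctx, buf, sz);        /* placeholder helper */

    /* create a TCP socket, bind it to the device, attach the context */
    fd = socket(AF_INET6, SOCK_STREAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "eth0", sizeof("eth0"));
    setsockopt(fd, SOL_SOCKET, SO_REGISTER_DMA, &ctx_arg, sizeof(ctx_arg));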

Asymmetrical data paths (zc TX with normal RX, or vice versa) are also
possible, but the current PoC sets things up for symmetrical transport.  The
application needs to provide the RX buffers to the receive queue,
similar to AF_XDP.

Once things are set up, data is sent to the network with sendmsg().  The
iovecs provided contain an address in the region previously registered.
The normal protocol stack processing constructs the packet, but the data
is not touched by the stack.  In this phase, the application is not
notified when the protocol processing is complete and the data area is
safe to modify again.
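
For illustration, a zero-copy send then looks like a normal sendmsg() whose
iovec points into a registered region (MSG_NETDMA comes from patch 18; the
variable names are illustrative):

    struct iovec iov = {
        .iov_base = region_base + offset,   /* inside a registered area */
        .iov_len  = len,
    };
    struct msghdr msg = {
        .msg_iov    = &iov,
        .msg_iovlen = 1,
    };
    ssize_t n;

    n = sendmsg(fd, &msg, MSG_NETDMA);  /* headers built by the stack, payload not copied */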

For RX, packets undergo the usual protocol processing and are delivered
up to the socket receive queue.  At this point, the skb data fragments
are delivered to the application as iovecs through an AF_XDP style
queue.  The application can poll for readability, but does not use
read() to receive the data.
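
A minimal sketch of the receive side, using the shared-queue helpers from the
shqueue patch; the entry layout and process_fragment() are placeholders, and
ctx->rx stands for the application's handle on the shared RX queue:

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct my_rx_iovec *iov;                /* hypothetical entry layout */

    poll(&pfd, 1, -1);
    while ((iov = sq_cons_peek(&ctx->rx)) != NULL) {
        /* iov describes one fragment in the registered (GPU or host) region */
        process_fragment(iov);              /* placeholder */
        sq_cons_advance(&ctx->rx);
    }
    sq_cons_complete(&ctx->rx);             /* publish the consumer index */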

The initial application used is iperf3; a modified version that uses the
userspace library is available at:
    https://github.com/jlemon/iperf
    https://github.com/jlemon/netgpu

Running "iperf3 -s -z --dport 8888" (host memory) on a 12Gbps link:
    11.3 Gbit/sec receive
    10.8 Gbit/sec transmit

Running "iperf3 -s -z --dport 8888 --gpu" on a 25Gbps link:
    22.5 Gbit/sec receive
    12.6 Gbit/sec transmit  (!!!)

For the GPU runs, the Intel PCI monitoring tools were used to confirm
that the host PCI bus was mostly idle.  The TX performance needs further
investigation.

Comments welcome.  The next phase of the work will clean up the
interface, add completion notifications, and provide a flexible queue
creation mechanism.
--
Jonathan


Jonathan Lemon (21):
  mm: add {add|release}_memory_pages
  mm: Allow DMA mapping of pages which are not online
  tcp: Pad TCP options out to a fixed size
  mlx5: add definitions for header split and netgpu
  mlx5/xsk: check that xsk does not conflict with netgpu
  mlx5: add header_split flag
  mlx5: remove the umem parameter from mlx5e_open_channel
  misc: add shqueue.h for prototyping
  include: add definitions for netgpu
  mlx5: add netgpu queue functions
  skbuff: add a zc_netgpu bitflag
  mlx5: hook up the netgpu channel functions
  netdevice: add SETUP_NETGPU to the netdev_bpf structure
  kernel: export free_uid
  netgpu: add network/gpu dma module
  lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
  net/core: add the SO_REGISTER_DMA socket option
  tcp: add MSG_NETDMA flag for sendmsg()
  core: add page recycling logic for netgpu pages
  core/skbuff: use skb_zdata for testing whether skb is zerocopy
  mlx5: add XDP_SETUP_NETGPU hook

 drivers/misc/Kconfig                          |    1 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/netgpu/Kconfig                   |   10 +
 drivers/misc/netgpu/Makefile                  |   11 +
 drivers/misc/netgpu/nvidia.c                  | 1516 +++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   22 +-
 .../mellanox/mlx5/core/en/netgpu/setup.c      |  475 ++++++
 .../mellanox/mlx5/core/en/netgpu/setup.h      |   42 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |    3 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |    3 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |    3 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |   15 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  118 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   52 +-
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   15 +-
 include/linux/dma-mapping.h                   |    4 +-
 include/linux/memory_hotplug.h                |    4 +
 include/linux/mmzone.h                        |    7 +
 include/linux/netdevice.h                     |    6 +
 include/linux/skbuff.h                        |   27 +-
 include/linux/socket.h                        |    1 +
 include/linux/uio.h                           |    4 +
 include/net/netgpu.h                          |   65 +
 include/uapi/asm-generic/socket.h             |    2 +
 include/uapi/misc/netgpu.h                    |   43 +
 include/uapi/misc/shqueue.h                   |  205 +++
 kernel/user.c                                 |    1 +
 lib/iov_iter.c                                |   45 +
 mm/memory_hotplug.c                           |   65 +-
 net/core/datagram.c                           |    6 +-
 net/core/skbuff.c                             |   44 +-
 net/core/sock.c                               |   26 +
 net/ipv4/tcp.c                                |    8 +
 net/ipv4/tcp_output.c                         |   16 +
 35 files changed, 2828 insertions(+), 41 deletions(-)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/nvidia.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
 create mode 100644 include/net/netgpu.h
 create mode 100644 include/uapi/misc/netgpu.h
 create mode 100644 include/uapi/misc/shqueue.h

-- 
2.24.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 01/21] mm: add {add|release}_memory_pages
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 02/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This allows creation of system pages at a specific physical address,
which is useful for creating dummy backing pages that correspond to
unaddressable external memory at specific locations.
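
As a hypothetical usage sketch (the signatures match this patch; the caller
and the BAR variables are illustrative), a driver wanting struct pages that
shadow an unaddressable device region might do:

	struct mhp_params params = { .pgprot = PAGE_KERNEL };
	struct resource *res;

	res = add_memory_pages(nid, bar_phys_addr, bar_size, &params);
	if (IS_ERR(res))
		return PTR_ERR(res);

	/* ... pfn_to_page() over the new range yields the dummy pages ... */

	release_memory_pages(res);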

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/memory_hotplug.h |  4 +++
 mm/memory_hotplug.c            | 65 ++++++++++++++++++++++++++++++++--
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd8..05e012e1a203 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -138,6 +138,10 @@ extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		       struct mhp_params *params);
 
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+                                  struct mhp_params *params);
+void release_memory_pages(struct resource *res);
+
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct mhp_params *params)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9b34e03e730a..926cd4a2f81f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -125,8 +125,8 @@ static struct resource *register_memory_resource(u64 start, u64 size,
 			       resource_name, flags);
 
 	if (!res) {
-		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
-				start, start + size);
+		pr_debug("Unable to reserve %s region: %016llx->%016llx\n",
+				resource_name, start, start + size);
 		return ERR_PTR(-EEXIST);
 	}
 	return res;
@@ -1109,6 +1109,67 @@ int add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(add_memory);
 
+static int __ref add_memory_section(int nid, struct resource *res,
+				    struct mhp_params *params)
+{
+	u64 start, end, section_size;
+	int ret;
+
+	/* must align start/end with memory block size */
+	end = res->start + resource_size(res);
+	section_size = memory_block_size_bytes();
+	start = round_down(res->start, section_size);
+	end = round_up(end, section_size);
+
+	mem_hotplug_begin();
+	ret = __add_pages(nid,
+		PHYS_PFN(start), PHYS_PFN(end - start), params);
+	mem_hotplug_done();
+
+	return ret;
+}
+
+/* requires device_hotplug_lock, see add_memory_resource() */
+static struct resource * __ref __add_memory_pages(int nid, u64 start, u64 size,
+				    struct mhp_params *params)
+{
+	struct resource *res;
+	int ret;
+
+	res = register_memory_resource(start, size, "Private RAM");
+	if (IS_ERR(res))
+		return res;
+
+	ret = add_memory_section(nid, res, params);
+	if (ret < 0) {
+		release_memory_resource(res);
+		return ERR_PTR(ret);
+	}
+
+	return res;
+}
+
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+				  struct mhp_params *params)
+{
+	struct resource *res;
+
+	lock_device_hotplug();
+	res = __add_memory_pages(nid, start, size, params);
+	unlock_device_hotplug();
+
+	return res;
+}
+EXPORT_SYMBOL_GPL(add_memory_pages);
+
+void release_memory_pages(struct resource *res)
+{
+	lock_device_hotplug();
+	release_memory_resource(res);
+	unlock_device_hotplug();
+}
+EXPORT_SYMBOL_GPL(release_memory_pages);
+
 /*
  * Add special, driver-managed memory to the system as system RAM. Such
  * memory is not exposed via the raw firmware-provided memmap as system
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 02/21] mm: Allow DMA mapping of pages which are not online
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 01/21] mm: add {add|release}_memory_pages Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 03/21] tcp: Pad TCP options out to a fixed size Jonathan Lemon
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Change the system RAM check from 'valid' to 'online', so dummy
pages which refer to external DMA resources can be mapped.
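
For illustration, the dummy pages from the previous patch are hot-added but
never onlined, so a mapping along these lines is now permitted (dev, size and
direction are illustrative):

	/* pfn_valid() is true for the dummy page, but pfn_online() is not */
	dma_addr = dma_map_resource(dev, page_to_phys(dummy_page), size,
				    DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(dev, dma_addr))
		return -ENOMEM;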

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/dma-mapping.h | 4 ++--
 include/linux/mmzone.h      | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 78f677cf45ab..fb142a01d1ba 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -348,8 +348,8 @@ static inline dma_addr_t dma_map_resource(struct device *dev,
 
 	BUG_ON(!valid_dma_direction(dir));
 
-	/* Don't allow RAM to be mapped */
-	if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
+	/* Don't allow online RAM to be mapped */
+	if (WARN_ON_ONCE(pfn_online(PHYS_PFN(phys_addr))))
 		return DMA_MAPPING_ERROR;
 
 	if (dma_is_direct(ops))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..9a9fe5704f97 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1348,6 +1348,13 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 	return -1;
 }
 
+static inline int pfn_online(unsigned long pfn)
+{
+	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+		return 0;
+	return online_section(__nr_to_section(pfn_to_section_nr(pfn)));
+}
+
 /*
  * These are _only_ used during initialisation, therefore they
  * can use __initdata ...  They could have names to indicate
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 03/21] tcp: Pad TCP options out to a fixed size
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 01/21] mm: add {add|release}_memory_pages Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 02/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 04/21] mlx5: add definitions for header split and netgpu Jonathan Lemon
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

The "header splitting" feature used by netgpu doesn't actually parse
the incoming packet header.  Instead, it splits the packet at a fixed
offset.  In order for this to work, the sender needs to send packets
with a fixed header size.

(Obviously not for upstream committing, just for prototyping)

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/ipv4/tcp_output.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a50e1990a845..afc996ef2d4e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 pad_size;		/* additional nops for padding */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -562,6 +563,15 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	smc_options_write(ptr, &options);
 
 	mptcp_options_write(ptr, opts);
+
+	/* pad out options for netgpu */
+	if (opts->pad_size) {
+		int len = opts->pad_size;
+		u8 *p = (u8 *)ptr;
+
+		while (len--)
+			*p++ = TCPOPT_NOP;
+	}
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
@@ -824,6 +834,12 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+	/* force padding for netgpu */
+	if (size < 20) {
+		opts->pad_size = 20 - size;
+		size += opts->pad_size;
+	}
+
 	return size;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 04/21] mlx5: add definitions for header split and netgpu
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (2 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 03/21] tcp: Pad TCP options out to a fixed size Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 05/21] mlx5/xsk: check that xsk does not conflict with netgpu Jonathan Lemon
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Add definitions for fixed-length header splitting at TOTAL_HEADERS,
and pointers for the upcoming netdma work.  This reuses the same
structures as xsk, so both cannot be run simultaneously.
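
With the TCP options padded out to 20 bytes by the previous patch, the
split offset defined here works out to a fixed value:

	TOTAL_HEADERS = MAC_HDR_LEN + IP6_HDRS_LEN + TCP_HDRS_LEN
	              = 14 + 40 + (20 + 20)
	              = 94 bytes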

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 22 +++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  3 +++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 842db20493df..0483cc815340 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -58,6 +58,12 @@
 
 extern const struct net_device_ops mlx5e_netdev_ops;
 struct page_pool;
+#define TCP_HDRS_LEN (20 + 20)  /* headers + options */
+#define IP6_HDRS_LEN (40)
+#define MAC_HDR_LEN (14)
+#define TOTAL_HEADERS (TCP_HDRS_LEN + IP6_HDRS_LEN + MAC_HDR_LEN)
+#define HD_SPLIT_DEFAULT_FRAG_SIZE (4096)
+#define MLX5E_HD_SPLIT(params) (params->hd_split)
 
 #define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
 #define MLX5E_METADATA_ETHER_LEN 8
@@ -228,6 +234,7 @@ enum mlx5e_priv_flag {
 	MLX5E_PFLAG_RX_STRIDING_RQ,
 	MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
 	MLX5E_PFLAG_XDP_TX_MPWQE,
+	MLX5E_PFLAG_RX_HD_SPLIT,
 	MLX5E_NUM_PFLAGS, /* Keep last */
 };
 
@@ -263,6 +270,7 @@ struct mlx5e_params {
 	struct mlx5e_xsk *xsk;
 	unsigned int sw_mtu;
 	int hard_mtu;
+	bool hd_split;
 };
 
 enum {
@@ -299,7 +307,8 @@ struct mlx5e_cq_decomp {
 
 enum mlx5e_dma_map_type {
 	MLX5E_DMA_MAP_SINGLE,
-	MLX5E_DMA_MAP_PAGE
+	MLX5E_DMA_MAP_PAGE,
+	MLX5E_DMA_MAP_FIXED
 };
 
 struct mlx5e_sq_dma {
@@ -367,6 +376,7 @@ struct mlx5e_dma_info {
 		struct page *page;
 		struct xdp_buff *xsk;
 	};
+	bool netgpu_source;
 };
 
 /* XDP packets can be transmitted in different ways. On completion, we need to
@@ -545,6 +555,7 @@ enum mlx5e_rq_flag {
 struct mlx5e_rq_frag_info {
 	int frag_size;
 	int frag_stride;
+	int frag_source;
 };
 
 struct mlx5e_rq_frags_info {
@@ -554,6 +565,7 @@ struct mlx5e_rq_frags_info {
 	u8 wqe_bulk;
 };
 
+struct netgpu_ctx;
 struct mlx5e_rq {
 	/* data path */
 	union {
@@ -611,6 +623,7 @@ struct mlx5e_rq {
 
 	/* AF_XDP zero-copy */
 	struct xdp_umem       *umem;
+	struct netgpu_ctx     *netgpu;
 
 	struct work_struct     recover_work;
 
@@ -628,6 +641,7 @@ struct mlx5e_rq {
 
 enum mlx5e_channel_state {
 	MLX5E_CHANNEL_STATE_XSK,
+	MLX5E_CHANNEL_STATE_NETGPU,
 	MLX5E_CHANNEL_NUM_STATES
 };
 
@@ -736,9 +750,13 @@ struct mlx5e_xsk {
 	 * but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
 	 * so rely on our mechanism.
 	 */
-	struct xdp_umem **umems;
+	union {
+		struct xdp_umem **umems;
+		struct netgpu_ctx **ctx_tbl;
+	};
 	u16 refcnt;
 	bool ever_used;
+	bool is_netgpu;
 };
 
 /* Temporary storage for variables that are allocated when struct mlx5e_priv is
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index bfd3e1161bc6..dfd20c30de02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -238,6 +238,9 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
 	case MLX5E_DMA_MAP_PAGE:
 		dma_unmap_page(pdev, dma->addr, dma->size, DMA_TO_DEVICE);
 		break;
+	case MLX5E_DMA_MAP_FIXED:
+		/* DMA mappings are fixed, or managed elsewhere. */
+		break;
 	default:
 		WARN_ONCE(true, "mlx5e_tx_dma_unmap unknown DMA type!\n");
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 05/21] mlx5/xsk: check that xsk does not conflict with netgpu
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (3 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 04/21] mlx5: add definitions for header split and netgpu Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 06/21] mlx5: add header_split flag Jonathan Lemon
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

netgpu will use the same data structures as xsk, so make sure that
the two do not conflict.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
index 7b17fcd0a56d..f3d3569816cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
@@ -27,7 +27,10 @@ static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
 				     sizeof(*xsk->umems), GFP_KERNEL);
 		if (unlikely(!xsk->umems))
 			return -ENOMEM;
+		xsk->is_netgpu = false;
 	}
+	if (xsk->is_netgpu)
+		return -EINVAL;
 
 	xsk->refcnt++;
 	xsk->ever_used = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
index 25b4cbe58b54..c7eff534d28a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
@@ -15,6 +15,9 @@ static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
 	if (unlikely(ix >= params->num_channels))
 		return NULL;
 
+	if (unlikely(xsk->is_netgpu))
+		return NULL;
+
 	return xsk->umems[ix];
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (4 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 05/21] mlx5/xsk: check that xsk does not conflict with netgpu Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 18:12   ` Eric Dumazet
  2020-06-18 16:09 ` [RFC PATCH 07/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Adds a "rx_hd_split" private flag parameter to ethtool.

This enables header splitting, and sets up the fragment mappings.
The feature is currently only enabled for netgpu channels.
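
Once applied, the split can be toggled through the standard ethtool
private-flags interface (device name illustrative), which reopens the
channels:

	ethtool --set-priv-flags eth0 rx_hd_split on
	ethtool --show-priv-flags eth0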

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 15 +++++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 45 +++++++++++++++----
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index ec5658bbe3c5..a1b5d8b33b0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1905,6 +1905,20 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
 	return err;
 }
 
+static int set_pflag_rx_hd_split(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err;
+
+	priv->channels.params.hd_split = enable;
+	err = mlx5e_safe_reopen_channels(priv);
+	if (err)
+		netdev_err(priv->netdev,
+			   "%s failed to reopen channels, err(%d).\n",
+				__func__, err);
+	return err;
+}
+
 static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_cqe_moder",        set_pflag_rx_cqe_based_moder },
 	{ "tx_cqe_moder",        set_pflag_tx_cqe_based_moder },
@@ -1912,6 +1926,7 @@ static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_striding_rq",      set_pflag_rx_striding_rq },
 	{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
 	{ "xdp_tx_mpwqe",        set_pflag_xdp_tx_mpwqe },
+	{ "rx_hd_split",	 set_pflag_rx_hd_split },
 };
 
 static int mlx5e_handle_pflag(struct net_device *netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a836a02a2116..cc8d30aa8a33 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -123,7 +123,8 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
 
 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
 {
-	params->rq_wq_type = mlx5e_striding_rq_possible(mdev, params) &&
+	params->rq_wq_type = MLX5E_HD_SPLIT(params) ? MLX5_WQ_TYPE_CYCLIC :
+		mlx5e_striding_rq_possible(mdev, params) &&
 		MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ) ?
 		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
 		MLX5_WQ_TYPE_CYCLIC;
@@ -323,6 +324,8 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
 				if (prev)
 					prev->last_in_page = true;
 			}
+			next_frag.di->netgpu_source =
+						!!frag_info[f].frag_source;
 			*frag = next_frag;
 
 			/* prepare next */
@@ -373,6 +376,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
+	bool hd_split = MLX5E_HD_SPLIT(params) && (umem == (void *)0x1);
+	u32 num_xsk_frames = 0;
 	u32 rq_xdp_ix;
 	u32 pool_size;
 	int wq_sz;
@@ -391,9 +396,10 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->xdpsq   = &c->rq_xdpsq;
-	rq->umem    = umem;
+	if (xsk)
+		rq->umem    = umem;
 
-	if (rq->umem)
+	if (umem)
 		rq->stats = &c->priv->channel_stats[c->ix].xskrq;
 	else
 		rq->stats = &c->priv->channel_stats[c->ix].rq;
@@ -404,14 +410,18 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->xdp_prog = params->xdp_prog;
 
 	rq_xdp_ix = rq->ix;
-	if (xsk)
+	if (umem)
 		rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
 	err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
 	if (err < 0)
 		goto err_rq_wq_destroy;
 
+	if (umem == (void *)0x1)
+		rq->buff.headroom = 0;
+	else
+		rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
+
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
-	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
 	pool_size = 1 << params->log_rq_mtu_frames;
 
 	switch (rq->wq_type) {
@@ -509,6 +519,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 		rq->wqe.skb_from_cqe = xsk ?
 			mlx5e_xsk_skb_from_cqe_linear :
+			hd_split ? mlx5e_skb_from_cqe_nonlinear :
 			mlx5e_rx_is_linear_skb(params, NULL) ?
 				mlx5e_skb_from_cqe_linear :
 				mlx5e_skb_from_cqe_nonlinear;
@@ -2035,13 +2046,19 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 	int frag_size_max = DEFAULT_FRAG_SIZE;
 	u32 buf_size = 0;
 	int i;
+	bool hd_split = MLX5E_HD_SPLIT(params) && xsk;
+
+	if (hd_split)
+		frag_size_max = HD_SPLIT_DEFAULT_FRAG_SIZE;
+	else
+		frag_size_max = DEFAULT_FRAG_SIZE;
 
 #ifdef CONFIG_MLX5_EN_IPSEC
 	if (MLX5_IPSEC_DEV(mdev))
 		byte_count += MLX5E_METADATA_ETHER_LEN;
 #endif
 
-	if (mlx5e_rx_is_linear_skb(params, xsk)) {
+	if (!hd_split && mlx5e_rx_is_linear_skb(params, xsk)) {
 		int frag_stride;
 
 		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
@@ -2059,6 +2076,16 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 		frag_size_max = PAGE_SIZE;
 
 	i = 0;
+
+	if (hd_split) {
+		// Start with one fragment for all headers (implementing HDS)
+		info->arr[0].frag_size = TOTAL_HEADERS;
+		info->arr[0].frag_stride = roundup_pow_of_two(PAGE_SIZE);
+		buf_size += TOTAL_HEADERS;
+		// Now, continue with the payload frags.
+		i = 1;
+	}
+
 	while (buf_size < byte_count) {
 		int frag_size = byte_count - buf_size;
 
@@ -2066,8 +2093,10 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 			frag_size = min(frag_size, frag_size_max);
 
 		info->arr[i].frag_size = frag_size;
-		info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
-
+		info->arr[i].frag_stride = roundup_pow_of_two(hd_split ?
+							      PAGE_SIZE :
+							      frag_size);
+		info->arr[i].frag_source = hd_split;
 		buf_size += frag_size;
 		i++;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 07/21] mlx5: remove the umem parameter from mlx5e_open_channel
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (5 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 06/21] mlx5: add header_split flag Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 08/21] misc: add shqueue.h for prototyping Jonathan Lemon
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Instead of obtaining the umem parameter from the channel parameters
and passing it to the function, push this down into the function itself.

Move xsk open logic into its own function, in preparation for the
upcoming netgpu commit.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 35 +++++++++++++------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cc8d30aa8a33..01d234369df6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1935,15 +1935,33 @@ static u8 mlx5e_enumerate_lag_port(struct mlx5_core_dev *mdev, int ix)
 	return (ix + port_aff_bias) % mlx5e_get_num_lag_ports(mdev);
 }
 
+static int
+mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
+			struct mlx5e_params *params,
+			struct mlx5e_channel_param *cparam,
+			struct mlx5e_channel *c)
+{
+	struct mlx5e_xsk_param xsk;
+	struct xdp_umem *umem;
+	int err = 0;
+
+	umem = mlx5e_xsk_get_umem(params, params->xsk, ix);
+
+	if (umem) {
+		mlx5e_build_xsk_param(umem, &xsk);
+		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	}
+
+	return err;
+}
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
-			      struct xdp_umem *umem,
 			      struct mlx5e_channel **cp)
 {
 	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
 	struct net_device *netdev = priv->netdev;
-	struct mlx5e_xsk_param xsk;
 	struct mlx5e_channel *c;
 	unsigned int irq;
 	int err;
@@ -1977,9 +1995,9 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	if (unlikely(err))
 		goto err_napi_del;
 
-	if (umem) {
-		mlx5e_build_xsk_param(umem, &xsk);
-		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	/* This opens a second set of shadow queues for xsk */
+	if (params->xdp_prog) {
+		err = mlx5e_xsk_optional_open(priv, ix, params, cparam, c);
 		if (unlikely(err))
 			goto err_close_queues;
 	}
@@ -2345,12 +2363,7 @@ int mlx5e_open_channels(struct mlx5e_priv *priv,
 
 	mlx5e_build_channel_param(priv, &chs->params, cparam);
 	for (i = 0; i < chs->num; i++) {
-		struct xdp_umem *umem = NULL;
-
-		if (chs->params.xdp_prog)
-			umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
-
-		err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
+		err = mlx5e_open_channel(priv, i, &chs->params, cparam, &chs->c[i]);
 		if (err)
 			goto err_close_channels;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 08/21] misc: add shqueue.h for prototyping
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (6 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 07/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 09/21] include: add definitions for netgpu Jonathan Lemon
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Shared queues between user and kernel use their own private structures
for accessing a shared data area, but they need to use the same queue
functions.

Rather than doing the 'right' thing and duplicating the file for
each domain, temporarily cheat for prototyping and use a single shared
file.
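
For orientation, a minimal producer/consumer sketch against the helpers
below (the element type and consume() are placeholders; callers define
their own fixed-size elements):

	struct my_elt *e;

	/* producer side */
	e = sq_prod_reserve(&q);
	if (e) {
		e->addr = addr;
		e->len  = len;
		sq_prod_submit(&q);
	}

	/* consumer side */
	while ((e = sq_cons_peek(&q)) != NULL) {
		consume(e);			/* placeholder */
		sq_cons_advance(&q);
	}
	sq_cons_complete(&q);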

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/uapi/misc/shqueue.h | 205 ++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 include/uapi/misc/shqueue.h

diff --git a/include/uapi/misc/shqueue.h b/include/uapi/misc/shqueue.h
new file mode 100644
index 000000000000..258b9db35dbd
--- /dev/null
+++ b/include/uapi/misc/shqueue.h
@@ -0,0 +1,205 @@
+#pragma once
+
+/* XXX
+ * This is not a user API, but placed here for prototyping, in order to
+ * avoid two nearly identical copies for user and kernel space.
+ */
+
+/* kernel only */
+struct shared_queue_map {
+	unsigned prod ____cacheline_aligned_in_smp;
+	unsigned cons ____cacheline_aligned_in_smp;
+	char data[] ____cacheline_aligned_in_smp;
+};
+
+/* user and kernel private copy - identical in order to share sq* fcns */
+struct shared_queue {
+	unsigned *prod;
+	unsigned *cons;
+	char *data;
+	unsigned elt_sz;
+	unsigned mask;
+	unsigned cached_prod;
+	unsigned cached_cons;
+	unsigned entries;
+
+	unsigned map_sz;
+	void *map_ptr;
+};
+
+/*
+ * see documentation in tools/include/linux/ring_buffer.h
+ * using explicit smp_/_ONCE is an optimization over smp_{store|load}
+ */
+
+static inline void __sq_load_acquire_cons(struct shared_queue *q)
+{
+	/* Refresh the local tail pointer */
+	q->cached_cons = READ_ONCE(*q->cons);
+	/* A, matches D */
+}
+
+static inline void __sq_store_release_cons(struct shared_queue *q)
+{
+	smp_mb(); /* D, matches A */
+	WRITE_ONCE(*q->cons, q->cached_cons);
+}
+
+static inline void __sq_load_acquire_prod(struct shared_queue *q)
+{
+	/* Refresh the local pointer */
+	q->cached_prod = READ_ONCE(*q->prod);
+	smp_rmb(); /* C, matches B */
+}
+
+static inline void __sq_store_release_prod(struct shared_queue *q)
+{
+	smp_wmb(); /* B, matches C */
+	WRITE_ONCE(*q->prod, q->cached_prod);
+}
+
+static inline void sq_cons_refresh(struct shared_queue *q)
+{
+	__sq_store_release_cons(q);
+	__sq_load_acquire_prod(q);
+}
+
+static inline bool sq_empty(struct shared_queue *q)
+{
+	return READ_ONCE(*q->prod) == READ_ONCE(*q->cons);
+}
+
+static inline bool sq_cons_empty(struct shared_queue *q)
+{
+	return q->cached_prod == q->cached_cons;
+}
+
+static inline unsigned __sq_cons_ready(struct shared_queue *q)
+{
+	return q->cached_prod - q->cached_cons;
+}
+
+static inline unsigned sq_cons_ready(struct shared_queue *q)
+{
+	if (q->cached_prod == q->cached_cons)
+		__sq_load_acquire_prod(q);
+
+	return q->cached_prod - q->cached_cons;
+}
+
+static inline bool sq_cons_avail(struct shared_queue *q, unsigned count)
+{
+	if (count <= __sq_cons_ready(q))
+		return true;
+	__sq_load_acquire_prod(q);
+	return count <= __sq_cons_ready(q);
+}
+
+static inline void *sq_get_ptr(struct shared_queue *q, unsigned idx)
+{
+	return q->data + (idx & q->mask) * q->elt_sz;
+}
+
+static inline void sq_cons_complete(struct shared_queue *q)
+{
+	__sq_store_release_cons(q);
+}
+
+static inline void *sq_cons_peek(struct shared_queue *q)
+{
+	if (sq_cons_empty(q)) {
+		sq_cons_refresh(q);
+		if (sq_cons_empty(q))
+			return NULL;
+	}
+	return sq_get_ptr(q, q->cached_cons);
+}
+
+static inline unsigned
+sq_peek_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+	unsigned i, idx, ready;
+
+	ready = sq_cons_ready(q);
+	if (!ready)
+		return 0;
+
+	count = count > ready ? ready : count;
+
+	idx = q->cached_cons;
+	for (i = 0; i < count; i++)
+		ptr[i] = sq_get_ptr(q, idx++);
+
+	q->cached_cons += count;
+
+	return count;
+}
+
+static inline unsigned
+sq_cons_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+	unsigned i, idx, ready;
+
+	ready = sq_cons_ready(q);
+	if (!ready)
+		return 0;
+
+	count = count > ready ? ready : count;
+
+	idx = q->cached_cons;
+	for (i = 0; i < count; i++)
+		ptr[i] = sq_get_ptr(q, idx++);
+
+	q->cached_cons += count;
+	sq_cons_complete(q);
+
+	return count;
+}
+
+static inline void sq_cons_advance(struct shared_queue *q)
+{
+	q->cached_cons++;
+}
+
+static inline unsigned __sq_prod_space(struct shared_queue *q)
+{
+	return q->entries - (q->cached_prod - q->cached_cons);
+}
+
+static inline unsigned sq_prod_space(struct shared_queue *q)
+{
+	unsigned space;
+
+	space = __sq_prod_space(q);
+	if (!space) {
+		__sq_load_acquire_cons(q);
+		space = __sq_prod_space(q);
+	}
+	return space;
+}
+
+static inline bool sq_prod_avail(struct shared_queue *q, unsigned count)
+{
+	if (count <= __sq_prod_space(q))
+		return true;
+	__sq_load_acquire_cons(q);
+	return count <= __sq_prod_space(q);
+}
+
+static inline void *sq_prod_get_ptr(struct shared_queue *q)
+{
+	return sq_get_ptr(q, q->cached_prod++);
+}
+
+static inline void *sq_prod_reserve(struct shared_queue *q)
+{
+	if (!sq_prod_space(q))
+		return NULL;
+
+	return sq_prod_get_ptr(q);
+}
+
+static inline void sq_prod_submit(struct shared_queue *q)
+{
+	__sq_store_release_prod(q);
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 09/21] include: add definitions for netgpu
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (7 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 08/21] misc: add shqueue.h for prototyping Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 10/21] mlx5: add netgpu queue functions Jonathan Lemon
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This adds the netgpu structure (which arguably should be private),
and some cruft to support using netgpu as a loadable module, which
should disappear.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/net/netgpu.h | 65 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 include/net/netgpu.h

diff --git a/include/net/netgpu.h b/include/net/netgpu.h
new file mode 100644
index 000000000000..fee84ba3db78
--- /dev/null
+++ b/include/net/netgpu.h
@@ -0,0 +1,65 @@
+#pragma once
+
+struct net_device;
+#include <uapi/misc/shqueue.h>		/* XXX */
+
+struct netgpu_pgcache {
+	struct netgpu_pgcache *next;
+	struct page *page[];
+};
+
+struct netgpu_ctx {
+	struct xarray xa;		/* contains regions */
+	unsigned int index;
+	refcount_t ref;
+	struct shared_queue fill;
+	struct shared_queue rx;
+	struct net_device *dev;
+	struct netgpu_pgcache *napi_cache;
+	struct netgpu_pgcache *spare_cache;
+	struct netgpu_pgcache *any_cache;
+	spinlock_t pgcache_lock;
+	struct page *dummy_page;
+	unsigned page_extra_refc;
+	int queue_id;
+	int napi_cache_count;
+	int any_cache_count;
+	struct user_struct *user;
+	unsigned account_mem : 1;
+};
+
+int netgpu_get_page(struct netgpu_ctx *ctx, struct page **page,
+		    dma_addr_t *dma);
+void netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi);
+int netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count);
+
+/*---------------------------------------------------------------------------*/
+/* XXX temporary development support */
+
+extern int (*fn_netgpu_get_page)(struct netgpu_ctx *ctx,
+			  struct page **page, dma_addr_t *dma);
+extern void (*fn_netgpu_put_page)(struct netgpu_ctx *, struct page *, bool);
+extern int (*fn_netgpu_get_pages)(struct sock *, struct page **,
+                           unsigned long, int);
+extern struct netgpu_ctx *g_ctx;
+
+static inline int
+__netgpu_get_page(struct netgpu_ctx *ctx,
+                  struct page **page, dma_addr_t *dma)
+{
+        return fn_netgpu_get_page(ctx, page, dma);
+}
+
+static inline void
+__netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi)
+{
+        return fn_netgpu_put_page(ctx, page, napi);
+}
+
+static inline int
+__netgpu_get_pages(struct sock *sk, struct page **pages,
+                   unsigned long addr, int count)
+{
+        return fn_netgpu_get_pages(sk, pages, addr, count);
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 10/21] mlx5: add netgpu queue functions
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (8 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 09/21] include: add definitions for netgpu Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 11/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Add the netgpu setup/teardown functions, which are not hooked up yet.
The driver also handles netgpu module loading and unloading.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   3 +-
 .../mellanox/mlx5/core/en/netgpu/setup.c      | 475 ++++++++++++++++++
 .../mellanox/mlx5/core/en/netgpu/setup.h      |  42 ++
 3 files changed, 519 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index b61e47bc16e8..27983bd074e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -25,7 +25,8 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
 		en_selftest.o en/port.o en/monitor_stats.o en/health.o \
 		en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/umem.o \
-		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o
+		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o \
+		en/netgpu/setup.o
 
 #
 # Netdev extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
new file mode 100644
index 000000000000..f0578c41951d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
@@ -0,0 +1,475 @@
+#include <linux/prefetch.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/indirect_call_wrapper.h>
+#include <net/ip6_checksum.h>
+#include <net/page_pool.h>
+#include <net/inet_ecn.h>
+#include "en.h"
+#include "en_tc.h"
+#include "lib/clock.h"
+#include "en/xdp.h"
+#include "en/params.h"
+#include "en/netgpu/setup.h"
+
+#include <net/netgpu.h>
+#include <uapi/misc/shqueue.h>
+
+int (*fn_netgpu_get_page)(struct netgpu_ctx *ctx,
+				 struct page **page, dma_addr_t *dma);
+void (*fn_netgpu_put_page)(struct netgpu_ctx *, struct page *, bool);
+int (*fn_netgpu_get_pages)(struct sock *, struct page **,
+			   unsigned long, int);
+struct netgpu_ctx *g_ctx;
+
+static void
+netgpu_fn_unload(void)
+{
+	if (fn_netgpu_get_page)
+		symbol_put(netgpu_get_page);
+	if (fn_netgpu_put_page)
+		symbol_put(netgpu_put_page);
+	if (fn_netgpu_get_pages)
+		symbol_put(netgpu_get_pages);
+
+	fn_netgpu_get_page = NULL;
+	fn_netgpu_put_page = NULL;
+	fn_netgpu_get_pages = NULL;
+}
+
+static int
+netgpu_fn_load(void)
+{
+	fn_netgpu_get_page = symbol_get(netgpu_get_page);
+	fn_netgpu_put_page = symbol_get(netgpu_put_page);
+	fn_netgpu_get_pages = symbol_get(netgpu_get_pages);
+
+	if (fn_netgpu_get_page &&
+	    fn_netgpu_put_page &&
+	    fn_netgpu_get_pages)
+		return 0;
+
+	netgpu_fn_unload();
+
+	return -EFAULT;
+}
+
+void
+mlx5e_netgpu_put_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+		      bool recycle)
+{
+	struct netgpu_ctx *ctx = rq->netgpu;
+	struct page *page = dma_info->page;
+
+	if (page) {
+		put_page(page);
+		__netgpu_put_page(ctx, page, recycle);
+	}
+}
+
+bool
+mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count)
+{
+	struct netgpu_ctx *ctx = rq->netgpu;
+
+	/* XXX
+	 * napi_cache_count is not a total count, and this also
+	 * doesn't consider any_cache_count.
+	 */
+	return ctx->napi_cache_count >= count ||
+		sq_cons_ready(&ctx->fill) >= (count - ctx->napi_cache_count);
+}
+
+void mlx5e_netgpu_taken(struct mlx5e_rq *rq)
+{
+	struct netgpu_ctx *ctx = rq->netgpu;
+
+	sq_cons_complete(&ctx->fill);
+}
+
+int
+mlx5e_netgpu_get_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
+{
+	struct netgpu_ctx *ctx = rq->netgpu;
+
+	return __netgpu_get_page(ctx, &dma_info->page, &dma_info->addr);
+}
+
+struct netgpu_ctx *
+mlx5e_netgpu_get_ctx(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+		     u16 ix)
+{
+	if (!xsk || !xsk->ctx_tbl)
+		return NULL;
+
+	if (unlikely(ix >= params->num_channels))
+		return NULL;
+
+	if (unlikely(!xsk->is_netgpu))
+		return NULL;
+
+	return xsk->ctx_tbl[ix];
+}
+
+static int mlx5e_netgpu_get_tbl(struct mlx5e_xsk *xsk)
+{
+	if (!xsk->ctx_tbl) {
+		xsk->ctx_tbl = kcalloc(MLX5E_MAX_NUM_CHANNELS,
+				       sizeof(*xsk->ctx_tbl), GFP_KERNEL);
+		if (unlikely(!xsk->ctx_tbl))
+			return -ENOMEM;
+		xsk->is_netgpu = true;
+	}
+	if (!xsk->is_netgpu)
+		return -EINVAL;
+
+	xsk->refcnt++;
+	xsk->ever_used = true;
+
+	return 0;
+}
+
+static void mlx5e_netgpu_put_tbl(struct mlx5e_xsk *xsk)
+{
+	if (!--xsk->refcnt) {
+		kfree(xsk->ctx_tbl);
+		xsk->ctx_tbl = NULL;
+	}
+}
+
+static void mlx5e_netgpu_remove_ctx(struct mlx5e_xsk *xsk, u16 ix)
+{
+	xsk->ctx_tbl[ix] = NULL;
+
+	mlx5e_netgpu_put_tbl(xsk);
+}
+
+static int mlx5e_netgpu_add_ctx(struct mlx5e_xsk *xsk, struct netgpu_ctx *ctx,
+				u16 ix)
+{
+	int err;
+
+	err = mlx5e_netgpu_get_tbl(xsk);
+	if (unlikely(err))
+		return err;
+
+	xsk->ctx_tbl[ix] = ctx;
+
+	return 0;
+}
+
+static int mlx5e_netgpu_enable_locked(struct mlx5e_priv *priv,
+				      struct netgpu_ctx *ctx, u16 ix)
+{
+	struct mlx5e_params *params = &priv->channels.params;
+	struct mlx5e_channel *c;
+	int err;
+
+	if (unlikely(mlx5e_netgpu_get_ctx(&priv->channels.params,
+					  &priv->xsk, ix)))
+		return -EBUSY;
+
+	err = mlx5e_netgpu_add_ctx(&priv->xsk, ctx, ix);
+	if (unlikely(err))
+		return err;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		/* XSK objects will be created on open. */
+		goto validate_closed;
+	}
+
+	if (!params->hd_split) {
+		/* XSK objects will be created when header split is set,
+		 * and the channels are reopened.
+		 */
+		goto validate_closed;
+	}
+
+	c = priv->channels.c[ix];
+
+	err = mlx5e_open_netgpu(priv, params, ctx, c);
+	if (unlikely(err))
+		goto err_remove_ctx;
+
+	mlx5e_activate_netgpu(c);
+
+	/* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
+	 * any Fill Ring entries at the setup stage.
+	 */
+
+	err = mlx5e_netgpu_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
+	if (unlikely(err))
+		goto err_deactivate;
+
+	return 0;
+
+err_deactivate:
+	mlx5e_deactivate_netgpu(c);
+	mlx5e_close_netgpu(c);
+
+err_remove_ctx:
+	mlx5e_netgpu_remove_ctx(&priv->xsk, ix);
+
+	return err;
+
+validate_closed:
+	return 0;
+}
+
+static int mlx5e_netgpu_disable_locked(struct mlx5e_priv *priv, u16 ix)
+{
+	struct mlx5e_channel *c;
+	struct netgpu_ctx *ctx;
+
+	ctx = mlx5e_netgpu_get_ctx(&priv->channels.params, &priv->xsk, ix);
+
+	if (unlikely(!ctx))
+		return -EINVAL;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+		goto remove_ctx;
+
+	/* NETGPU RQ is only created if header split is set. */
+	if (!priv->channels.params.hd_split)
+		goto remove_ctx;
+
+	c = priv->channels.c[ix];
+	mlx5e_netgpu_redirect_rqt_to_drop(priv, ix);
+	mlx5e_deactivate_netgpu(c);
+	mlx5e_close_netgpu(c);
+
+remove_ctx:
+	mlx5e_netgpu_remove_ctx(&priv->xsk, ix);
+
+	return 0;
+}
+
+static int mlx5e_netgpu_enable_ctx(struct mlx5e_priv *priv,
+				   struct netgpu_ctx *ctx, u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = netgpu_fn_load();
+	if (!err)
+		err = mlx5e_netgpu_enable_locked(priv, ctx, ix);
+	g_ctx = ctx;
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+static int mlx5e_netgpu_disable_ctx(struct mlx5e_priv *priv, u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_netgpu_disable_locked(priv, ix);
+	netgpu_fn_unload();
+	g_ctx = NULL;
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+int
+mlx5e_netgpu_setup_ctx(struct net_device *dev, struct netgpu_ctx *ctx, u16 qid)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_params *params = &priv->channels.params;
+	u16 ix;
+
+	if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid,
+						   MLX5E_RQ_GROUP_XSK, &ix)))
+		return -EINVAL;
+
+	return ctx ? mlx5e_netgpu_enable_ctx(priv, ctx, ix) :
+		     mlx5e_netgpu_disable_ctx(priv, ix);
+}
+
+static void mlx5e_build_netgpuicosq_param(struct mlx5e_priv *priv,
+					  u8 log_wq_size,
+					  struct mlx5e_sq_param *param)
+{
+	void *sqc = param->sqc;
+	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+	mlx5e_build_sq_param_common(priv, param);
+
+	MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+}
+
+static void mlx5e_build_netgpu_cparam(struct mlx5e_priv *priv,
+				      struct mlx5e_params *params,
+				      struct mlx5e_channel_param *cparam)
+{
+	const u8 icosq_size = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
+	struct mlx5e_xsk_param *xsk = (void *)0x1;
+
+	mlx5e_build_rq_param(priv, params, xsk, &cparam->rq);
+	mlx5e_build_rx_cq_param(priv, params, NULL, &cparam->rx_cq);
+
+	mlx5e_build_netgpuicosq_param(priv, icosq_size, &cparam->icosq);
+	mlx5e_build_ico_cq_param(priv, icosq_size, &cparam->icosq_cq);
+}
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+		      struct netgpu_ctx *ctx, struct mlx5e_channel *c)
+{
+	struct mlx5e_channel_param *cparam;
+	struct dim_cq_moder icocq_moder = {};
+	struct xdp_umem *umem = (void *)0x1;
+	int err;
+
+	cparam = kvzalloc(sizeof(*cparam), GFP_KERNEL);
+	if (!cparam)
+		return -ENOMEM;
+
+	mlx5e_build_netgpu_cparam(priv, params, cparam);
+
+	err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq,
+			   &c->xskrq.cq);
+	if (unlikely(err))
+		goto err_free_cparam;
+
+	err = mlx5e_open_rq(c, params, &cparam->rq, NULL, umem, &c->xskrq);
+	if (unlikely(err))
+		goto err_close_rx_cq;
+	c->xskrq.netgpu = ctx;
+
+	err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->xskicosq.cq);
+	if (unlikely(err))
+		goto err_close_rq;
+
+	/* Create a dedicated SQ for posting NOPs whenever we need an IRQ to be
+	 * triggered and NAPI to be called on the correct CPU.
+	 */
+	err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->xskicosq);
+	if (unlikely(err))
+		goto err_close_icocq;
+
+	kvfree(cparam);
+
+	spin_lock_init(&c->xskicosq_lock);
+
+	set_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+
+	return 0;
+
+err_close_icocq:
+	mlx5e_close_cq(&c->xskicosq.cq);
+
+err_close_rq:
+	mlx5e_close_rq(&c->xskrq);
+
+err_close_rx_cq:
+	mlx5e_close_cq(&c->xskrq.cq);
+
+err_free_cparam:
+	kvfree(cparam);
+
+	return err;
+}
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c)
+{
+	clear_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+	napi_synchronize(&c->napi);
+	synchronize_rcu(); /* Sync with the XSK wakeup. */
+
+	mlx5e_close_rq(&c->xskrq);
+	mlx5e_close_cq(&c->xskrq.cq);
+	mlx5e_close_icosq(&c->xskicosq);
+	mlx5e_close_cq(&c->xskicosq.cq);
+
+	/* zero these out - so the next open has a clean slate. */
+	memset(&c->xskrq, 0, sizeof(c->xskrq));
+	memset(&c->xsksq, 0, sizeof(c->xsksq));
+	memset(&c->xskicosq, 0, sizeof(c->xskicosq));
+}
+
+void mlx5e_activate_netgpu(struct mlx5e_channel *c)
+{
+	mlx5e_activate_icosq(&c->xskicosq);
+	set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+	/* TX queue is created active. */
+
+	spin_lock(&c->xskicosq_lock);
+	mlx5e_trigger_irq(&c->xskicosq);
+	spin_unlock(&c->xskicosq_lock);
+}
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c)
+{
+	mlx5e_deactivate_rq(&c->xskrq);
+	/* TX queue is disabled on close. */
+	mlx5e_deactivate_icosq(&c->xskicosq);
+}
+
+static int mlx5e_redirect_netgpu_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
+{
+	struct mlx5e_redirect_rqt_param direct_rrp = {
+		.is_rss = false,
+		{
+			.rqn = rqn,
+		},
+	};
+
+	u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
+
+	return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
+}
+
+int mlx5e_netgpu_redirect_rqt_to_channel(struct mlx5e_priv *priv,
+					 struct mlx5e_channel *c)
+{
+	return mlx5e_redirect_netgpu_rqt(priv, c->ix, c->xskrq.rqn);
+}
+
+int mlx5e_netgpu_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
+{
+	return mlx5e_redirect_netgpu_rqt(priv, ix, priv->drop_rq.rqn);
+}
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+					   struct mlx5e_channels *chs)
+{
+	int err, i;
+
+	for (i = 0; i < chs->num; i++) {
+		struct mlx5e_channel *c = chs->c[i];
+
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+			continue;
+
+		err = mlx5e_netgpu_redirect_rqt_to_channel(priv, c);
+		if (unlikely(err))
+			goto err_stop;
+	}
+
+	return 0;
+
+err_stop:
+	for (i--; i >= 0; i--) {
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+			continue;
+
+		mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+	}
+
+	return err;
+}
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+					struct mlx5e_channels *chs)
+{
+	int i;
+
+	for (i = 0; i < chs->num; i++) {
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+			continue;
+
+		mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+	}
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
new file mode 100644
index 000000000000..37fde92ef89d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
@@ -0,0 +1,42 @@
+#pragma once
+
+struct netgpu_ctx *
+mlx5e_netgpu_get_ctx(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+                     u16 ix);
+
+int
+mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+                  struct netgpu_ctx *ctx, struct mlx5e_channel *c);
+
+bool mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count);
+void mlx5e_netgpu_taken(struct mlx5e_rq *rq);
+
+int
+mlx5e_netgpu_setup_ctx(struct net_device *dev, struct netgpu_ctx *ctx, u16 qid);
+
+int
+mlx5e_netgpu_get_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info);
+
+void
+mlx5e_netgpu_put_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+		      bool recycle);
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+		      struct netgpu_ctx *ctx, struct mlx5e_channel *c);
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c);
+
+void mlx5e_activate_netgpu(struct mlx5e_channel *c);
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c);
+
+int mlx5e_netgpu_redirect_rqt_to_channel(struct mlx5e_priv *priv,
+					 struct mlx5e_channel *c);
+
+int mlx5e_netgpu_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix);
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+					   struct mlx5e_channels *chs);
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+					struct mlx5e_channels *chs);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 11/21] skbuff: add a zc_netgpu bitflag
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (9 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 10/21] mlx5: add netgpu queue functions Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 12/21] mlx5: hook up the netgpu channel functions Jonathan Lemon
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This could likely be moved elsewhere.  The presence of the flag on
the skb indicates that one of the fragments may contain zerocopy
data (where the data is not accessible to the CPU).

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 3 ++-
 net/core/skbuff.c      | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0c0377fc00c2..ba41d1a383f8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -782,7 +782,8 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
-				pfmemalloc:1;
+				pfmemalloc:1,
+				zc_netgpu:1;
 #ifdef CONFIG_SKB_EXTENSIONS
 	__u8			active_extensions;
 #endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b8afefe6f6b6..2a391042be53 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -992,6 +992,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	n->cloned = 1;
 	n->nohdr = 0;
 	n->peeked = 0;
+	C(zc_netgpu);
 	C(pfmemalloc);
 	n->destructor = NULL;
 	C(tail);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 12/21] mlx5: hook up the netgpu channel functions
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (10 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 11/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 13/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Hook up all the netgpu plumbing, except the enable/disable calls.
Those will be added after the netgpu module itself.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../mellanox/mlx5/core/en/netgpu/setup.c      |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 35 +++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 52 +++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c | 15 +++++-
 4 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
index f0578c41951d..76df316611fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
@@ -78,7 +78,7 @@ mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count)
 	 * doesn't consider any_cache_count.
 	 */
 	return ctx->napi_cache_count >= count ||
-		sq_cons_ready(&ctx->fill) >= (count - ctx->napi_cache_count);
+		sq_cons_avail(&ctx->fill, count - ctx->napi_cache_count);
 }
 
 void mlx5e_netgpu_taken(struct mlx5e_rq *rq)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 01d234369df6..c791578be5ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,6 +62,7 @@
 #include "en/xsk/setup.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
+#include "en/netgpu/setup.h"
 #include "en/hv_vhca_stats.h"
 #include "en/devlink.h"
 #include "lib/mlx5.h"
@@ -1955,6 +1956,24 @@ mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
 	return err;
 }
 
+static int
+mlx5e_netgpu_optional_open(struct mlx5e_priv *priv, int ix,
+			   struct mlx5e_params *params,
+			   struct mlx5e_channel_param *cparam,
+			   struct mlx5e_channel *c)
+{
+	struct netgpu_ctx *ctx;
+	int err = 0;
+
+	ctx = mlx5e_netgpu_get_ctx(params, params->xsk, ix);
+
+	if (ctx)
+		err = mlx5e_open_netgpu(priv, params, ctx, c);
+
+	return err;
+}
+
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
@@ -2002,6 +2021,13 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			goto err_close_queues;
 	}
 
+	/* This opens a second set of shadow queues for netgpu */
+	if (params->hd_split) {
+		err = mlx5e_netgpu_optional_open(priv, ix, params, cparam, c);
+		if (unlikely(err))
+			goto err_close_queues;
+	}
+
 	*cp = c;
 
 	return 0;
@@ -2037,6 +2063,9 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
 		mlx5e_deactivate_xsk(c);
 
+	if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+		mlx5e_deactivate_netgpu(c);
+
 	mlx5e_deactivate_rq(&c->rq);
 	mlx5e_deactivate_icosq(&c->icosq);
 	for (tc = 0; tc < c->num_tc; tc++)
@@ -2047,6 +2076,10 @@ static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
 	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
 		mlx5e_close_xsk(c);
+
+	if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+		mlx5e_close_netgpu(c);
+
 	mlx5e_close_queues(c);
 	netif_napi_del(&c->napi);
 
@@ -3012,11 +3045,13 @@ void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 	mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
 
 	mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
+	mlx5e_netgpu_redirect_rqts_to_channels(priv, &priv->channels);
 }
 
 void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 {
 	mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
+	mlx5e_netgpu_redirect_rqts_to_drop(priv, &priv->channels);
 
 	mlx5e_redirect_rqts_to_drop(priv);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dbb1c6323967..1edc157696f2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -50,6 +50,9 @@
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
 #include "en/health.h"
+#include "en/netgpu/setup.h"
+
+#include <net/netgpu.h>
 
 static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
 {
@@ -266,8 +269,11 @@ static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
 {
 	if (rq->umem)
 		return mlx5e_xsk_page_alloc_umem(rq, dma_info);
-	else
-		return mlx5e_page_alloc_pool(rq, dma_info);
+
+	if (dma_info->netgpu_source)
+		return mlx5e_netgpu_get_page(rq, dma_info);
+
+	return mlx5e_page_alloc_pool(rq, dma_info);
 }
 
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
@@ -279,6 +285,9 @@ void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
 				struct mlx5e_dma_info *dma_info,
 				bool recycle)
 {
+	if (dma_info->netgpu_source)
+		return mlx5e_netgpu_put_page(rq, dma_info, recycle);
+
 	if (likely(recycle)) {
 		if (mlx5e_rx_cache_put(rq, dma_info))
 			return;
@@ -394,6 +403,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
 			return -ENOMEM;
 	}
 
+	if (rq->netgpu && !mlx5e_netgpu_avail(rq, wqe_bulk))
+		return -ENOMEM;
+
 	for (i = 0; i < wqe_bulk; i++) {
 		struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
 
@@ -402,6 +414,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
 			goto free_wqes;
 	}
 
+	if (rq->netgpu)
+		mlx5e_netgpu_taken(rq);
+
 	return 0;
 
 free_wqes:
@@ -416,12 +431,17 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
 		   struct mlx5e_dma_info *di, u32 frag_offset, u32 len,
 		   unsigned int truesize)
 {
+	/* XXX skip this if netgpu_source... */
 	dma_sync_single_for_cpu(rq->pdev,
 				di->addr + frag_offset,
 				len, DMA_FROM_DEVICE);
-	page_ref_inc(di->page);
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 			di->page, frag_offset, len, truesize);
+
+	if (skb->zc_netgpu)
+		di->page = NULL;
+	else
+		page_ref_inc(di->page);
 }
 
 static inline void
@@ -1109,16 +1129,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
-	u16 headlen      = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
+	bool hd_split	 = rq->netgpu;
+	u16 header_len	 = hd_split ? TOTAL_HEADERS : MLX5E_RX_MAX_HEAD;
+	u16 headlen      = min_t(u32, header_len, cqe_bcnt);
 	u16 frag_headlen = headlen;
 	u16 byte_cnt     = cqe_bcnt - headlen;
 	struct sk_buff *skb;
 
+	/* RST packets may have short headers (74) and no payload */
+	if (hd_split && headlen != TOTAL_HEADERS && byte_cnt) {
+		/* XXX add drop counter */
+		pr_warn_once("BAD hd_split: headlen %d != %d\n",
+			     headlen, TOTAL_HEADERS);
+		return NULL;
+	}
+
 	/* XDP is not supported in this configuration, as incoming packets
 	 * might spread among multiple pages.
 	 */
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(header_len, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1126,6 +1156,18 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 	prefetchw(skb->data);
 
+	if (hd_split) {
+		/* The first frag contains only headers; skip it and assume
+		 * that all of the headers have already been copied into the
+		 * skb's inline data.
+		 */
+		frag_info++;
+		frag_headlen = 0;
+		wi++;
+
+		skb->zc_netgpu = 1;
+	}
+
 	while (byte_cnt) {
 		u16 frag_consumed_bytes =
 			min_t(u16, frag_info->frag_size - frag_headlen, byte_cnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 8480278f2ee2..1c646a6dc29a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -122,6 +122,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	struct mlx5e_rq *xskrq = &c->xskrq;
 	struct mlx5e_rq *rq = &c->rq;
 	bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+	bool netgpu_open = test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
 	bool aff_change = false;
 	bool busy_xsk = false;
 	bool busy = false;
@@ -139,7 +140,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
 
 	if (likely(budget)) { /* budget=0 means: don't poll rx rings */
-		if (xsk_open)
+		if (xsk_open || netgpu_open)
 			work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
 
 		if (likely(budget - work_done))
@@ -154,6 +155,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 				mlx5e_post_rx_mpwqes,
 				mlx5e_post_rx_wqes,
 				rq);
+
+	if (netgpu_open) {
+		mlx5e_poll_ico_cq(&c->xskicosq.cq);
+		busy_xsk |= xskrq->post_wqes(xskrq);
+	}
+
 	if (xsk_open) {
 		if (mlx5e_poll_ico_cq(&c->xskicosq.cq))
 			/* Don't clear the flag if nothing was polled to prevent
@@ -191,6 +198,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	mlx5e_cq_arm(&c->icosq.cq);
 	mlx5e_cq_arm(&c->xdpsq.cq);
 
+	if (netgpu_open) {
+		mlx5e_handle_rx_dim(xskrq);
+		mlx5e_cq_arm(&c->xskicosq.cq);
+		mlx5e_cq_arm(&xskrq->cq);
+	}
+
 	if (xsk_open) {
 		mlx5e_handle_rx_dim(xskrq);
 		mlx5e_cq_arm(&c->xskicosq.cq);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 13/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (11 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 12/21] mlx5: hook up the netgpu channel functions Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 14/21] kernel: export free_uid Jonathan Lemon
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This command will be used to set up and tear down netgpu queues.
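
As a rough sketch of the driver side (the function name here is
illustrative; mlx5e_netgpu_setup_ctx() is the helper declared in
patch 10, and the real mlx5 hookup arrives in patch 21), an ndo_bpf
handler would dispatch the new command along these lines:

    static int mlx5e_netgpu_bpf(struct net_device *dev, struct netdev_bpf *bpf)
    {
            switch (bpf->command) {
            case XDP_SETUP_NETGPU:
                    /* a NULL ctx tears the queue back down */
                    return mlx5e_netgpu_setup_ctx(dev, bpf->netgpu.ctx,
                                                  bpf->netgpu.queue_id);
            default:
                    return -EINVAL;
            }
    }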

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6fc613ed8eae..ea3c15ef0f29 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -880,6 +880,7 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
 	XDP_SETUP_XSK_UMEM,
+	XDP_SETUP_NETGPU,
 };
 
 struct bpf_prog_offload_ops;
@@ -911,6 +912,11 @@ struct netdev_bpf {
 			struct xdp_umem *umem;
 			u16 queue_id;
 		} xsk;
+		/* XDP_SETUP_NETGPU */
+		struct {
+			struct netgpu_ctx *ctx;
+			u16 queue_id;
+		} netgpu;
 	};
 };
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 14/21] kernel: export free_uid
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (12 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 13/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 15/21] netgpu: add network/gpu dma module Jonathan Lemon
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

get_uid() is a static inline which can be called from a module, so
free_uid() should also be callable from module code.
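
For context, the pairing that the netgpu module (patch 15) relies on
looks roughly like this sketch:

    struct user_struct *user;

    user = get_uid(current_user());   /* static inline, fine in modules */
    /* ... charge pinned pages against user->locked_vm ... */
    free_uid(user);                   /* needs this export from a module */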

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 kernel/user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/user.c b/kernel/user.c
index b1635d94a1f2..1e015abf0a2b 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -171,6 +171,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 15/21] netgpu: add network/gpu dma module
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (13 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 14/21] kernel: export free_uid Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 16/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Netgpu provides a data path for zero-copy TCP sends and receives
directly to GPU memory.  TCP processing is done on the host CPU,
while data is DMA'd to and from device memory.

The use case for this module is GPUs used for machine learning,
which are located near the NICs and have a high-bandwidth PCI
connection between the GPU and NIC.

This initial working code is a proof of concept, for discussion.
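
For orientation, a rough sketch of the userspace control path the
module exposes (layouts come from the new <uapi/misc/netgpu.h>; error
handling and includes are elided, buf/buf_len are assumed to name an
existing host mapping, and the queue sizes are arbitrary):

    int fd = open("/dev/netgpu", O_RDWR);
    int ifindex = if_nametoindex("eth0");

    /* attach the context to a netdevice */
    ioctl(fd, NETGPU_IOCTL_ATTACH_DEV, &ifindex);

    /* register a memory region (host memory in this example) */
    struct dma_region r = {
            .iov         = { .iov_base = buf, .iov_len = buf_len },
            .host_memory = 1,
    };
    ioctl(fd, NETGPU_IOCTL_ADD_REGION, &r);

    /* create the fill and rx rings, bound to one device queue */
    struct netgpu_params p = {
            .queue_id = 0,
            .fill = { .elt_sz = sizeof(uint64_t),     .entries = 1024 },
            .rx   = { .elt_sz = sizeof(struct iovec), .entries = 1024 },
    };
    ioctl(fd, NETGPU_IOCTL_BIND_QUEUE, &p);

    /* map the rings at the sizes/offsets the kernel wrote back */
    void *fill = mmap(NULL, p.fill.map_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, p.fill.map_off);
    void *rx   = mmap(NULL, p.rx.map_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, p.rx.map_off);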

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/misc/Kconfig         |    1 +
 drivers/misc/Makefile        |    1 +
 drivers/misc/netgpu/Kconfig  |   10 +
 drivers/misc/netgpu/Makefile |   11 +
 drivers/misc/netgpu/nvidia.c | 1516 ++++++++++++++++++++++++++++++++++
 include/uapi/misc/netgpu.h   |   43 +
 6 files changed, 1582 insertions(+)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/nvidia.c
 create mode 100644 include/uapi/misc/netgpu.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index e1b1ba5e2b92..13ae8e55d2a2 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -472,4 +472,5 @@ source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 source "drivers/misc/habanalabs/Kconfig"
 source "drivers/misc/uacce/Kconfig"
+source "drivers/misc/netgpu/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index c7bd01ac6291..e026fe95a629 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
 obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
+obj-$(CONFIG_NETGPU)		+= netgpu/
diff --git a/drivers/misc/netgpu/Kconfig b/drivers/misc/netgpu/Kconfig
new file mode 100644
index 000000000000..f67adf825c1b
--- /dev/null
+++ b/drivers/misc/netgpu/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# NetGPU framework
+#
+
+config NETGPU
+	tristate "Network/GPU driver"
+	depends on PCI
+	---help---
+	  Experimental Network / GPU driver
diff --git a/drivers/misc/netgpu/Makefile b/drivers/misc/netgpu/Makefile
new file mode 100644
index 000000000000..fe58963efdf7
--- /dev/null
+++ b/drivers/misc/netgpu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+pkg = /home/bsd/local/pull/nvidia/NVIDIA-Linux-x86_64-440.59/kernel
+
+obj-$(CONFIG_NETGPU) := netgpu.o
+
+netgpu-y := nvidia.o
+
+# netgpu-$(CONFIG_DEBUG_FS) += debugfs.o
+
+ccflags-y += -I$(pkg)
diff --git a/drivers/misc/netgpu/nvidia.c b/drivers/misc/netgpu/nvidia.c
new file mode 100644
index 000000000000..a0ea82effb2f
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia.c
@@ -0,0 +1,1516 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/seq_file.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/interval_tree.h>
+
+#include <net/tcp.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+
+/* XXX enable if using nvidia - will be split out to its own file */
+//#define USE_CUDA 1
+
+#ifdef USE_CUDA
+#include "nvidia/nv-p2p.h"
+#endif
+
+/* nvidia GPU uses 64K pages */
+#define GPU_PAGE_SHIFT		16
+#define GPU_PAGE_SIZE		(1UL << GPU_PAGE_SHIFT)
+#define GPU_PAGE_MASK		(GPU_PAGE_SIZE - 1)
+
+/* self is 3 so skb_netgpu_unref does not catch the dummy page */
+#define NETGPU_REFC_MAX		0xff00
+#define NETGPU_REFC_SELF	3
+#define NETGPU_REFC_EXTRA	(NETGPU_REFC_MAX - NETGPU_REFC_SELF)
+
+static struct mutex netgpu_lock;
+static unsigned int netgpu_index;
+static DEFINE_XARRAY(xa_netgpu);
+static const struct file_operations netgpu_fops;
+
+/* XXX hack */
+static void (*sk_data_ready)(struct sock *sk);
+static struct proto netgpu_prot;
+
+#ifdef USE_CUDA
+/* page_range represents one contiguous GPU PA region */
+struct netgpu_page_range {
+	unsigned long pfn;
+	struct resource *res;
+	struct netgpu_region *region;
+	struct interval_tree_node va_node;
+};
+#endif
+
+struct netgpu_pginfo {
+	unsigned long addr;
+	dma_addr_t dma;
+};
+
+#define NETGPU_CACHE_COUNT	63
+
+/* region represents GPU VA region backed by gpu_pgtbl
+ * as the region is VA, the PA ranges may be discontiguous
+ */
+struct netgpu_region {
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+	struct nvidia_p2p_dma_mapping *dmamap;
+	struct netgpu_pginfo *pginfo;
+	struct page **page;
+	struct netgpu_ctx *ctx;
+	unsigned long start;
+	unsigned long len;
+	struct rb_root_cached root;
+	unsigned host_memory : 1;
+};
+
+static inline struct device *
+netdev2device(struct net_device *dev)
+{
+	return dev->dev.parent;			/* from SET_NETDEV_DEV() */
+}
+
+static inline struct pci_dev *
+netdev2pci_dev(struct net_device *dev)
+{
+	return to_pci_dev(netdev2device(dev));
+}
+
+#ifdef USE_CUDA
+static int nvidia_pg_size[] = {
+	[NVIDIA_P2P_PAGE_SIZE_4KB]   =	 4 * 1024,
+	[NVIDIA_P2P_PAGE_SIZE_64KB]  =	64 * 1024,
+	[NVIDIA_P2P_PAGE_SIZE_128KB] = 128 * 1024,
+};
+
+static void netgpu_cuda_free_region(struct netgpu_region *r);
+#endif
+static void netgpu_free_ctx(struct netgpu_ctx *ctx);
+static int netgpu_add_region(struct netgpu_ctx *ctx, void __user *arg);
+
+#ifdef USE_CUDA
+#define node2page_range(itn) \
+	container_of(itn, struct netgpu_page_range, va_node)
+
+#define region_for_each(r, idx, itn, pr)				\
+	for (idx = r->start,						\
+		itn = interval_tree_iter_first(r->root, idx, r->last);	\
+	     pr = container_of(itn, struct netgpu_page_range, va_node),	\
+		itn;							\
+	     idx = itn->last + 1,					\
+		itn = interval_tree_iter_next(itn, idx, r->last))
+
+#define region_remove_each(r, itn) \
+	while ((itn = interval_tree_iter_first(&r->root, r->start, \
+					       r->start + r->len - 1)) && \
+	       (interval_tree_remove(itn, &r->root), 1))
+
+static inline struct netgpu_page_range *
+region_find(struct netgpu_region *r, unsigned long start, int count)
+{
+	struct interval_tree_node *itn;
+	unsigned long last;
+
+	last = start + count * PAGE_SIZE - 1;
+
+	itn = interval_tree_iter_first(&r->root, start, last);
+	return itn ? node2page_range(itn) : 0;
+}
+
+static void
+netgpu_cuda_pgtbl_cb(void *data)
+{
+	struct netgpu_region *r = data;
+
+	netgpu_cuda_free_region(r);
+}
+
+static void
+netgpu_init_pages(u64 va, unsigned long pfn_start, unsigned long pfn_end)
+{
+	unsigned long pfn;
+	struct page *page;
+
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		mm_zero_struct_page(page);
+
+		set_page_count(page, 2);	/* matches host logic */
+		page->page_type = 7;		/* XXX differential flag */
+		__SetPageReserved(page);
+
+		set_page_private(page, va);
+		va += PAGE_SIZE;
+	}
+}
+
+static struct resource *
+netgpu_add_pages(int nid, u64 start, u64 end)
+{
+	struct mhp_restrictions restrict = {
+		.flags = MHP_MEMBLOCK_API,
+	};
+
+	return add_memory_pages(nid, start, end - start, &restrict);
+}
+
+static void
+netgpu_free_pages(struct resource *res)
+{
+	release_memory_pages(res);
+}
+
+static int
+netgpu_remap_pages(struct netgpu_region *r, u64 va, u64 start, u64 end)
+{
+	struct netgpu_page_range *pr;
+	struct resource *res;
+
+	pr = kmalloc(sizeof(*pr), GFP_KERNEL);
+	if (!pr)
+		return -ENOMEM;
+
+	res = netgpu_add_pages(numa_mem_id(), start, end);
+	if (IS_ERR(res)) {
+		kfree(pr);
+		return PTR_ERR(res);
+	}
+
+	pr->pfn = PHYS_PFN(start);
+	pr->region = r;
+	pr->va_node.start = va;
+	pr->va_node.last = va + (end - start) - 1;
+	pr->res = res;
+
+	netgpu_init_pages(va, PHYS_PFN(start), PHYS_PFN(end));
+
+//	spin_lock(&r->lock);
+	interval_tree_insert(&pr->va_node, &r->root);
+//	spin_unlock(&r->lock);
+
+	return 0;
+}
+
+static int
+netgpu_cuda_map_region(struct netgpu_region *r)
+{
+	struct pci_dev *pdev;
+	int ret;
+
+	pdev = netdev2pci_dev(r->ctx->dev);
+
+	/*
+	 * takes PA from pgtbl, performs mapping, saves mapping
+	 * dma_mapping holds dma mapped addresses, and pdev.
+	 * mem_info contains pgtbl and mapping list.  mapping is added to list.
+	 * rm_p2p_dma_map_pages() does the work.
+	 */
+	ret = nvidia_p2p_dma_map_pages(pdev, r->gpu_pgtbl, &r->dmamap);
+	if (ret) {
+		pr_err("dma map failed: %d\n", ret);
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * makes GPU pages at va available to other devices.
+ * expensive operation.
+ */
+static int
+netgpu_cuda_add_region(struct netgpu_ctx *ctx, const struct iovec *iov)
+{
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+	struct netgpu_region *r;
+	u64 va, size, start, end, pa;
+	int i, count, gpu_pgsize;
+	int ret;
+
+	start = (u64)iov->iov_base;
+	va = round_down(start, GPU_PAGE_SIZE);
+	size = round_up(start - va + iov->iov_len, GPU_PAGE_SIZE);
+	count = size / PAGE_SIZE;
+
+	ret = -ENOMEM;
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		goto out;
+
+	/*
+	 * allocates page table, sets gpu_uuid to owning gpu.
+	 * allocates page array, set PA for each page.
+	 * sets page_size (64K here)
+	 * rm_p2p_get_pages() does the actual work.
+	 */
+	ret = nvidia_p2p_get_pages(0, 0, va, size, &gpu_pgtbl,
+				   netgpu_cuda_pgtbl_cb, r);
+	if (ret) {
+		kfree(r);
+		goto out;
+	}
+
+	/* gpu pgtbl owns r, will free via netgpu_cuda_pgtbl_cb */
+	r->gpu_pgtbl = gpu_pgtbl;
+
+	r->start = va;
+	r->len = size;
+	r->root = RB_ROOT_CACHED;
+//	spin_lock_init(&r->lock);
+
+	if (!NVIDIA_P2P_PAGE_TABLE_VERSION_COMPATIBLE(gpu_pgtbl)) {
+		pr_err("incompatible page table\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	gpu_pgsize = nvidia_pg_size[gpu_pgtbl->page_size];
+	if (count != gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE) {
+		pr_err("GPU page count %d != host page count %d\n",
+		       gpu_pgtbl->entries, count);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = xa_err(xa_store_range(&ctx->xa, va >> PAGE_SHIFT,
+				    (va + size) >> PAGE_SHIFT,
+				    r, GFP_KERNEL));
+	if (ret)
+		goto out;
+
+	r->ctx = ctx;
+	refcount_inc(&ctx->ref);
+
+	ret = netgpu_cuda_map_region(r);
+	if (ret)
+		goto out;
+
+	start = U64_MAX;
+	end = 0;
+
+	for (i = 0; i < gpu_pgtbl->entries; i++) {
+		pa = gpu_pgtbl->pages[i]->physical_address;
+		if (pa != end) {
+			if (end) {
+				ret = netgpu_remap_pages(r, va, start, end);
+				if (ret)
+					goto out;
+			}
+			start = pa;
+			va = r->start + i * gpu_pgsize;
+		}
+		end = pa + gpu_pgsize;
+	}
+	ret = netgpu_remap_pages(r, va, start, end);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	return ret;
+}
+#endif
+
+static void
+netgpu_host_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int
+netgpu_host_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+
+	return 0;
+}
+
+static unsigned
+netgpu_init_region(struct netgpu_region *r, const struct iovec *iov,
+		   unsigned align)
+{
+	u64 addr = (u64)iov->iov_base;
+	u32 len = iov->iov_len;
+	unsigned nr_pages;
+
+	r->root = RB_ROOT_CACHED;
+//	spin_lock_init(&r->lock);
+
+	r->start = round_down(addr, align);
+	r->len = round_up(addr - r->start + len, align);
+	nr_pages = r->len / PAGE_SIZE;
+
+	r->page = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	r->pginfo = kvmalloc_array(nr_pages, sizeof(struct netgpu_pginfo),
+				   GFP_KERNEL);
+	if (!r->page || !r->pginfo)
+		return 0;
+
+	return nr_pages;
+}
+
+/* NOTE: nr_pages may be negative on error. */
+static void
+netgpu_host_put_pages(struct netgpu_region *r, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		put_page(r->page[i]);
+}
+
+static void
+netgpu_host_release_pages(struct netgpu_region *r, int nr_pages)
+{
+	struct device *device;
+	int i;
+
+	device = netdev2device(r->ctx->dev);
+
+	for (i = 0; i < nr_pages; i++) {
+		dma_unmap_page(device, r->pginfo[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		put_page(r->page[i]);
+	}
+}
+
+static bool
+netgpu_host_setup_pages(struct netgpu_region *r, unsigned nr_pages)
+{
+	struct device *device;
+	struct page *page;
+	dma_addr_t dma;
+	u64 addr;
+	int i;
+
+	device = netdev2device(r->ctx->dev);
+
+	addr = r->start;
+	for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
+		page = r->page[i];
+		dma = dma_map_page(device, page, 0, PAGE_SIZE,
+				   DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(device, dma)))
+			goto out;
+
+		r->pginfo[i].dma = dma;
+		r->pginfo[i].addr = addr;
+	}
+	return true;
+
+out:
+	while (i--)
+		dma_unmap_page(device, r->pginfo[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+
+	return false;
+}
+
+static int
+netgpu_host_add_region(struct netgpu_ctx *ctx, const struct iovec *iov)
+{
+	struct netgpu_region *r;
+	int err, nr_pages;
+	int count = 0;
+
+	err = -ENOMEM;
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return err;
+
+	r->ctx = ctx;			/* no refcount for host regions */
+	r->host_memory = true;
+
+	nr_pages = netgpu_init_region(r, iov, PAGE_SIZE);
+	if (!nr_pages)
+		goto out;
+
+	if (ctx->account_mem) {
+		err = netgpu_host_account_mem(ctx->user, nr_pages);
+		if (err)
+			goto out;
+	}
+
+	/* XXX should this be pin_user_pages? */
+	mmap_read_lock(current->mm);
+	count = get_user_pages(r->start, nr_pages,
+			       FOLL_WRITE | FOLL_LONGTERM,
+			       r->page, NULL);
+	mmap_read_unlock(current->mm);
+
+	if (count != nr_pages) {
+		err = count < 0 ? count : -EFAULT;
+		goto out;
+	}
+
+	if (!netgpu_host_setup_pages(r, count))
+		goto out;
+
+	err = xa_err(xa_store_range(&ctx->xa, r->start >> PAGE_SHIFT,
+				    (r->start + r->len) >> PAGE_SHIFT,
+				    r, GFP_KERNEL));
+	if (err)
+		goto out;
+
+	return 0;
+
+out:
+	if (ctx->account_mem)
+		netgpu_host_unaccount_mem(ctx->user, nr_pages);
+	netgpu_host_put_pages(r, count);
+	kvfree(r->page);
+	kvfree(r->pginfo);
+	kfree(r);
+
+	return err;
+}
+
+static int
+netgpu_add_region(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct dma_region d;
+	int err = -EIO;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&d, arg, sizeof(d)))
+		return -EFAULT;
+
+	if (d.host_memory)
+		err = netgpu_host_add_region(ctx, &d.iov);
+#ifdef USE_CUDA
+	else
+		err = netgpu_cuda_add_region(ctx, &d.iov);
+#endif
+
+	return err;
+}
+
+#ifdef USE_CUDA
+static void
+region_get_pages(struct page **pages, unsigned long pfn, int n)
+{
+	struct page *p;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		p = pfn_to_page(pfn + i);
+		get_page(p);
+		pages[i] = p;
+	}
+}
+
+static int
+netgpu_cuda_get_page(struct netgpu_region *r, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx;
+	struct page *p;
+
+	pr = region_find(r, addr, 1);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+
+	p = pfn_to_page(pr->pfn + idx);
+	get_page(p);
+	*page = p;
+	*dma = page_to_phys(p);		/* XXX can get away with this for now */
+
+	return 0;
+}
+
+static int
+netgpu_cuda_get_pages(struct netgpu_region *r, struct page **pages,
+		     unsigned long addr, int count)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx, end;
+	int n;
+
+	pr = region_find(r, addr, count);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+	end = (pr->va_node.last - pr->va_node.start) >> PAGE_SHIFT;
+	n = end - idx + 1;
+	n = min(count, n);
+
+	region_get_pages(pages, pr->pfn + idx, n);
+
+	return n;
+}
+#endif
+
+/* Used by the lib/iov_iter to obtain a set of pages for TX */
+static int
+netgpu_host_get_pages(struct netgpu_region *r, struct page **pages,
+		      unsigned long addr, int count)
+{
+	unsigned long idx;
+	struct page *p;
+	int i, n;
+
+	idx = (addr - r->start) >> PAGE_SHIFT;
+	n = (r->len >> PAGE_SHIFT) - idx + 1;
+	n = min(count, n);
+
+	for (i = 0; i < n; i++) {
+		p = r->page[idx + i];
+		get_page(p);
+		pages[i] = p;
+	}
+
+	return n;
+}
+
+/* Used by the driver to obtain the backing store page for a fill address */
+static int
+netgpu_host_get_page(struct netgpu_region *r, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	unsigned long idx;
+	struct page *p;
+
+	idx = (addr - r->start) >> PAGE_SHIFT;
+
+	p = r->page[idx];
+	get_page(p);
+	set_page_private(p, addr);
+	*page = p;
+	*dma = r->pginfo[idx].dma;
+
+	return 0;
+}
+
+static void
+__netgpu_put_page_any(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_pgcache *cache = ctx->any_cache;
+	unsigned count;
+	size_t sz;
+
+	/* unsigned: count == -1 if !cache, so the check will fail. */
+	count = ctx->any_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		cache->page[count] = page;
+		ctx->any_cache_count = count + 1;
+		return;
+	}
+
+	sz = struct_size(cache, page, NETGPU_CACHE_COUNT);
+	cache = kmalloc(sz, GFP_ATOMIC);
+	if (!cache) {
+		/* XXX fixme */
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	cache->next = ctx->any_cache;
+
+	cache->page[0] = page;
+	ctx->any_cache = cache;
+	ctx->any_cache_count = 1;
+}
+
+static void
+netgpu_put_page_any(struct netgpu_ctx *ctx, struct page *page)
+{
+	spin_lock(&ctx->pgcache_lock);
+
+	__netgpu_put_page_any(ctx, page);
+
+	spin_unlock(&ctx->pgcache_lock);
+}
+
+static void
+netgpu_put_page_napi(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_pgcache *spare;
+	unsigned count;
+	size_t sz;
+
+	count = ctx->napi_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		ctx->napi_cache->page[count] = page;
+		ctx->napi_cache_count = count + 1;
+		return;
+	}
+
+	spare = ctx->spare_cache;
+	if (spare) {
+		ctx->spare_cache = NULL;
+		goto out;
+	}
+
+	sz = struct_size(spare, page, NETGPU_CACHE_COUNT);
+	spare = kmalloc(sz, GFP_ATOMIC);
+	if (!spare) {
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	spare->next = ctx->napi_cache;
+
+out:
+	spare->page[0] = page;
+	ctx->napi_cache = spare;
+	ctx->napi_cache_count = 1;
+}
+
+void
+netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi)
+{
+	if (napi)
+		netgpu_put_page_napi(ctx, page);
+	else
+		netgpu_put_page_any(ctx, page);
+}
+EXPORT_SYMBOL(netgpu_put_page);
+
+static int
+netgpu_swap_caches(struct netgpu_ctx *ctx, struct netgpu_pgcache **cachep)
+{
+	int count;
+
+	spin_lock(&ctx->pgcache_lock);
+
+	count = ctx->any_cache_count;
+	*cachep = ctx->any_cache;
+	ctx->any_cache = ctx->napi_cache;
+	ctx->any_cache_count = 0;
+
+	spin_unlock(&ctx->pgcache_lock);
+
+	return count;
+}
+
+static struct page *
+netgpu_get_cached_page(struct netgpu_ctx *ctx)
+{
+	struct netgpu_pgcache *cache = ctx->napi_cache;
+	struct page *page;
+	int count;
+
+	count = ctx->napi_cache_count;
+
+	if (!count) {
+		if (cache->next) {
+			if (ctx->spare_cache)
+				kfree(ctx->spare_cache);
+			ctx->spare_cache = cache;
+			cache = cache->next;
+			count = NETGPU_CACHE_COUNT;
+			goto out;
+		}
+
+		/* lockless read of any count - if >0, skip */
+		count = READ_ONCE(ctx->any_cache_count);
+		if (count > 0) {
+			count = netgpu_swap_caches(ctx, &cache);
+			goto out;
+		}
+
+		return NULL;
+out:
+		ctx->napi_cache = cache;
+	}
+
+	page = cache->page[--count];
+	ctx->napi_cache_count = count;
+
+	return page;
+}
+
+/*
+ * Free cache structures.  Pages have already been released.
+ */
+static void
+netgpu_free_cache(struct netgpu_ctx *ctx)
+{
+	struct netgpu_pgcache *cache, *next;
+
+	if (ctx->spare_cache)
+		kfree(ctx->spare_cache);
+	for (cache = ctx->napi_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+	for (cache = ctx->any_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+}
+
+/*
+ * Called from iov_iter when addr is provided for TX.
+ */
+int
+netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count)
+{
+	struct netgpu_region *r;
+	struct netgpu_ctx *ctx;
+	int n = 0;
+
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (!ctx)
+		return -EEXIST;
+
+	r = xa_load(&ctx->xa, addr >> PAGE_SHIFT);
+	if (!r)
+		return -EINVAL;
+
+	if (r->host_memory)
+		n = netgpu_host_get_pages(r, pages, addr, count);
+#ifdef USE_CUDA
+	else
+		n = netgpu_cuda_get_pages(r, pages, addr, count);
+#endif
+
+	return n;
+}
+EXPORT_SYMBOL(netgpu_get_pages);
+
+static int
+netgpu_get_fill_page(struct netgpu_ctx *ctx, dma_addr_t *dma,
+		     struct page **page)
+{
+	struct netgpu_region *r;
+	u64 *addrp, addr;
+	int ret = 0;
+
+	addrp = sq_cons_peek(&ctx->fill);
+	if (!addrp)
+		return -ENOMEM;
+
+	addr = READ_ONCE(*addrp);
+
+	r = xa_load(&ctx->xa, addr >> PAGE_SHIFT);
+	if (!r)
+		return -EINVAL;
+
+	if (r->host_memory)
+		ret = netgpu_host_get_page(r, addr, page, dma);
+#ifdef USE_CUDA
+	else
+		ret = netgpu_cuda_get_page(r, addr, page, dma);
+#endif
+
+	if (!ret)
+		sq_cons_advance(&ctx->fill);
+
+	return ret;
+}
+
+static dma_addr_t
+netgpu_page_get_dma(struct netgpu_ctx *ctx, struct page *page)
+{
+	return page_to_phys(page);		/* XXX cheat for now... */
+}
+
+int
+netgpu_get_page(struct netgpu_ctx *ctx, struct page **page, dma_addr_t *dma)
+{
+	struct page *p;
+
+	p = netgpu_get_cached_page(ctx);
+	if (p) {
+		page_ref_inc(p);
+		*dma = netgpu_page_get_dma(ctx, p);
+		*page = p;
+		return 0;
+	}
+
+	return netgpu_get_fill_page(ctx, dma, page);
+}
+EXPORT_SYMBOL(netgpu_get_page);
+
+static struct page *
+netgpu_get_dummy_page(struct netgpu_ctx *ctx)
+{
+	ctx->page_extra_refc--;
+	if (unlikely(!ctx->page_extra_refc)) {
+		page_ref_add(ctx->dummy_page, NETGPU_REFC_EXTRA);
+		ctx->page_extra_refc = NETGPU_REFC_EXTRA;
+	}
+	return ctx->dummy_page;
+}
+
+/* Our version of __skb_datagram_iter */
+static int
+netgpu_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+		unsigned int offset, size_t len)
+{
+	struct netgpu_ctx *ctx = desc->arg.data;
+	struct sk_buff *frag_iter;
+	struct iovec *iov;
+	struct page *page;
+	unsigned start;
+	int i, used;
+	u64 addr;
+
+	if (skb_headlen(skb)) {
+		pr_err("zc socket receiving non-zc data");
+		return -EFAULT;
+	}
+
+	used = 0;
+	start = 0;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag;
+		int end, off, frag_len;
+
+		frag = &skb_shinfo(skb)->frags[i];
+		frag_len = skb_frag_size(frag);
+
+		end = start + frag_len;
+		if (offset < end) {
+			off = offset - start;
+
+			iov = sq_prod_reserve(&ctx->rx);
+			if (!iov)
+				break;
+
+			page = skb_frag_page(frag);
+			addr = (u64)page_private(page) + off;
+
+			iov->iov_base = (void *)(addr + skb_frag_off(frag));
+			iov->iov_len = frag_len - off;
+
+			used += (frag_len - off);
+			offset += (frag_len - off);
+
+			put_page(page);
+			page = netgpu_get_dummy_page(ctx);
+			__skb_frag_set_page(frag, page);
+		}
+		start = end;
+	}
+
+	if (used)
+		sq_prod_submit(&ctx->rx);
+
+	skb_walk_frags(skb, frag_iter) {
+		int end, off, ret;
+
+		end = start + frag_iter->len;
+		if (offset < end) {
+			off = offset - start;
+			len = frag_iter->len - off;
+
+			ret = netgpu_recv_skb(desc, frag_iter, off, len);
+			if (ret < 0) {
+				if (!used)
+					used = ret;
+				goto out;
+			}
+			used += ret;
+			if (ret < len)
+				goto out;
+			offset += ret;
+		}
+		start = end;
+	}
+
+out:
+	return used;
+}
+
+static void
+netgpu_read_sock(struct sock *sk, struct netgpu_ctx *ctx)
+{
+	read_descriptor_t desc;
+	int used;
+
+	desc.arg.data = ctx;
+	desc.count = 1;
+	used = tcp_read_sock(sk, &desc, netgpu_recv_skb);
+}
+
+static void
+netgpu_data_ready(struct sock *sk)
+{
+	struct netgpu_ctx *ctx;
+
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (ctx && ctx->rx.entries)
+		netgpu_read_sock(sk, ctx);
+
+	sk_data_ready(sk);
+}
+
+static bool netgpu_stream_memory_read(const struct sock *sk)
+{
+	struct netgpu_ctx *ctx;
+	bool empty = false;
+
+	/* sk is not locked.  called from poll, so not sp. */
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (ctx)
+		empty = sq_empty(&ctx->rx);
+
+	return !empty;
+}
+
+static struct netgpu_ctx *
+netgpu_file_to_ctx(struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct netgpu_ctx *ctx = seq->private;
+
+	return ctx;
+}
+
+int
+netgpu_register_dma(struct sock *sk, void __user *optval, unsigned int optlen)
+{
+	struct fd f;
+	int netgpu_fd;
+	struct netgpu_ctx *ctx;
+
+	if (sk->sk_user_data)
+		return -EALREADY;
+	if (optlen < sizeof(netgpu_fd))
+		return -EINVAL;
+	if (copy_from_user(&netgpu_fd, optval, sizeof(netgpu_fd)))
+		return -EFAULT;
+
+	f = fdget(netgpu_fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (f.file->f_op != &netgpu_fops) {
+		fdput(f);
+		return -EOPNOTSUPP;
+	}
+
+	/* XXX should really have some way to identify sk_user_data type */
+	ctx = netgpu_file_to_ctx(f.file);
+	sk->sk_user_data = (void *)(uintptr_t)ctx->index;
+
+	fdput(f);
+
+	if (!sk_data_ready)
+		sk_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = netgpu_data_ready;
+
+	/* XXX does not do any checking here */
+	if (!netgpu_prot.stream_memory_read) {
+		netgpu_prot = *sk->sk_prot;
+		netgpu_prot.stream_memory_read = netgpu_stream_memory_read;
+	}
+	sk->sk_prot = &netgpu_prot;
+
+	return 0;
+}
+EXPORT_SYMBOL(netgpu_register_dma);
+
+static int
+netgpu_validate_queue(struct netgpu_user_queue *q, unsigned elt_size,
+		      unsigned map_off)
+{
+	struct shared_queue_map *map;
+	unsigned count;
+	size_t size;
+
+	if (q->elt_sz != elt_size)
+		return -EINVAL;
+
+	count = roundup_pow_of_two(q->entries);
+	if (!count)
+		return -EINVAL;
+	q->entries = count;
+	q->mask = count - 1;
+
+	size = struct_size(map, data, count * elt_size);
+	if (size == SIZE_MAX || size > U32_MAX)
+		return -EOVERFLOW;
+	q->map_sz = size;
+
+	q->map_off = map_off;
+
+	return 0;
+}
+
+static int
+netgpu_validate_param(struct netgpu_ctx *ctx, struct netgpu_params *p)
+{
+	int rc;
+
+	if (ctx->queue_id != -1)
+		return -EALREADY;
+
+	rc = netgpu_validate_queue(&p->fill, sizeof(u64), NETGPU_OFF_FILL_ID);
+	if (rc)
+		return rc;
+
+	rc = netgpu_validate_queue(&p->rx, sizeof(struct iovec),
+				   NETGPU_OFF_RX_ID);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+static int
+netgpu_queue_create(struct shared_queue *q, struct netgpu_user_queue *u)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+			  __GFP_COMP | __GFP_NORETRY;
+	struct shared_queue_map *map;
+
+	map = (void *)__get_free_pages(gfp_flags, get_order(u->map_sz));
+	if (!map)
+		return -ENOMEM;
+
+	q->map_ptr = map;
+	q->prod = &map->prod;
+	q->cons = &map->cons;
+	q->data = &map->data[0];
+	q->elt_sz = u->elt_sz;
+	q->mask = u->mask;
+	q->entries = u->entries;
+
+	memset(&u->off, 0, sizeof(u->off));
+	u->off.prod = offsetof(struct shared_queue_map, prod);
+	u->off.cons = offsetof(struct shared_queue_map, cons);
+	u->off.desc = offsetof(struct shared_queue_map, data);
+
+	return 0;
+}
+
+static int
+netgpu_bind_device(struct netgpu_ctx *ctx, int ifindex)
+{
+	struct net_device *dev;
+	int rc;
+
+	dev = dev_get_by_index(&init_net, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	if (ctx->dev) {
+		rc = dev == ctx->dev ? 0 : -EALREADY;
+		dev_put(dev);
+		return rc;
+	}
+
+	ctx->dev = dev;
+
+	return 0;
+}
+
+static int
+__netgpu_queue_mgmt(struct net_device *dev, struct netgpu_ctx *ctx,
+		    u32 queue_id)
+{
+	struct netdev_bpf cmd;
+	bpf_op_t ndo_bpf;
+
+	cmd.command = XDP_SETUP_NETGPU;
+	cmd.netgpu.ctx = ctx;
+	cmd.netgpu.queue_id = queue_id;
+
+	ndo_bpf = dev->netdev_ops->ndo_bpf;
+	if (!ndo_bpf)
+		return -EINVAL;
+
+	return ndo_bpf(dev, &cmd);
+}
+
+static int
+netgpu_open_queue(struct netgpu_ctx *ctx, u32 queue_id)
+{
+	return __netgpu_queue_mgmt(ctx->dev, ctx, queue_id);
+}
+
+static int
+netgpu_close_queue(struct netgpu_ctx *ctx, u32 queue_id)
+{
+	return __netgpu_queue_mgmt(ctx->dev, NULL, queue_id);
+}
+
+static int
+netgpu_bind_queue(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct netgpu_params p;
+	int rc;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	rc = netgpu_validate_param(ctx, &p);
+	if (rc)
+		return rc;
+
+	rc = netgpu_queue_create(&ctx->fill, &p.fill);
+	if (rc)
+		return rc;
+
+	rc = netgpu_queue_create(&ctx->rx, &p.rx);
+	if (rc)
+		return rc;
+
+	rc = netgpu_open_queue(ctx, p.queue_id);
+	if (rc)
+		return rc;
+	ctx->queue_id = p.queue_id;
+
+	if (copy_to_user(arg, &p, sizeof(p)))
+		return -EFAULT;
+		/* XXX leaks ring here ... */
+
+	return rc;
+}
+
+static int
+netgpu_attach_dev(struct netgpu_ctx *ctx, void __user *arg)
+{
+	int ifindex;
+
+	if (copy_from_user(&ifindex, arg, sizeof(ifindex)))
+		return -EFAULT;
+
+	return netgpu_bind_device(ctx, ifindex);
+}
+
+static long
+netgpu_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+
+	switch (cmd) {
+	case NETGPU_IOCTL_ATTACH_DEV:
+		return netgpu_attach_dev(ctx, (void __user *)arg);
+
+	case NETGPU_IOCTL_BIND_QUEUE:
+		return netgpu_bind_queue(ctx, (void __user *)arg);
+
+	case NETGPU_IOCTL_ADD_REGION:
+		return netgpu_add_region(ctx, (void __user *)arg);
+	}
+	return -ENOTTY;
+}
+
+static int
+netgpu_show(struct seq_file *seq_file, void *private)
+{
+	return 0;
+}
+
+static struct netgpu_ctx *
+netgpu_create_ctx(void)
+{
+	struct netgpu_ctx *ctx;
+	size_t sz;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->account_mem = !capable(CAP_IPC_LOCK);
+	ctx->user = get_uid(current_user());
+
+	sz = struct_size(ctx->napi_cache, page, NETGPU_CACHE_COUNT);
+	ctx->napi_cache = kmalloc(sz, GFP_KERNEL);
+	if (!ctx->napi_cache)
+		goto out;
+	ctx->napi_cache->next = NULL;
+
+	ctx->dummy_page = alloc_page(GFP_KERNEL);
+	if (!ctx->dummy_page)
+		goto out;
+
+	spin_lock_init(&ctx->pgcache_lock);
+	xa_init(&ctx->xa);
+	refcount_set(&ctx->ref, 1);
+	ctx->queue_id = -1;
+	ctx->any_cache_count = -1;
+
+	/* Set dummy page refs to MAX, with extra to hand out */
+	page_ref_add(ctx->dummy_page, NETGPU_REFC_MAX - 1);
+	ctx->page_extra_refc = NETGPU_REFC_EXTRA;
+
+	return (ctx);
+
+out:
+	free_uid(ctx->user);
+	kfree(ctx->napi_cache);
+	if (ctx->dummy_page)
+		put_page(ctx->dummy_page);
+	kfree(ctx);
+
+	return NULL;
+}
+
+static int
+netgpu_open(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx;
+	int err;
+
+	ctx = netgpu_create_ctx();
+	if (!ctx)
+		return -ENOMEM;
+
+	__module_get(THIS_MODULE);
+
+	/* miscdevice inits (but doesn't use) private_data.
+	 * single_open wants to use it, so set to NULL first.
+	 */
+	file->private_data = NULL;
+	err = single_open(file, netgpu_show, ctx);
+	if (err)
+		goto out;
+
+	mutex_lock(&netgpu_lock);
+	ctx->index = ++netgpu_index;
+	mutex_unlock(&netgpu_lock);
+
+	/* XXX retval... */
+	xa_store(&xa_netgpu, ctx->index, ctx, GFP_KERNEL);
+
+	return 0;
+
+out:
+	netgpu_free_ctx(ctx);
+
+	return err;
+}
+
+#ifdef USE_CUDA
+static void
+netgpu_cuda_free_page_range(struct netgpu_page_range *pr)
+{
+	unsigned long pfn, pfn_end;
+	struct page *page;
+
+	pfn_end = pr->pfn +
+		  ((pr->va_node.last + 1 - pr->va_node.start) >> PAGE_SHIFT);
+
+	for (pfn = pr->pfn; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		set_page_count(page, 0);
+	}
+	netgpu_free_pages(pr->res);
+	kfree(pr);
+}
+
+static void
+netgpu_cuda_release_resources(struct netgpu_region *r)
+{
+	struct pci_dev *pdev;
+	int ret;
+
+	if (r->dmamap) {
+		pdev = netdev2pci_dev(r->ctx->dev);
+		ret = nvidia_p2p_dma_unmap_pages(pdev, r->gpu_pgtbl, r->dmamap);
+		if (ret)
+			pr_err("nvidia_p2p_dma_unmap failed: %d\n", ret);
+	}
+}
+
+static void
+netgpu_cuda_free_region(struct netgpu_region *r)
+{
+	struct interval_tree_node *va_node;
+	int ret;
+
+	netgpu_cuda_release_resources(r);
+
+	region_remove_each(r, va_node)
+		netgpu_cuda_free_page_range(node2page_range(va_node));
+
+	/* NB: this call is a NOP in the current code */
+	ret = nvidia_p2p_free_page_table(r->gpu_pgtbl);
+	if (ret)
+		pr_err("nvidia_p2p_free_page_table error %d\n", ret);
+
+	/* erase if initial store was successful */
+	if (r->ctx) {
+		xa_store_range(&r->ctx->xa, r->start >> PAGE_SHIFT,
+			       (r->start + r->len) >> PAGE_SHIFT,
+			       NULL, GFP_KERNEL);
+		netgpu_free_ctx(r->ctx);
+	}
+
+	kfree(r);
+}
+#endif
+
+static void
+netgpu_host_free_region(struct netgpu_ctx *ctx, struct netgpu_region *r)
+{
+	unsigned nr_pages;
+
+	if (!r->host_memory)
+		return;
+
+	nr_pages = r->len / PAGE_SIZE;
+
+	xa_store_range(&ctx->xa, r->start >> PAGE_SHIFT,
+		      (r->start + r->len) >> PAGE_SHIFT,
+		      NULL, GFP_KERNEL);
+
+	if (ctx->account_mem)
+		netgpu_host_unaccount_mem(ctx->user, nr_pages);
+	netgpu_host_release_pages(r, nr_pages);
+	kvfree(r->page);
+	kvfree(r->pginfo);
+	kfree(r);
+}
+
+static void
+__netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	struct netgpu_region *r;
+	unsigned long index;
+
+	xa_for_each(&ctx->xa, index, r)
+		netgpu_host_free_region(ctx, r);
+
+	xa_destroy(&ctx->xa);
+
+	netgpu_free_cache(ctx);
+	free_uid(ctx->user);
+	ctx->page_extra_refc += (NETGPU_REFC_SELF - 1);
+	page_ref_sub(ctx->dummy_page, ctx->page_extra_refc);
+	put_page(ctx->dummy_page);
+	if (ctx->dev)
+		dev_put(ctx->dev);
+	kfree(ctx);
+
+	module_put(THIS_MODULE);
+}
+
+static void
+netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		__netgpu_free_ctx(ctx);
+}
+
+static int
+netgpu_release(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+	int ret;
+
+	if (ctx->queue_id != -1)
+		netgpu_close_queue(ctx, ctx->queue_id);
+
+	xa_erase(&xa_netgpu, ctx->index);
+
+	netgpu_free_ctx(ctx);
+
+	ret = single_release(inode, file);
+
+	return ret;
+}
+
+static void *
+netgpu_validate_mmap_request(struct file *file, loff_t pgoff, size_t sz)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+	loff_t offset = pgoff << PAGE_SHIFT;
+	struct page *page;
+	void *ptr;
+
+	/* each returned ptr is a separate allocation. */
+	switch (offset) {
+	case NETGPU_OFF_FILL_ID:
+		ptr = ctx->fill.map_ptr;
+		break;
+	case NETGPU_OFF_RX_ID:
+		ptr = ctx->rx.map_ptr;
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > page_size(page))
+		return ERR_PTR(-EINVAL);
+
+	return ptr;
+}
+
+static int
+netgpu_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	size_t sz = vma->vm_end - vma->vm_start;
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = netgpu_validate_mmap_request(file, vma->vm_pgoff, sz);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+static const struct file_operations netgpu_fops = {
+	.owner =		THIS_MODULE,
+	.open =			netgpu_open,
+	.mmap =			netgpu_mmap,
+	.unlocked_ioctl =	netgpu_ioctl,
+	.release =		netgpu_release,
+};
+
+static struct miscdevice netgpu_miscdev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "netgpu",
+	.fops		= &netgpu_fops,
+};
+
+static int __init
+netgpu_init(void)
+{
+	mutex_init(&netgpu_lock);
+	misc_register(&netgpu_miscdev);
+
+	return 0;
+}
+
+static void __exit
+netgpu_fini(void)
+{
+	misc_deregister(&netgpu_miscdev);
+}
+
+module_init(netgpu_init);
+module_exit(netgpu_fini);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("jlemon@flugsvamp.com");
diff --git a/include/uapi/misc/netgpu.h b/include/uapi/misc/netgpu.h
new file mode 100644
index 000000000000..ca3338464218
--- /dev/null
+++ b/include/uapi/misc/netgpu.h
@@ -0,0 +1,43 @@
+#pragma once
+
+#include <linux/ioctl.h>
+
+/* VA memory provided by a specific PCI device. */
+struct dma_region {
+	struct iovec iov;
+	unsigned host_memory : 1;
+};
+
+#define NETGPU_OFF_FILL_ID	(0ULL << 12)
+#define NETGPU_OFF_RX_ID	(1ULL << 12)
+
+struct netgpu_queue_offsets {
+	unsigned prod;
+	unsigned cons;
+	unsigned desc;
+	unsigned resv;
+};
+
+struct netgpu_user_queue {
+	unsigned elt_sz;
+	unsigned entries;
+	unsigned mask;
+	unsigned map_sz;
+	unsigned map_off;
+	struct netgpu_queue_offsets off;
+};
+
+struct netgpu_params {
+	unsigned flags;
+	unsigned ifindex;
+	unsigned queue_id;
+	unsigned resv;
+	struct netgpu_user_queue fill;
+	struct netgpu_user_queue rx;
+};
+
+#define NETGPU_IOCTL_ATTACH_DEV		_IOR(0, 1, int)
+#define NETGPU_IOCTL_BIND_QUEUE		_IOWR(0, 2, struct netgpu_params)
+#define NETGPU_IOCTL_SETUP_RING		_IOWR(0, 2, struct netgpu_params)
+#define NETGPU_IOCTL_ADD_REGION		_IOW(0, 3, struct dma_region)
+
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 16/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (14 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 15/21] netgpu: add network/gpu dma module Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 17/21] net/core: add the SO_REGISTER_DMA socket option Jonathan Lemon
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

If a sock is marked as sending zc data, have the iterator
retrieve the correct zc pages from the netgpu module.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/uio.h |  4 ++++
 lib/iov_iter.c      | 45 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/datagram.c |  6 +++++-
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9576fd8158d7..d4c15205a248 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -227,6 +227,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+struct sock;
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct sock *sk,
+                   size_t maxsize, struct page **pages, unsigned maxpages,
+                   size_t *pgoff);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bf538c2bec77..a50fa3999de3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,6 +10,9 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
+#include <net/netgpu.h>
+#include <net/sock.h>
+
 #define PIPE_PARANOIA /* for now */
 
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
@@ -1349,6 +1352,48 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct sock *sk,
+		   size_t maxsize, struct page **pages, unsigned maxpages,
+		   size_t *pgoff)
+{
+	const struct iovec *iov;
+	unsigned long addr;
+	struct iovec v;
+	size_t len;
+	unsigned n;
+	int ret;
+
+	if (!sk->sk_user_data)
+		return iov_iter_get_pages(i, pages, maxsize, maxpages, pgoff);
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+
+	if (!iter_is_iovec(i))
+		return -EFAULT;
+
+	if (iov_iter_rw(i) != WRITE)
+		return -EFAULT;
+
+	iterate_iovec(i, maxsize, v, iov, i->iov_offset, ({
+		addr = (unsigned long)v.iov_base;
+		*pgoff = addr & (PAGE_SIZE - 1);
+		len = v.iov_len + *pgoff;
+
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
+
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+
+		ret = __netgpu_get_pages(sk, pages, addr, n);
+		if (ret > 0)
+			ret = (ret == n ? len : ret * PAGE_SIZE) - *pgoff;
+		return ret;
+	0;}));
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_sk_get_pages);
+
 static struct page **get_pages_array(size_t n)
 {
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 639745d4f3b9..7dd8814c222a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -530,6 +530,10 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 			   struct iov_iter *to, int len)
 {
 	trace_skb_copy_datagram_iovec(skb, len);
+	if (skb->zc_netgpu) {
+		pr_err("skb netgpu datagram on !netgpu sk\n");
+		return -EFAULT;
+	}
 	return __skb_datagram_iter(skb, offset, to, len, false,
 			simple_copy_to_iter, NULL);
 }
@@ -631,7 +635,7 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, length,
+		copied = iov_iter_sk_get_pages(from, sk, length, pages,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 17/21] net/core: add the SO_REGISTER_DMA socket option
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (15 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 16/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 18/21] tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This option indicates that the socket will be performing zero-copy
sends and receives through the netgpu module.
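
A rough sketch of the intended userspace sequence (netgpu_fd is
assumed to come from opening the netgpu character device in patch 15;
the socket must already be bound to a device, since this option
checks sk_bound_dev_if):

    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, "eth0", sizeof("eth0"));
    setsockopt(s, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    /* hand the netgpu context fd to the socket */
    setsockopt(s, SOL_SOCKET, SO_REGISTER_DMA, &netgpu_fd,
               sizeof(netgpu_fd));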

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/uapi/asm-generic/socket.h |  2 ++
 net/core/sock.c                   | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 77f7c1638eb1..5a8577c90e2a 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -119,6 +119,8 @@
 
 #define SO_DETACH_REUSEPORT_BPF 68
 
+#define SO_REGISTER_DMA		69
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..c9e93ee675d6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -828,6 +828,25 @@ void sock_set_rcvbuf(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(sock_set_rcvbuf);
 
+extern int netgpu_register_dma(struct sock *sk, char __user *optval, unsigned int optlen);
+
+static int
+sock_register_dma(struct sock *sk, char __user *optval, unsigned int optlen)
+{
+	int rc;
+	int (*fn)(struct sock *sk, char __user *optval, unsigned int optlen);
+
+	fn = symbol_get(netgpu_register_dma);
+	if (!fn)
+		return -EINVAL;
+
+	rc = fn(sk, optval, optlen);
+
+	symbol_put(netgpu_register_dma);
+
+	return rc;
+}
+
 /*
  *	This is meant for all protocols to use and covers goings on
  *	at the socket level. Everything here is generic.
@@ -1232,6 +1251,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		}
 		break;
 
+	case SO_REGISTER_DMA:
+		if (!sk->sk_bound_dev_if)
+			ret = -EINVAL;
+		else
+			ret = sock_register_dma(sk, optval, optlen);
+		break;
+
 	case SO_TXTIME:
 		if (optlen != sizeof(struct sock_txtime)) {
 			ret = -EINVAL;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 18/21] tcp: add MSG_NETDMA flag for sendmsg()
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (16 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 17/21] net/core: add the SO_REGISTER_DMA socket option Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 19/21] core: add page recycling logic for netgpu pages Jonathan Lemon
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

This flag indicates that the attached data is a zero-copy send,
and the pages should be retrieved from the netgpu module.  The
socket must have been marked as SOCK_ZEROCOPY, and also registered
with netgpu via SO_REGISTER_DMA.
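
A minimal userspace sketch of such a send is shown below; the buffer is
assumed to point into a memory area that was previously registered with
the netgpu context, and the helper name is made up.

/* Sketch only: MSG_NETDMA comes from this patch, the rest is a plain
 * sendmsg(); buf must lie inside a registered memory area.
 */
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_NETDMA
#define MSG_NETDMA	0x8000000
#endif

static ssize_t netdma_send(int fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	/* requires SOCK_ZEROCOPY and a prior SO_REGISTER_DMA on this fd */
	return sendmsg(fd, &msg, MSG_NETDMA);
}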

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/socket.h | 1 +
 net/ipv4/tcp.c         | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 04d2bc97f497..63816cc25dee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -310,6 +310,7 @@ struct ucred {
 					  */
 
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
+#define MSG_NETDMA	0x8000000	/* Zero-copy send via netgpu */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 810cc164f795..219670152f77 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1209,6 +1209,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			uarg->zerocopy = 0;
 	}
 
+	if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+		zc = sk->sk_route_caps & NETIF_F_SG;
+		if (!zc) {
+			err = -EFAULT;
+			goto out_err;
+		}
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
 	    !tp->repair) {
 		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size, uarg);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 19/21] core: add page recycling logic for netgpu pages
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (17 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 18/21] tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 21/21] mlx5: add XDP_SETUP_NETGPU hook Jonathan Lemon
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

netgpu pages will always have a refcount of at least one (held by
the netgpu module).  This logic and the codepath obviously need
work, but they suffice for a proof-of-concept.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/core/skbuff.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a391042be53..2b4176cab578 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#include <net/netgpu.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -590,6 +591,24 @@ static void skb_free_head(struct sk_buff *skb)
 		kfree(head);
 }
 
+static void skb_netgpu_unref(struct skb_shared_info *shinfo)
+{
+	struct page *page;
+	int count;
+	int i;
+
+	/* Pages attached to skbs for TX shouldn't come here, since
+	 * those skbs are not marked as zc_netgpu (only RX skbs are).
+	 * The dummy page does come here, but always has an elevated refcount.
+	 */
+	for (i = 0; i < shinfo->nr_frags; i++) {
+		page = skb_frag_page(&shinfo->frags[i]);
+		count = page_ref_dec_return(page);
+		if (count <= 2)
+			__netgpu_put_page(g_ctx, page, false);
+	}
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
@@ -600,8 +619,12 @@ static void skb_release_data(struct sk_buff *skb)
 			      &shinfo->dataref))
 		return;
 
-	for (i = 0; i < shinfo->nr_frags; i++)
-		__skb_frag_unref(&shinfo->frags[i]);
+	if (skb->zc_netgpu && shinfo->nr_frags) {
+		skb_netgpu_unref(shinfo);
+	} else {
+		for (i = 0; i < shinfo->nr_frags; i++)
+			__skb_frag_unref(&shinfo->frags[i]);
+	}
 
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (18 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 19/21] core: add page recycling logic for netgpu pages Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  2020-06-18 16:09 ` [RFC PATCH 21/21] mlx5: add XDP_SETUP_NETGPU hook Jonathan Lemon
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

The skb_zcopy() flag indicates whether the skb has a zerocopy ubuf.
netgpu does not use ubufs, so skb_zdata() indicates whether the
skb is carrying zero-copy data and should be left alone, while
skb_zcopy() indicates whether there is an attached ubuf.

Also, when a write() on a zero-copy socket returns EWOULDBLOCK, this
is not synchronized with select(), which only looks at the send
buffer and reports writability if there is TCP space available.

This appears to be caused by some ubuf logic, leading to iperf
spending 70% of its time in select() for ZC transmits.  With this
change, the time spent drops to 20%.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 24 +++++++++++++++++++++++-
 net/core/skbuff.c      | 16 ++++++++++++----
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ba41d1a383f8..3c2efd45655b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -443,8 +443,12 @@ enum {
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* fragments are accessed only via DMA */
+	SKBTX_DEV_NETDMA = 1 << 7,
 };
 
+#define SKBTX_ZERODATA_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_DEV_NETDMA)
 #define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
@@ -1416,6 +1420,24 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline bool skb_netdma(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_NETDMA;
+}
+
+static inline bool skb_zdata(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_ZERODATA_FRAG;
+}
+
+static inline void skb_netdma_set(struct sk_buff *skb, bool netdma)
+{
+	if (skb && netdma) {
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_NETDMA;
+		skb_shinfo(skb)->destructor_arg = NULL;
+	}
+}
+
 static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 {
 	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
@@ -3260,7 +3282,7 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
-	if (skb_zcopy(skb))
+	if (skb_zdata(skb))
 		return false;
 	if (i) {
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2b4176cab578..67a421257a27 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1323,6 +1323,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 	}
 
 	skb_zcopy_set(skb, uarg, NULL);
+	skb_netdma_set(skb, sk->sk_user_data);
+
 	return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1330,8 +1332,8 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
-	if (skb_zcopy(orig)) {
-		if (skb_zcopy(nskb)) {
+	if (skb_zdata(orig)) {
+		if (skb_zdata(nskb)) {
 			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
 			if (!gfp_mask) {
 				WARN_ON_ONCE(1);
@@ -1343,6 +1345,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 				return -EIO;
 		}
 		skb_zcopy_set(nskb, skb_uarg(orig), NULL);
+		skb_netdma_set(nskb, skb_netdma(orig));
 	}
 	return 0;
 }
@@ -1372,6 +1375,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
 		return -EINVAL;
 
+	if (!skb_has_shared_frag(skb))
+		return -EINVAL;
+
 	if (!num_frags)
 		goto release;
 
@@ -2078,6 +2084,8 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	 */
 	int i, k, eat = (skb->tail + delta) - skb->end;
 
+	BUG_ON(skb_netdma(skb));
+
 	if (eat > 0 || skb_cloned(skb)) {
 		if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
 				     GFP_ATOMIC))
@@ -3328,7 +3336,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
-	if (skb_zcopy(tgt) || skb_zcopy(skb))
+	if (skb_zdata(tgt) || skb_zdata(skb))
 		return 0;
 
 	todo = shiftlen;
@@ -5171,7 +5179,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	from_shinfo = skb_shinfo(from);
 	if (to_shinfo->frag_list || from_shinfo->frag_list)
 		return false;
-	if (skb_zcopy(to) || skb_zcopy(from))
+	if (skb_zdata(to) || skb_zdata(from))
 		return false;
 
 	if (skb_headlen(from) != 0) {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 21/21] mlx5: add XDP_SETUP_NETGPU hook
  2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (19 preceding siblings ...)
  2020-06-18 16:09 ` [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
@ 2020-06-18 16:09 ` Jonathan Lemon
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe

Add the hook which enables and disables the zero copy queues.
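
For context, the caller is expected to look roughly like the sketch
below.  XDP_SETUP_NETGPU and the netdev_bpf netgpu member come from
earlier patches in this series; netgpu_install_queues() and the u16
queue_id type are assumptions, and error handling is minimal.

#include <linux/netdevice.h>
#include <net/netgpu.h>

static int netgpu_install_queues(struct net_device *dev,
				 struct netgpu_ctx *ctx, u16 queue_id)
{
	struct netdev_bpf bpf = {};

	if (!dev->netdev_ops->ndo_bpf)
		return -EOPNOTSUPP;

	bpf.command = XDP_SETUP_NETGPU;
	bpf.netgpu.ctx = ctx;
	bpf.netgpu.queue_id = queue_id;

	/* hands the zero-copy queues for queue_id over to the driver */
	return dev->netdev_ops->ndo_bpf(dev, &bpf);
}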

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c791578be5ea..05f93f78ebbc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4598,6 +4598,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_XSK_UMEM:
 		return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
 					    xdp->xsk.queue_id);
+	case XDP_SETUP_NETGPU:
+		return mlx5e_netgpu_setup_ctx(dev, xdp->netgpu.ctx,
+					      xdp->netgpu.queue_id);
 	default:
 		return -EINVAL;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 16:09 ` [RFC PATCH 06/21] mlx5: add header_split flag Jonathan Lemon
@ 2020-06-18 18:12   ` Eric Dumazet
  2020-06-18 20:25     ` Michal Kubecek
  2020-06-18 21:50     ` Jonathan Lemon
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-18 18:12 UTC (permalink / raw)
  To: Jonathan Lemon, netdev
  Cc: kernel-team, axboe, Govindarajulu Varadarajan, Michal Kubecek



On 6/18/20 9:09 AM, Jonathan Lemon wrote:
> Adds a "rx_hd_split" private flag parameter to ethtool.
> 
> This enables header splitting, and sets up the fragment mappings.
> The feature is currently only enabled for netgpu channels.

We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
to enable TCP RX zerocopy on x86.

Patch for mlx4 has not been sent upstream yet.

For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.

(In our case, most of our data TCP packets only have 12 bytes of TCP options)


I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
so that the exact number of bytes can be tuned, instead of hard coded in the driver.

(Patch for the counter part of [1] was resent 10 days ago on netdev@ by Govindarajulu Varadarajan)
(Not sure if this has been merged yet)

[1]

commit f0db9b073415848709dd59a6394969882f517da9
Author: Govindarajulu Varadarajan <_govind@gmx.com>
Date:   Wed Sep 3 03:17:20 2014 +0530

    ethtool: Add generic options for tunables
    
    This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
    tunable values from driver.
    
    Add get_tunable and set_tunable to ethtool_ops. Driver implements these
    functions for getting/setting tunable value.
    
    Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>




> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 15 +++++++
>  .../net/ethernet/mellanox/mlx5/core/en_main.c | 45 +++++++++++++++----
>  2 files changed, 52 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> index ec5658bbe3c5..a1b5d8b33b0b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> @@ -1905,6 +1905,20 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
>  	return err;
>  }
>  
> +static int set_pflag_rx_hd_split(struct net_device *netdev, bool enable)
> +{
> +	struct mlx5e_priv *priv = netdev_priv(netdev);
> +	int err;
> +
> +	priv->channels.params.hd_split = enable;
> +	err = mlx5e_safe_reopen_channels(priv);
> +	if (err)
> +		netdev_err(priv->netdev,
> +			   "%s failed to reopen channels, err(%d).\n",
> +				__func__, err);
> +	return err;
> +}
> +
>  static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
>  	{ "rx_cqe_moder",        set_pflag_rx_cqe_based_moder },
>  	{ "tx_cqe_moder",        set_pflag_tx_cqe_based_moder },
> @@ -1912,6 +1926,7 @@ static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
>  	{ "rx_striding_rq",      set_pflag_rx_striding_rq },
>  	{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
>  	{ "xdp_tx_mpwqe",        set_pflag_xdp_tx_mpwqe },
> +	{ "rx_hd_split",	 set_pflag_rx_hd_split },
>  };
>  
>  static int mlx5e_handle_pflag(struct net_device *netdev,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index a836a02a2116..cc8d30aa8a33 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -123,7 +123,8 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
>  
>  void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
>  {
> -	params->rq_wq_type = mlx5e_striding_rq_possible(mdev, params) &&
> +	params->rq_wq_type = MLX5E_HD_SPLIT(params) ? MLX5_WQ_TYPE_CYCLIC :
> +		mlx5e_striding_rq_possible(mdev, params) &&
>  		MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ) ?
>  		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
>  		MLX5_WQ_TYPE_CYCLIC;
> @@ -323,6 +324,8 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
>  				if (prev)
>  					prev->last_in_page = true;
>  			}
> +			next_frag.di->netgpu_source =
> +						!!frag_info[f].frag_source;
>  			*frag = next_frag;
>  
>  			/* prepare next */
> @@ -373,6 +376,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	struct mlx5_core_dev *mdev = c->mdev;
>  	void *rqc = rqp->rqc;
>  	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
> +	bool hd_split = MLX5E_HD_SPLIT(params) && (umem == (void *)0x1);
> +	u32 num_xsk_frames = 0;
>  	u32 rq_xdp_ix;
>  	u32 pool_size;
>  	int wq_sz;
> @@ -391,9 +396,10 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	rq->mdev    = mdev;
>  	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
>  	rq->xdpsq   = &c->rq_xdpsq;
> -	rq->umem    = umem;
> +	if (xsk)
> +		rq->umem    = umem;
>  
> -	if (rq->umem)
> +	if (umem)
>  		rq->stats = &c->priv->channel_stats[c->ix].xskrq;
>  	else
>  		rq->stats = &c->priv->channel_stats[c->ix].rq;
> @@ -404,14 +410,18 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	rq->xdp_prog = params->xdp_prog;
>  
>  	rq_xdp_ix = rq->ix;
> -	if (xsk)
> +	if (umem)
>  		rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
>  	err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
>  	if (err < 0)
>  		goto err_rq_wq_destroy;
>  
> +	if (umem == (void *)0x1)
> +		rq->buff.headroom = 0;
> +	else
> +		rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
> +
>  	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> -	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
>  	pool_size = 1 << params->log_rq_mtu_frames;
>  
>  	switch (rq->wq_type) {
> @@ -509,6 +519,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  
>  		rq->wqe.skb_from_cqe = xsk ?
>  			mlx5e_xsk_skb_from_cqe_linear :
> +			hd_split ? mlx5e_skb_from_cqe_nonlinear :
>  			mlx5e_rx_is_linear_skb(params, NULL) ?
>  				mlx5e_skb_from_cqe_linear :
>  				mlx5e_skb_from_cqe_nonlinear;
> @@ -2035,13 +2046,19 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  	int frag_size_max = DEFAULT_FRAG_SIZE;
>  	u32 buf_size = 0;
>  	int i;
> +	bool hd_split = MLX5E_HD_SPLIT(params) && xsk;
> +
> +	if (hd_split)
> +		frag_size_max = HD_SPLIT_DEFAULT_FRAG_SIZE;
> +	else
> +		frag_size_max = DEFAULT_FRAG_SIZE;
>  
>  #ifdef CONFIG_MLX5_EN_IPSEC
>  	if (MLX5_IPSEC_DEV(mdev))
>  		byte_count += MLX5E_METADATA_ETHER_LEN;
>  #endif
>  
> -	if (mlx5e_rx_is_linear_skb(params, xsk)) {
> +	if (!hd_split && mlx5e_rx_is_linear_skb(params, xsk)) {
>  		int frag_stride;
>  
>  		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
> @@ -2059,6 +2076,16 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  		frag_size_max = PAGE_SIZE;
>  
>  	i = 0;
> +
> +	if (hd_split) {
> +		// Start with one fragment for all headers (implementing HDS)
> +		info->arr[0].frag_size = TOTAL_HEADERS;
> +		info->arr[0].frag_stride = roundup_pow_of_two(PAGE_SIZE);
> +		buf_size += TOTAL_HEADERS;
> +		// Now, continue with the payload frags.
> +		i = 1;
> +	}
> +
>  	while (buf_size < byte_count) {
>  		int frag_size = byte_count - buf_size;
>  
> @@ -2066,8 +2093,10 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  			frag_size = min(frag_size, frag_size_max);
>  
>  		info->arr[i].frag_size = frag_size;
> -		info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
> -
> +		info->arr[i].frag_stride = roundup_pow_of_two(hd_split ?
> +							      PAGE_SIZE :
> +							      frag_size);
> +		info->arr[i].frag_source = hd_split;
>  		buf_size += frag_size;
>  		i++;
>  	}
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 18:12   ` Eric Dumazet
@ 2020-06-18 20:25     ` Michal Kubecek
  2020-06-18 22:45       ` Eric Dumazet
  2020-06-18 21:50     ` Jonathan Lemon
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2020-06-18 20:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Lemon, netdev, kernel-team, axboe, Govindarajulu Varadarajan

On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
> > Adds a "rx_hd_split" private flag parameter to ethtool.
> > 
> > This enables header splitting, and sets up the fragment mappings.
> > The feature is currently only enabled for netgpu channels.
> 
> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
> to enable TCP RX zerocopy on x86.
> 
> Patch for mlx4 has not been sent upstream yet.
> 
> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
> 
> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
> 
> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
> so that the exact number of bytes can be tuned, instead of hard coded in the driver.

I fully agree that such generic parameter would be a better solution
than a private flag. But I have my doubts about adding more tunables.
The point is that the concept of tunables looks like a workaround for
the lack of extensibility of the ioctl interface where the space for
adding new parameters to existing subcommands was limited (or none).

With netlink, adding new parameters is much easier and as only three
tunables were added in 6 years (or four with your proposal), we don't
have to worry about having too many different attributes (current code
isn't even designed to scale well to many tunables).

This new header split parameter could IMHO be naturally put together
with rx-copybreak and tx-copybreak and possibly any future parameters
to control how packet contents are passed between the NIC/driver and
the networking stack.

> (Patch for the counter part of [1] was resent 10 days ago on netdev@ by Govindarajulu Varadarajan)
> (Not sure if this has been merged yet)

Not yet, I want to take another look in the rest of this week.

Michal

> [1]
> 
> commit f0db9b073415848709dd59a6394969882f517da9
> Author: Govindarajulu Varadarajan <_govind@gmx.com>
> Date:   Wed Sep 3 03:17:20 2014 +0530
> 
>     ethtool: Add generic options for tunables
>     
>     This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
>     tunable values from driver.
>     
>     Add get_tunable and set_tunable to ethtool_ops. Driver implements these
>     functions for getting/setting tunable value.
>     
>     Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 18:12   ` Eric Dumazet
  2020-06-18 20:25     ` Michal Kubecek
@ 2020-06-18 21:50     ` Jonathan Lemon
  2020-06-18 22:34       ` Eric Dumazet
  2020-06-18 22:36       ` Eric Dumazet
  1 sibling, 2 replies; 28+ messages in thread
From: Jonathan Lemon @ 2020-06-18 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan, Michal Kubecek

On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
> 
> 
> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
> > Adds a "rx_hd_split" private flag parameter to ethtool.
> > 
> > This enables header splitting, and sets up the fragment mappings.
> > The feature is currently only enabled for netgpu channels.
> 
> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
> to enable TCP RX zerocopy on x86.
> 
> Patch for mlx4 has not been sent upstream yet.
> 
> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
> 
> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
> 
> 
> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
> so that the exact number of bytes can be tuned, instead of hard coded in the driver.

Sounds reasonable - in the long run, it would be ideal to have the
hardware actually perform header splitting, but for now using a tunable
fixed offset will work.  In the same vein, there should be a similar
setting for the TCP option padding on the sender side.
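
Something along these lines, presumably (ETHTOOL_RX_HEADER_SPLIT and the
hd_split_size field are made up for illustration; the real hooks are the
existing get_tunable/set_tunable ethtool_ops, and this would live in
en_ethtool.c):

static int hd_split_get_tunable(struct net_device *dev,
				const struct ethtool_tunable *tuna,
				void *data)
{
	struct mlx5e_priv *priv = netdev_priv(dev);

	switch (tuna->id) {
	case ETHTOOL_RX_HEADER_SPLIT:	/* hypothetical new tunable_id */
		*(u32 *)data = priv->channels.params.hd_split_size;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

static int hd_split_set_tunable(struct net_device *dev,
				const struct ethtool_tunable *tuna,
				const void *data)
{
	struct mlx5e_priv *priv = netdev_priv(dev);

	switch (tuna->id) {
	case ETHTOOL_RX_HEADER_SPLIT:
		priv->channels.params.hd_split_size = *(u32 *)data;
		/* reopen channels so the new split size takes effect */
		return mlx5e_safe_reopen_channels(priv);
	default:
		return -EOPNOTSUPP;
	}
}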
-- 
Jonathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 21:50     ` Jonathan Lemon
@ 2020-06-18 22:34       ` Eric Dumazet
  2020-06-18 22:36       ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-18 22:34 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan, Michal Kubecek



On 6/18/20 2:50 PM, Jonathan Lemon wrote:
> On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
>>
>>
>> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
>>> Adds a "rx_hd_split" private flag parameter to ethtool.
>>>
>>> This enables header splitting, and sets up the fragment mappings.
>>> The feature is currently only enabled for netgpu channels.
>>
>> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
>> to enable TCP RX zerocopy on x86.
>>
>> Patch for mlx4 has not been sent upstream yet.
>>
>> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
>> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
>>
>> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
>>
>>
>> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
>> so that the exact number of bytes can be tuned, instead of hard coded in the driver.
> 
> Sounds reasonable - in the long run, it would be ideal to have the
> hardware actually perform header splitting, but for now using a tunable
> fixed offset will work.  In the same vein, there should be a similar
> setting for the TCP option padding on the sender side.
> 

Some NICs have variable header split (Intel ixgbe, I am pretty sure).

We use a mix of NICs, some with variable header split, some with fixed pseudo header split (mlx4).

Because of this, we had to limit TCP advmss to 4108 (4096 + 12), regardless of the NIC's abilities.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 21:50     ` Jonathan Lemon
  2020-06-18 22:34       ` Eric Dumazet
@ 2020-06-18 22:36       ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-18 22:36 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan, Michal Kubecek



On 6/18/20 2:50 PM, Jonathan Lemon wrote:

> In the same vein, there should be a similar
> setting for the TCP option padding on the sender side.


We had no need for such a hack in the TCP stack :)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
  2020-06-18 20:25     ` Michal Kubecek
@ 2020-06-18 22:45       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-18 22:45 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet
  Cc: Jonathan Lemon, netdev, kernel-team, axboe, Govindarajulu Varadarajan



On 6/18/20 1:25 PM, Michal Kubecek wrote:
> On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
>> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
>>> Adds a "rx_hd_split" private flag parameter to ethtool.
>>>
>>> This enables header splitting, and sets up the fragment mappings.
>>> The feature is currently only enabled for netgpu channels.
>>
>> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
>> to enable TCP RX zerocopy on x86.
>>
>> Patch for mlx4 has not been sent upstream yet.
>>
>> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
>> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
>>
>> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
>>
>> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
>> so that the exact number of bytes can be tuned, instead of hard coded in the driver.
> 
> I fully agree that such generic parameter would be a better solution
> than a private flag. But I have my doubts about adding more tunables.
> The point is that the concept of tunables looks like a workaround for
> the lack of extensibility of the ioctl interface where the space for
> adding new parameters to existing subcommands was limited (or none).
> 
> With netlink, adding new parameters is much easier and as only three
> tunables were added in 6 years (or four with your proposal), we don't
> have to worry about having too many different attributes (current code
> isn't even designed to scale well to many tunables).
> 
> This new header split parameter could IMHO be naturally put together
> with rx-copybreak and tx-copybreak and possibly any future parameters
> to control how packet contents is passed between NIC/driver and
> networking stack.

This is what I suggested, maybe this was not clear.

Currently known tunables are :

enum tunable_id {
	ETHTOOL_ID_UNSPEC,
	ETHTOOL_RX_COPYBREAK,
	ETHTOOL_TX_COPYBREAK,
	ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
	/*
	 * Add your fresh new tunable attribute above and remember to update
	 * tunable_strings[] in net/core/ethtool.c
	 */
	__ETHTOOL_TUNABLE_COUNT,
};

I.e., add a new ETHTOOL_RX_HEADER_SPLIT value.


Or maybe I am misunderstanding your point.

> 
>> (Patch for the counter part of [1] was resent 10 days ago on netdev@ by Govindarajulu Varadarajan)
>> (Not sure if this has been merged yet)
> 
> Not yet, I want to take another look in the rest of this week.
> 
> Michal
> 
>> [1]
>>
>> commit f0db9b073415848709dd59a6394969882f517da9
>> Author: Govindarajulu Varadarajan <_govind@gmx.com>
>> Date:   Wed Sep 3 03:17:20 2014 +0530
>>
>>     ethtool: Add generic options for tunables
>>     
>>     This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
>>     tunable values from driver.
>>     
>>     Add get_tunable and set_tunable to ethtool_ops. Driver implements these
>>     functions for getting/setting tunable value.
>>     
>>     Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-06-18 22:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-18 16:09 [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 01/21] mm: add {add|release}_memory_pages Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 02/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 03/21] tcp: Pad TCP options out to a fixed size Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 04/21] mlx5: add definitions for header split and netgpu Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 05/21] mlx5/xsk: check that xsk does not conflict with netgpu Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 06/21] mlx5: add header_split flag Jonathan Lemon
2020-06-18 18:12   ` Eric Dumazet
2020-06-18 20:25     ` Michal Kubecek
2020-06-18 22:45       ` Eric Dumazet
2020-06-18 21:50     ` Jonathan Lemon
2020-06-18 22:34       ` Eric Dumazet
2020-06-18 22:36       ` Eric Dumazet
2020-06-18 16:09 ` [RFC PATCH 07/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 08/21] misc: add shqueue.h for prototyping Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 09/21] include: add definitions for netgpu Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 10/21] mlx5: add netgpu queue functions Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 11/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 12/21] mlx5: hook up the netgpu channel functions Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 13/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 14/21] kernel: export free_uid Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 15/21] netgpu: add network/gpu dma module Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 16/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 17/21] net/core: add the SO_REGISTER_DMA socket option Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 18/21] tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 19/21] core: add page recycling logic for netgpu pages Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
2020-06-18 16:09 ` [RFC PATCH 21/21] mlx5: add XDP_SETUP_NETGPU hook Jonathan Lemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).