* [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU.
@ 2020-07-27 22:44 Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
                   ` (21 more replies)
  0 siblings, 22 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

[ RESENDING, as apparently the initial submission was lost ]

This series is a working RFC proof-of-concept that implements DMA
zero-copy between the NIC and a GPU device for the data path, while
keeping the protocol processing on the host CPU.

This also works for zero-copy send/recv to host (CPU) memory.

Current limitations:
  - mlx5 only, header splitting is at a fixed offset.
  - currently only TCP protocol delivery is performed.
  - TX completion notification is planned, but not in this patchset.
  - not compatible with xsk (re-uses the same data structures)
  - not compatible with bpf payload inspection

Changes since v1:
  - user api restructured to provide more flexibility.
  - iommu issues resolved
  - performance issues fixed.
  - nvidia bits are built as an external module.

The next section provides a brief overview of how things work.

A transport context (aka RSS object) is created for a device, which acts
as a container for a group of queues.  Queues are opened on a context;
these correspond to hardware queues on the NIC.  A specific numbered
queue can be requested, or "any" queue.  Only RX queues are needed; the
standard TX queues are used for packet transmission.
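
As a rough illustration of the intended flow (a minimal sketch of my
own, not the library's exact calls; the device node path and the
ATTACH_DEV argument form are assumptions inferred from the UAPI, the
real wrappers live in the userspace library referenced below):

    #include <stdint.h>
    #include <fcntl.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <misc/netgpu.h>        /* UAPI added by this series */

    int ctx_fd = open("/dev/netgpu", O_RDWR);   /* path is a guess */

    /* attach the context to a netdevice; argument form inferred
     * from the _IOR(0, 1, int) definition */
    int ifindex = if_nametoindex("eth0");
    ioctl(ctx_fd, NETGPU_CTX_IOCTL_ATTACH_DEV, &ifindex);

    /* bind an RX queue; -1 selects "any" queue */
    struct netgpu_ifq_param ifq = {
        .queue_id     = (unsigned)-1,
        .fill.elt_sz  = sizeof(uint64_t),
        .fill.entries = 1024,
    };
    ioctl(ctx_fd, NETGPU_CTX_IOCTL_BIND_QUEUE, &ifq);
    /* on return: ifq.ifq_fd and the actual ifq.queue_id */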

Memory regions which participate in zero-copy transmission (either send
or receive) are registered with a memory object, and a specific region
can then be attached to a context for use.  This way, multiple contexts
can share the same memory regions.  These areas can be used as either
RX packet buffers or TX data areas (or both).  The memory can come from
either malloc/mmap or cudaMalloc().  The latter call provides a handle
to the userspace application, but the memory region is only accessible
to the GPU.
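
Continuing the sketch (again my own illustration; where mem_fd comes
from is an assumption, and mem_idx is assumed to identify the region
added here):

    #include <stdlib.h>
    #include <sys/uio.h>

    size_t region_len = 1 << 21;
    void *buf = aligned_alloc(4096, region_len);

    struct netgpu_region_param rp = {
        .iov     = { .iov_base = buf, .iov_len = region_len },
        .memtype = MEMTYPE_HOST,    /* MEMTYPE_CUDA for GPU memory */
    };
    ioctl(mem_fd, NETGPU_MEM_IOCTL_ADD_REGION, &rp);

    struct netgpu_attach_param ap = {
        .mem_fd  = mem_fd,
        .mem_idx = 0,
    };
    ioctl(ctx_fd, NETGPU_CTX_IOCTL_ATTACH_REGION, &ap);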

A socket is created and registered with the context.  This sets
SOCK_ZEROCOPY and also creates a per-socket queue for receiving
zero-copy data.
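
A hedged sketch of the socket step (sock_fd is an already-connected TCP
socket; the rx/cq element sizes are assumptions of mine):

    struct netgpu_socket_param sp = {
        .ctx_fd     = ctx_fd,
        .rx.elt_sz  = sizeof(struct iovec),
        .rx.entries = 1024,
        .cq.elt_sz  = sizeof(uint64_t),
        .cq.entries = 1024,
    };
    ioctl(sock_fd, NETGPU_SOCK_IOCTL_ATTACH_QUEUES, &sp);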

Asymmetrical data paths (zc TX with normal RX, and vice versa) are
possible, but the current PoC sets things up for symmetrical transport.
application needs to provide the RX buffers to the ifq fill queue,
similar to AF_XDP.
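
Posting fill buffers might look like this (sketch only, reusing the ifq
and buf variables from the sketches above; the sq_* helpers come from
the shared shqueue.h header added in this series):

    #include <sys/mman.h>

    void *m = mmap(NULL, ifq.fill.map_sz, PROT_READ | PROT_WRITE,
                   MAP_SHARED, ifq.ifq_fd, ifq.fill.map_off);

    struct shared_queue fill = {
        .prod    = (unsigned *)((char *)m + ifq.fill.off.prod),
        .cons    = (unsigned *)((char *)m + ifq.fill.off.cons),
        .data    = (unsigned char *)m + ifq.fill.off.data,
        .elt_sz  = ifq.fill.elt_sz,
        .mask    = ifq.fill.mask,
        .entries = ifq.fill.entries,
    };

    uint64_t *slot = sq_prod_reserve(&fill);
    if (slot)
        *slot = (uintptr_t)buf;     /* buffer address inside the region */
    sq_prod_submit(&fill);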

Once things are set up, data is sent to the network with sendmsg().  The
iovecs provided contain an address in the region previously registered.
The normal protocol stack processing constructs the packet, but the data
is not touched by the stack.  At this stage, the application is not yet
notified when protocol processing is complete and the data area is safe
to modify again (TX completion notification is planned, but not part of
this patchset).
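
A minimal TX sketch (MSG_NETDMA is added later in this series; whether
any additional flags are required is not shown here):

    #include <sys/socket.h>

    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    sendmsg(sock_fd, &msg, MSG_NETDMA);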

For RX, packets undergo the usual protocol processing and are delivered
up to the socket receive queue.  At this point, the skb data fragments
are delivered to the application as iovecs through an AF_XDP style queue
which belongs to the socket.  The application can poll for readability,
but does not use read() to receive the data.
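
A minimal RX sketch (rxq is assumed to be the socket's rx queue, mapped
via mmap() on the socket fd in the same way as the fill queue above;
the element layout is my assumption):

    #include <poll.h>

    struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };
    poll(&pfd, 1, -1);

    struct iovec *riov;
    while ((riov = sq_cons_peek(&rxq)) != NULL) {
        /* riov describes one received fragment inside the region */
        sq_cons_advance(&rxq);
    }
    sq_cons_complete(&rxq);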

The initial application used is iperf3; a modified version, together
with the userspace library corresponding to this patchset, is available
at:
    https://github.com/jlemon/iperf
    https://github.com/jlemon/netgpu

Running "iperf3 -s -z --dport 8888" (host memory) on a 12Gbps link:
    11.4 Gbit/sec receive
    10.8 Gbit/sec transmit

Running "iperf3 -s -z --dport 8888 --gpu" on a 25Gbps link:
    23.4 Gbit/sec receive
    23.8 Gbit/sec transmit

For the GPU runs, the Intel PCI monitoring tools were used to confirm
that the host PCI bus was mostly idle. 

Patch series:
  1,4    : cleanup & extension
  2,3    : extend mm, allowing allocation of specific memory regions
  5,6,7  : add include support files
  8,9,11 : skbuff core changes for handling zc frags
  10,21  : netgpu and nvidia code
  12-15  : changes to connect up netgpu code
  16-20  : mlx5 driver changes

Comments eagerly solicited.
--
Jonathan


Jonathan Lemon (21):
  linux/log2.h: enclose macro arg in parens
  mm/memory_hotplug: add {add|release}_memory_pages
  mm: Allow DMA mapping of pages which are not online
  kernel/user: export free_uid
  uapi/misc: add shqueue.h for shared queues
  include: add netgpu UAPI and kernel definitions
  netdevice: add SETUP_NETGPU to the netdev_bpf structure
  skbuff: add a zc_netgpu bitflag
  core/skbuff: use skb_zdata for testing whether skb is zerocopy
  netgpu: add network/gpu/host dma module
  core/skbuff: add page recycling logic for netgpu pages
  lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
  net/tcp: Pad TCP options out to a fixed size for netgpu
  net/tcp: add netgpu ioctl setting up zero copy RX queues
  net/tcp: add MSG_NETDMA flag for sendmsg()
  mlx5: remove the umem parameter from mlx5e_open_channel
  mlx5e: add header split ability
  mlx5e: add netgpu entries to mlx5 structures
  mlx5e: add the netgpu driver functions
  mlx5e: hook up the netgpu functions
  netgpu/nvidia: add Nvidia plugin for netgpu

 drivers/misc/Kconfig                          |    1 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/netgpu/Kconfig                   |   14 +
 drivers/misc/netgpu/Makefile                  |    6 +
 drivers/misc/netgpu/netgpu_host.c             |  284 ++++
 drivers/misc/netgpu/netgpu_main.c             | 1215 +++++++++++++++++
 drivers/misc/netgpu/netgpu_mem.c              |  351 +++++
 drivers/misc/netgpu/netgpu_priv.h             |   88 ++
 drivers/misc/netgpu/netgpu_stub.c             |  166 +++
 drivers/misc/netgpu/netgpu_stub.h             |   19 +
 drivers/misc/netgpu/nvidia/Kbuild             |    9 +
 drivers/misc/netgpu/nvidia/Kconfig            |   10 +
 drivers/misc/netgpu/nvidia/netgpu_cuda.c      |  416 ++++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |    1 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   21 +-
 .../mellanox/mlx5/core/en/netgpu/setup.c      |  340 +++++
 .../mellanox/mlx5/core/en/netgpu/setup.h      |   96 ++
 .../ethernet/mellanox/mlx5/core/en/params.c   |    3 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    9 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |    3 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |    4 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |    3 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  121 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   58 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   19 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   16 +-
 include/linux/dma-mapping.h                   |    4 +-
 include/linux/log2.h                          |    2 +-
 include/linux/memory_hotplug.h                |    4 +
 include/linux/mmzone.h                        |    7 +
 include/linux/netdevice.h                     |    6 +
 include/linux/skbuff.h                        |   27 +-
 include/linux/socket.h                        |    1 +
 include/linux/uio.h                           |    4 +
 include/net/netgpu.h                          |   66 +
 include/uapi/misc/netgpu.h                    |   69 +
 include/uapi/misc/shqueue.h                   |  200 +++
 kernel/user.c                                 |    1 +
 lib/iov_iter.c                                |   53 +
 mm/memory_hotplug.c                           |   65 +-
 net/core/datagram.c                           |    9 +-
 net/core/skbuff.c                             |   50 +-
 net/ipv4/tcp.c                                |   13 +
 net/ipv4/tcp_output.c                         |   20 +
 45 files changed, 3827 insertions(+), 49 deletions(-)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/netgpu_host.c
 create mode 100644 drivers/misc/netgpu/netgpu_main.c
 create mode 100644 drivers/misc/netgpu/netgpu_mem.c
 create mode 100644 drivers/misc/netgpu/netgpu_priv.h
 create mode 100644 drivers/misc/netgpu/netgpu_stub.c
 create mode 100644 drivers/misc/netgpu/netgpu_stub.h
 create mode 100644 drivers/misc/netgpu/nvidia/Kbuild
 create mode 100644 drivers/misc/netgpu/nvidia/Kconfig
 create mode 100644 drivers/misc/netgpu/nvidia/netgpu_cuda.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
 create mode 100644 include/net/netgpu.h
 create mode 100644 include/uapi/misc/netgpu.h
 create mode 100644 include/uapi/misc/shqueue.h

-- 
2.24.1


* [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 02/21] mm/memory_hotplug: add {add|release}_memory_pages Jonathan Lemon
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

roundup_pow_of_two uses its arg without enclosing it in parens.

A call of the form:

   roundup_pow_of_two(boolval ? PAGE_SIZE : frag_size)

resulted in a compile warning:

warning: ?: using integer constants in boolean context [-Wint-in-bool-context]
              PAGE_SIZE :
../include/linux/log2.h:176:4: note: in definition of macro ‘roundup_pow_of_two’
   (n == 1) ? 1 :  \
    ^
And the resulting code used '1' as the result of the operation.

Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant")

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/log2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 83a4a3ca3e8a..c619ec6eff4a 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -173,7 +173,7 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
 #define roundup_pow_of_two(n)			\
 (						\
 	__builtin_constant_p(n) ? (		\
-		(n == 1) ? 1 :			\
+		((n) == 1) ? 1 :		\
 		(1UL << (ilog2((n) - 1) + 1))	\
 				   ) :		\
 	__roundup_pow_of_two(n)			\
-- 
2.24.1


* [RFC PATCH v2 02/21] mm/memory_hotplug: add {add|release}_memory_pages
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

These calls allow creation of system pages at a specific physical
address, which is useful for creating dummy backing pages which
correspond to unaddressable external memory at specific locations.

__add_memory_pages() adds the requested page range to /proc/iomem and
verifies that there are no overlaps.  Once this succeeds, the page
section is initialized; this may overlap with a prior request, since
section sizes are large, so such an overlap is ignored.
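
For illustration only (not part of this patch), a driver could use the
new calls roughly as follows; gpu_phys_base and gpu_len are
hypothetical values describing an external memory aperture:

    struct mhp_params params = { .pgprot = PAGE_KERNEL };
    struct resource *res;

    /* create dummy struct pages backing the external aperture */
    res = add_memory_pages(numa_node_id(), gpu_phys_base, gpu_len,
                           &params);
    if (IS_ERR(res))
        return PTR_ERR(res);

    /* ... pfn_to_page() now works for the range ... */

    release_memory_pages(res);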

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/memory_hotplug.h |  4 +++
 mm/memory_hotplug.c            | 65 ++++++++++++++++++++++++++++++++--
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd8..05e012e1a203 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -138,6 +138,10 @@ extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		       struct mhp_params *params);
 
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+                                  struct mhp_params *params);
+void release_memory_pages(struct resource *res);
+
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct mhp_params *params)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..c1a923189869 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -125,8 +125,8 @@ static struct resource *register_memory_resource(u64 start, u64 size,
 			       resource_name, flags);
 
 	if (!res) {
-		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
-				start, start + size);
+		pr_debug("Unable to reserve %s region: %016llx->%016llx\n",
+				resource_name, start, start + size);
 		return ERR_PTR(-EEXIST);
 	}
 	return res;
@@ -1118,6 +1118,67 @@ int add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(add_memory);
 
+static int __ref add_memory_section(int nid, struct resource *res,
+				    struct mhp_params *params)
+{
+	u64 start, end, section_size;
+	int ret;
+
+	/* must align start/end with memory block size */
+	end = res->start + resource_size(res);
+	section_size = memory_block_size_bytes();
+	start = round_down(res->start, section_size);
+	end = round_up(end, section_size);
+
+	mem_hotplug_begin();
+	ret = __add_pages(nid,
+		PHYS_PFN(start), PHYS_PFN(end - start), params);
+	mem_hotplug_done();
+
+	return ret;
+}
+
+/* requires device_hotplug_lock, see add_memory_resource() */
+static struct resource * __ref __add_memory_pages(int nid, u64 start, u64 size,
+				    struct mhp_params *params)
+{
+	struct resource *res;
+	int ret;
+
+	res = register_memory_resource(start, size, "Private RAM");
+	if (IS_ERR(res))
+		return res;
+
+	ret = add_memory_section(nid, res, params);
+	if (ret < 0 && ret != -EEXIST) {
+		release_memory_resource(res);
+		return ERR_PTR(ret);
+	}
+
+	return res;
+}
+
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+				  struct mhp_params *params)
+{
+	struct resource *res;
+
+	lock_device_hotplug();
+	res = __add_memory_pages(nid, start, size, params);
+	unlock_device_hotplug();
+
+	return res;
+}
+EXPORT_SYMBOL_GPL(add_memory_pages);
+
+void release_memory_pages(struct resource *res)
+{
+	lock_device_hotplug();
+	release_memory_resource(res);
+	unlock_device_hotplug();
+}
+EXPORT_SYMBOL_GPL(release_memory_pages);
+
 /*
  * Add special, driver-managed memory to the system as system RAM. Such
  * memory is not exposed via the raw firmware-provided memmap as system
-- 
2.24.1


* [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 02/21] mm/memory_hotplug: add {add|release}_memory_pages Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 04/21] kernel/user: export free_uid Jonathan Lemon
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

Change the system RAM check from 'valid' to 'online', so dummy
pages which refer to external DMA resources can be mapped.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/dma-mapping.h | 4 ++--
 include/linux/mmzone.h      | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a33ed3954ed4..e9b1a8431568 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -348,8 +348,8 @@ static inline dma_addr_t dma_map_resource(struct device *dev,
 
 	BUG_ON(!valid_dma_direction(dir));
 
-	/* Don't allow RAM to be mapped */
-	if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
+	/* Don't allow online RAM to be mapped */
+	if (WARN_ON_ONCE(pfn_online(PHYS_PFN(phys_addr))))
 		return DMA_MAPPING_ERROR;
 
 	if (dma_is_direct(ops))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f884970511..d0c6fc553304 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1348,6 +1348,13 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 	return -1;
 }
 
+static inline int pfn_online(unsigned long pfn)
+{
+	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+		return 0;
+	return online_section(__nr_to_section(pfn_to_section_nr(pfn)));
+}
+
 /*
  * These are _only_ used during initialisation, therefore they
  * can use __initdata ...  They could have names to indicate
-- 
2.24.1


* [RFC PATCH v2 04/21] kernel/user: export free_uid
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (2 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 05/21] uapi/misc: add shqueue.h for shared queues Jonathan Lemon
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

get_uid is a static inline which can be called from a module, so
free_uid should also be callable from a module.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 kernel/user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/user.c b/kernel/user.c
index b1635d94a1f2..1e015abf0a2b 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -171,6 +171,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.24.1


* [RFC PATCH v2 05/21] uapi/misc: add shqueue.h for shared queues
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (3 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 04/21] kernel/user: export free_uid Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 06/21] include: add netgpu UAPI and kernel definitions Jonathan Lemon
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Shared queues between user and kernel use their own private structures
for accessing a shared data area, but they need to use the same queue
functions.

Have the kernel use the same UAPI file - this can be made private at
a later date if required.
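
For illustration (my sketch, not part of the patch; 'q' is an
initialized struct shared_queue and struct foo / use_value() are
placeholders for the element type and its consumer):

    /* producer side */
    struct foo *e = sq_prod_reserve(&q);
    if (e) {
        e->val = 42;
        sq_prod_submit(&q);     /* store-release of the producer index */
    }

    /* consumer side */
    struct foo *c;
    while ((c = sq_cons_peek(&q)) != NULL) {
        use_value(c->val);
        sq_cons_advance(&q);
    }
    sq_cons_complete(&q);       /* store-release of the consumer index */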

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/uapi/misc/shqueue.h | 200 ++++++++++++++++++++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 include/uapi/misc/shqueue.h

diff --git a/include/uapi/misc/shqueue.h b/include/uapi/misc/shqueue.h
new file mode 100644
index 000000000000..ff945942734c
--- /dev/null
+++ b/include/uapi/misc/shqueue.h
@@ -0,0 +1,200 @@
+#ifndef _UAPI_MISC_SHQUEUE_H
+#define _UAPI_MISC_SHQUEUE_H
+
+/* Placed under UAPI in order to avoid two identical copies between
+ * user and kernel space.
+ */
+
+/* user and kernel private copy - identical in order to share sq* fcns */
+struct shared_queue {
+	unsigned *prod;
+	unsigned *cons;
+	unsigned char *data;
+	unsigned elt_sz;
+	unsigned mask;
+	unsigned cached_prod;
+	unsigned cached_cons;
+	unsigned entries;
+
+	unsigned map_sz;
+	void *map_ptr;
+};
+
+/*
+ * see documentation in tools/include/linux/ring_buffer.h
+ * using  explicit smp_/_ONCE is an optimization over smp_{store|load}
+ */
+
+static inline void __sq_load_acquire_cons(struct shared_queue *q)
+{
+	/* Refresh the local tail pointer */
+	q->cached_cons = READ_ONCE(*q->cons);
+	/* A, matches D */
+}
+
+static inline void __sq_store_release_cons(struct shared_queue *q)
+{
+	smp_mb(); /* D, matches A */
+	WRITE_ONCE(*q->cons, q->cached_cons);
+}
+
+static inline void __sq_load_acquire_prod(struct shared_queue *q)
+{
+	/* Refresh the local pointer */
+	q->cached_prod = READ_ONCE(*q->prod);
+	smp_rmb(); /* C, matches B */
+}
+
+static inline void __sq_store_release_prod(struct shared_queue *q)
+{
+	smp_wmb(); /* B, matches C */
+	WRITE_ONCE(*q->prod, q->cached_prod);
+}
+
+static inline void sq_cons_refresh(struct shared_queue *q)
+{
+	__sq_store_release_cons(q);
+	__sq_load_acquire_prod(q);
+}
+
+static inline bool sq_is_empty(struct shared_queue *q)
+{
+	return READ_ONCE(*q->prod) == READ_ONCE(*q->cons);
+}
+
+static inline bool sq_cons_empty(struct shared_queue *q)
+{
+	return q->cached_prod == q->cached_cons;
+}
+
+static inline unsigned __sq_cons_ready(struct shared_queue *q)
+{
+	return q->cached_prod - q->cached_cons;
+}
+
+static inline unsigned sq_cons_ready(struct shared_queue *q)
+{
+	if (sq_cons_empty(q))
+		__sq_load_acquire_prod(q);
+
+	return __sq_cons_ready(q);
+}
+
+static inline bool sq_cons_avail(struct shared_queue *q, unsigned count)
+{
+	if (count <= __sq_cons_ready(q))
+		return true;
+	__sq_load_acquire_prod(q);
+	return count <= __sq_cons_ready(q);
+}
+
+static inline void *sq_get_ptr(struct shared_queue *q, unsigned idx)
+{
+	return q->data + (idx & q->mask) * q->elt_sz;
+}
+
+static inline void sq_cons_complete(struct shared_queue *q)
+{
+	__sq_store_release_cons(q);
+}
+
+static inline void *sq_cons_peek(struct shared_queue *q)
+{
+	if (sq_cons_empty(q)) {
+		sq_cons_refresh(q);
+		if (sq_cons_empty(q))
+			return NULL;
+	}
+	return sq_get_ptr(q, q->cached_cons);
+}
+
+static inline unsigned
+sq_peek_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+	unsigned i, idx, ready;
+
+	ready = sq_cons_ready(q);
+	if (!ready)
+		return 0;
+
+	count = count > ready ? ready : count;
+
+	idx = q->cached_cons;
+	for (i = 0; i < count; i++)
+		ptr[i] = sq_get_ptr(q, idx++);
+
+	q->cached_cons += count;
+
+	return count;
+}
+
+static inline unsigned
+sq_cons_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+	unsigned i, idx, ready;
+
+	ready = sq_cons_ready(q);
+	if (!ready)
+		return 0;
+
+	count = count > ready ? ready : count;
+
+	idx = q->cached_cons;
+	for (i = 0; i < count; i++)
+		ptr[i] = sq_get_ptr(q, idx++);
+
+	q->cached_cons += count;
+	sq_cons_complete(q);
+
+	return count;
+}
+
+static inline void sq_cons_advance(struct shared_queue *q)
+{
+	q->cached_cons++;
+}
+
+static inline unsigned __sq_prod_space(struct shared_queue *q)
+{
+	return q->entries - (q->cached_prod - q->cached_cons);
+}
+
+static inline unsigned sq_prod_space(struct shared_queue *q)
+{
+	unsigned space;
+
+	space = __sq_prod_space(q);
+	if (!space) {
+		__sq_load_acquire_cons(q);
+		space = __sq_prod_space(q);
+	}
+	return space;
+}
+
+static inline bool sq_prod_avail(struct shared_queue *q, unsigned count)
+{
+	if (count <= __sq_prod_space(q))
+		return true;
+	__sq_load_acquire_cons(q);
+	return count <= __sq_prod_space(q);
+}
+
+static inline void *sq_prod_get_ptr(struct shared_queue *q)
+{
+	return sq_get_ptr(q, q->cached_prod++);
+}
+
+static inline void *sq_prod_reserve(struct shared_queue *q)
+{
+	if (!sq_prod_space(q))
+		return NULL;
+
+	return sq_prod_get_ptr(q);
+}
+
+static inline void sq_prod_submit(struct shared_queue *q)
+{
+	__sq_store_release_prod(q);
+}
+
+#endif /* _UAPI_MISC_SHQUEUE_H */
-- 
2.24.1


* [RFC PATCH v2 06/21] include: add netgpu UAPI and kernel definitions
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (4 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 05/21] uapi/misc: add shqueue.h for shared queues Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 07/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

This provides the interface to the netgpu module.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/net/netgpu.h       | 66 ++++++++++++++++++++++++++++++++++++
 include/uapi/misc/netgpu.h | 69 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 include/net/netgpu.h
 create mode 100644 include/uapi/misc/netgpu.h

diff --git a/include/net/netgpu.h b/include/net/netgpu.h
new file mode 100644
index 000000000000..14bd19412c38
--- /dev/null
+++ b/include/net/netgpu.h
@@ -0,0 +1,66 @@
+#ifndef _NET_NETGPU_H
+#define _NET_NETGPU_H
+
+#include <uapi/misc/netgpu.h>		/* IOCTL defines */
+#include <uapi/misc/shqueue.h>
+
+enum {
+	NETGPU_MEMTYPE_HOST,
+	NETGPU_MEMTYPE_CUDA,
+
+	NETGPU_MEMTYPE_MAX,
+};
+
+struct netgpu_pgcache {
+	struct netgpu_pgcache *next;
+	struct page *page[];
+};
+
+struct netgpu_ifq {
+	struct shared_queue fill;
+	struct wait_queue_head fill_wait;
+	struct netgpu_ctx *ctx;
+	int queue_id;
+	spinlock_t pgcache_lock;
+	struct netgpu_pgcache *napi_cache;
+	struct netgpu_pgcache *spare_cache;
+	struct netgpu_pgcache *any_cache;
+	int napi_cache_count;
+	int any_cache_count;
+	struct list_head ifq_node;
+};
+
+struct netgpu_skq {
+	struct shared_queue rx;
+	struct shared_queue cq;		/* for requested completions */
+	struct netgpu_ctx *ctx;
+	void (*sk_destruct)(struct sock *sk);
+	void (*sk_data_ready)(struct sock *sk);
+};
+
+struct netgpu_ctx {
+	struct xarray xa;		/* contains dmamaps */
+	refcount_t ref;
+	struct net_device *dev;
+	struct list_head ifq_list;
+};
+
+struct net_device;
+struct netgpu_ops;
+struct socket;
+
+dma_addr_t netgpu_get_dma(struct netgpu_ctx *ctx, struct page *page);
+int netgpu_get_page(struct netgpu_ifq *ifq, struct page **page,
+		    dma_addr_t *dma);
+void netgpu_put_page(struct netgpu_ifq *ifq, struct page *page, bool napi);
+int netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		     int count);
+
+int netgpu_socket_mmap(struct file *file, struct socket *sock,
+		       struct vm_area_struct *vma);
+int netgpu_attach_socket(struct sock *sk, void __user *arg);
+
+int netgpu_register(struct netgpu_ops *ops);
+void netgpu_unregister(int memtype);
+
+#endif /* _NET_NETGPU_H */
diff --git a/include/uapi/misc/netgpu.h b/include/uapi/misc/netgpu.h
new file mode 100644
index 000000000000..1fa8a1d719ee
--- /dev/null
+++ b/include/uapi/misc/netgpu.h
@@ -0,0 +1,69 @@
+#ifndef _UAPI_MISC_NETGPU_H
+#define _UAPI_MISC_NETGPU_H
+
+#include <linux/ioctl.h>
+
+#define NETGPU_OFF_FILL_ID	(0ULL << 12)
+#define NETGPU_OFF_RX_ID	(1ULL << 12)
+#define NETGPU_OFF_CQ_ID	(2ULL << 12)
+
+struct netgpu_queue_offsets {
+	unsigned prod;
+	unsigned cons;
+	unsigned data;
+	unsigned resv;
+};
+
+struct netgpu_user_queue {
+	unsigned elt_sz;
+	unsigned entries;
+	unsigned mask;
+	unsigned map_sz;
+	unsigned map_off;
+	struct netgpu_queue_offsets off;
+};
+
+enum netgpu_memtype {
+	MEMTYPE_HOST,
+	MEMTYPE_CUDA,
+
+	MEMTYPE_MAX,
+};
+
+/* VA memory provided by a specific PCI device. */
+struct netgpu_region_param {
+	struct iovec iov;
+	enum netgpu_memtype memtype;
+};
+
+struct netgpu_attach_param {
+	int mem_fd;
+	int mem_idx;
+};
+
+struct netgpu_socket_param {
+	unsigned resv;
+	int ctx_fd;
+	struct netgpu_user_queue rx;
+	struct netgpu_user_queue cq;
+};
+
+struct netgpu_ifq_param {
+	unsigned resv;
+	unsigned ifq_fd;		/* OUT parameter */
+	unsigned queue_id;		/* IN/OUT, IN: -1 if don't care */
+	struct netgpu_user_queue fill;
+};
+
+struct netgpu_ctx_param {
+	unsigned resv;
+	unsigned ifindex;
+};
+
+#define NETGPU_CTX_IOCTL_ATTACH_DEV	_IOR( 0, 1, int)
+#define NETGPU_CTX_IOCTL_BIND_QUEUE	_IOWR(0, 2, struct netgpu_ifq_param)
+#define NETGPU_CTX_IOCTL_ATTACH_REGION	_IOW( 0, 3, struct netgpu_attach_param)
+#define NETGPU_MEM_IOCTL_ADD_REGION	_IOR( 0, 4, struct netgpu_region_param)
+#define NETGPU_SOCK_IOCTL_ATTACH_QUEUES	(SIOCPROTOPRIVATE + 0)
+
+#endif /* _UAPI_MISC_NETGPU_H */
-- 
2.24.1


* [RFC PATCH v2 07/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (5 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 06/21] include: add netgpu UAPI and kernel definitions Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

This command will be used to setup/tear down netgpu queues.
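
For illustration (sketch only; drv_netgpu_setup() is a hypothetical
driver helper, the actual mlx5 hookup arrives later in the series), a
driver would dispatch the new command from its ndo_bpf much like
XDP_SETUP_XSK_UMEM:

    static int drv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
    {
        switch (bpf->command) {
        case XDP_SETUP_NETGPU:
            /* a NULL ifq means "tear down this queue" */
            return drv_netgpu_setup(dev, bpf->netgpu.ifq,
                                    bpf->netgpu.queue_id);
        default:
            return -EINVAL;
        }
    }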

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac2cd3f49aba..df72c762e562 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -882,6 +882,7 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
 	XDP_SETUP_XSK_UMEM,
+	XDP_SETUP_NETGPU,
 };
 
 struct bpf_prog_offload_ops;
@@ -913,6 +914,11 @@ struct netdev_bpf {
 			struct xdp_umem *umem;
 			u16 queue_id;
 		} xsk;
+		/* XDP_SETUP_NETGPU */
+		struct {
+			struct netgpu_ifq *ifq;
+			u16 queue_id;
+		} netgpu;
 	};
 };
 
-- 
2.24.1


* [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (6 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 07/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 09/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

This could likely be moved elsewhere.  The presence of the flag on
the skb indicates that one of the fragments may contain zerocopy
RX data, where the data is not accessible to the CPU.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 3 ++-
 net/core/skbuff.c      | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fa817a105517..006e10fcc7d9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -782,7 +782,8 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
-				pfmemalloc:1;
+				pfmemalloc:1,
+				zc_netgpu:1;
 #ifdef CONFIG_SKB_EXTENSIONS
 	__u8			active_extensions;
 #endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b8afefe6f6b6..2a391042be53 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -992,6 +992,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	n->cloned = 1;
 	n->nohdr = 0;
 	n->peeked = 0;
+	C(zc_netgpu);
 	C(pfmemalloc);
 	n->destructor = NULL;
 	C(tail);
-- 
2.24.1


* [RFC PATCH v2 09/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (7 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module Jonathan Lemon
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

The skb_zcopy() flag indicates whether the skb has a zerocopy ubuf
attached.  netgpu does not use ubufs, so add skb_zdata(), which
indicates whether the skb is carrying zero-copy data and should be
left alone, while skb_zcopy() continues to indicate an attached ubuf.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 24 +++++++++++++++++++++++-
 net/core/skbuff.c      | 17 ++++++++++++++++-
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 006e10fcc7d9..017c20792c23 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -443,8 +443,12 @@ enum {
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* fragments are accessed only via DMA */
+	SKBTX_DEV_NETDMA = 1 << 7,
 };
 
+#define SKBTX_ZERODATA_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_DEV_NETDMA)
 #define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
@@ -1420,6 +1424,24 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline bool skb_netdma(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_NETDMA;
+}
+
+static inline bool skb_zdata(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_ZERODATA_FRAG;
+}
+
+static inline void skb_netdma_set(struct sk_buff *skb, void *arg)
+{
+	if (skb && arg) {
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_NETDMA;
+		skb_shinfo(skb)->destructor_arg = arg;
+	}
+}
+
 static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 {
 	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
@@ -3264,7 +3286,7 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
-	if (skb_zcopy(skb))
+	if (skb_zdata(skb))
 		return false;
 	if (i) {
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a391042be53..1422b99b7090 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#include <net/netgpu.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -1300,6 +1301,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 	}
 
 	skb_zcopy_set(skb, uarg, NULL);
+	skb_netdma_set(skb, sk->sk_user_data);
+
 	return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1307,6 +1310,16 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
+	if (skb_netdma(orig)) {
+		if (skb_netdma(nskb)) {
+			WARN_ONCE(1, "zc clone, dst skb is set\n");
+			if (skb_uarg(nskb) != skb_uarg(orig))
+				return -EIO;
+		}
+		skb_netdma_set(nskb, skb_shinfo(orig)->destructor_arg);
+		return 0;
+	}
+
 	if (skb_zcopy(orig)) {
 		if (skb_zcopy(nskb)) {
 			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
@@ -2055,6 +2068,8 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	 */
 	int i, k, eat = (skb->tail + delta) - skb->end;
 
+	BUG_ON(skb_netdma(skb));
+
 	if (eat > 0 || skb_cloned(skb)) {
 		if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
 				     GFP_ATOMIC))
@@ -3305,7 +3320,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
-	if (skb_zcopy(tgt) || skb_zcopy(skb))
+	if (skb_zdata(tgt) || skb_zdata(skb))
 		return 0;
 
 	todo = shiftlen;
-- 
2.24.1


* [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (8 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 09/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28 16:26   ` Greg KH
  2020-07-27 22:44 ` [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages Jonathan Lemon
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Netgpu provides a data path for zero-copy sends and receives
without having the host CPU touch the data.  Protocol processing
is done on the host CPU, while data is DMA'd to and from DMA
mapped memory areas.  The initial code provides transfers between
(mlx5 / host memory) and (mlx5 / nvidia GPU memory).

The use case for this module is GPUs used for machine learning, which
are located near the NICs and have a high-bandwidth PCI connection
between the GPU and NIC.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/misc/Kconfig              |    1 +
 drivers/misc/Makefile             |    1 +
 drivers/misc/netgpu/Kconfig       |   14 +
 drivers/misc/netgpu/Makefile      |    6 +
 drivers/misc/netgpu/netgpu_host.c |  284 +++++++
 drivers/misc/netgpu/netgpu_main.c | 1215 +++++++++++++++++++++++++++++
 drivers/misc/netgpu/netgpu_mem.c  |  351 +++++++++
 drivers/misc/netgpu/netgpu_priv.h |   88 +++
 drivers/misc/netgpu/netgpu_stub.c |  166 ++++
 drivers/misc/netgpu/netgpu_stub.h |   19 +
 10 files changed, 2145 insertions(+)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/netgpu_host.c
 create mode 100644 drivers/misc/netgpu/netgpu_main.c
 create mode 100644 drivers/misc/netgpu/netgpu_mem.c
 create mode 100644 drivers/misc/netgpu/netgpu_priv.h
 create mode 100644 drivers/misc/netgpu/netgpu_stub.c
 create mode 100644 drivers/misc/netgpu/netgpu_stub.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index e1b1ba5e2b92..13ae8e55d2a2 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -472,4 +472,5 @@ source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 source "drivers/misc/habanalabs/Kconfig"
 source "drivers/misc/uacce/Kconfig"
+source "drivers/misc/netgpu/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index c7bd01ac6291..216da8b84c86 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
 obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
+obj-y				+= netgpu/
diff --git a/drivers/misc/netgpu/Kconfig b/drivers/misc/netgpu/Kconfig
new file mode 100644
index 000000000000..5d8f27ed3a19
--- /dev/null
+++ b/drivers/misc/netgpu/Kconfig
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# NetGPU framework
+#
+config NETGPU
+	tristate "Network/GPU driver"
+	depends on PCI
+	imply NETGPU_STUB
+	help
+	  Experimental Network / GPU driver
+
+config NETGPU_STUB
+	bool
+	depends on NETGPU = m
diff --git a/drivers/misc/netgpu/Makefile b/drivers/misc/netgpu/Makefile
new file mode 100644
index 000000000000..bec4eb5ea04f
--- /dev/null
+++ b/drivers/misc/netgpu/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_NETGPU) := netgpu.o
+netgpu-y := netgpu_mem.o netgpu_main.o netgpu_host.o
+
+obj-$(CONFIG_NETGPU_STUB) := netgpu_stub.o
diff --git a/drivers/misc/netgpu/netgpu_host.c b/drivers/misc/netgpu/netgpu_host.c
new file mode 100644
index 000000000000..ea84f8cae671
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_host.c
@@ -0,0 +1,284 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/indirect_call_wrapper.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+
+#include "netgpu_priv.h"
+
+struct netgpu_host_region {
+	struct netgpu_region r;				/* must be first */
+	struct page **page;
+};
+
+struct netgpu_host_dmamap {
+	struct netgpu_dmamap map;			/* must be first */
+	dma_addr_t dma[];
+};
+
+static inline struct netgpu_host_region *
+host_region(struct netgpu_region *r)
+{
+	return (struct netgpu_host_region *)r;
+}
+
+static inline struct netgpu_host_dmamap *
+host_map(struct netgpu_dmamap *map)
+{
+	return (struct netgpu_host_dmamap *)map;
+}
+
+/* Used by the lib/iov_iter to obtain a set of pages for TX */
+INDIRECT_CALLABLE_SCOPE int
+netgpu_host_get_pages(struct netgpu_region *r, struct page **pages,
+		      unsigned long addr, int count)
+{
+	unsigned long idx;
+	struct page *p;
+	int i, n;
+
+	idx = (addr - r->start) >> PAGE_SHIFT;
+	n = r->nr_pages - idx + 1;
+	n = min(count, n);
+
+	for (i = 0; i < n; i++) {
+		p = host_region(r)->page[idx + i];
+		get_page(p);
+		pages[i] = p;
+	}
+
+	return n;
+}
+
+INDIRECT_CALLABLE_SCOPE int
+netgpu_host_get_page(struct netgpu_dmamap *map, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	unsigned long idx;
+
+	idx = (addr - map->start) >> PAGE_SHIFT;
+
+	*dma = host_map(map)->dma[idx];
+	*page = host_region(map->r)->page[idx];
+	get_page(*page);
+
+	return 0;
+}
+
+INDIRECT_CALLABLE_SCOPE dma_addr_t
+netgpu_host_get_dma(struct netgpu_dmamap *map, unsigned long addr)
+{
+	unsigned long idx;
+
+	idx = (addr - map->start) >> PAGE_SHIFT;
+	return host_map(map)->dma[idx];
+}
+
+static void
+netgpu_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int
+netgpu_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void
+netgpu_host_unmap_region(struct netgpu_dmamap *map)
+{
+	int i;
+
+	for (i = 0; i < map->nr_pages; i++)
+		dma_unmap_page(map->device, host_map(map)->dma[i],
+			       PAGE_SIZE, DMA_BIDIRECTIONAL);
+}
+
+static struct netgpu_dmamap *
+netgpu_host_map_region(struct netgpu_region *r, struct device *device)
+{
+	struct netgpu_dmamap *map;
+	struct page *page;
+	dma_addr_t dma;
+	size_t sz;
+	int i;
+
+	sz = struct_size(host_map(map), dma, r->nr_pages);
+	map = kvmalloc(sz, GFP_KERNEL);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < r->nr_pages; i++) {
+		page = host_region(r)->page[i];
+		dma = dma_map_page(device, page, 0, PAGE_SIZE,
+				   DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(device, dma)))
+			goto out;
+
+		host_map(map)->dma[i] = dma;
+	}
+
+	return map;
+
+out:
+	while (i--)
+		dma_unmap_page(device, host_map(map)->dma[i], PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+
+	kvfree(map);
+	return ERR_PTR(-ENXIO);
+}
+
+/* NOTE: nr_pages may be negative on error. */
+static void
+netgpu_host_put_pages(struct netgpu_region *r, int nr_pages, bool clear)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = host_region(r)->page[i];
+		if (clear) {
+			ClearPagePrivate(page);
+			set_page_private(page, 0);
+		}
+		put_page(page);
+	}
+}
+
+static void
+netgpu_host_free_region(struct netgpu_mem *mem, struct netgpu_region *r)
+{
+
+	netgpu_host_put_pages(r, r->nr_pages, true);
+	if (mem->account_mem)
+		netgpu_unaccount_mem(mem->user, r->nr_pages);
+	kvfree(host_region(r)->page);
+	kfree(r);
+}
+
+static int
+netgpu_assign_page_addrs(struct netgpu_region *r)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < r->nr_pages; i++) {
+		page = host_region(r)->page[i];
+		if (PagePrivate(page))
+			goto out;
+		SetPagePrivate(page);
+		set_page_private(page, r->start + i * PAGE_SIZE);
+	}
+
+	return 0;
+
+out:
+	while (i--) {
+		page = host_region(r)->page[i];
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+	}
+
+	return -EEXIST;
+}
+
+static struct netgpu_region *
+netgpu_host_add_region(struct netgpu_mem *mem, const struct iovec *iov)
+{
+	struct netgpu_region *r;
+	int err, nr_pages;
+	u64 addr, len;
+	int count = 0;
+
+	err = -ENOMEM;
+	r = kzalloc(sizeof(struct netgpu_host_region), GFP_KERNEL);
+	if (!r)
+		return ERR_PTR(err);
+
+	addr = (u64)iov->iov_base;
+	r->start = round_down(addr, PAGE_SIZE);
+	len = round_up(addr - r->start + iov->iov_len, PAGE_SIZE);
+	nr_pages = len >> PAGE_SHIFT;
+
+	r->mem = mem;
+	r->nr_pages = nr_pages;
+	INIT_LIST_HEAD(&r->ctx_list);
+	INIT_LIST_HEAD(&r->dma_list);
+	spin_lock_init(&r->lock);
+
+	host_region(r)->page = kvmalloc_array(nr_pages, sizeof(struct page *),
+					      GFP_KERNEL);
+	if (!host_region(r)->page)
+		goto out;
+
+	if (mem->account_mem) {
+		err = netgpu_account_mem(mem->user, nr_pages);
+		if (err) {
+			nr_pages = 0;
+			goto out;
+		}
+	}
+
+	mmap_read_lock(current->mm);
+	count = pin_user_pages(r->start, nr_pages,
+			       FOLL_WRITE | FOLL_LONGTERM,
+			       host_region(r)->page, NULL);
+	mmap_read_unlock(current->mm);
+
+	if (count != nr_pages) {
+		err = count < 0 ? count : -EFAULT;
+		goto out;
+	}
+
+	err = netgpu_assign_page_addrs(r);
+	if (err)
+		goto out;
+
+	return r;
+
+out:
+	netgpu_host_put_pages(r, count, false);
+	if (mem->account_mem && nr_pages)
+		netgpu_unaccount_mem(mem->user, nr_pages);
+	kvfree(host_region(r)->page);
+	kfree(r);
+
+	return ERR_PTR(err);
+}
+
+struct netgpu_ops host_ops = {
+	.owner		= THIS_MODULE,
+	.memtype	= NETGPU_MEMTYPE_HOST,
+	.add_region	= netgpu_host_add_region,
+	.free_region	= netgpu_host_free_region,
+	.map_region	= netgpu_host_map_region,
+	.unmap_region	= netgpu_host_unmap_region,
+	.get_dma	= netgpu_host_get_dma,
+	.get_page	= netgpu_host_get_page,
+	.get_pages	= netgpu_host_get_pages,
+};
diff --git a/drivers/misc/netgpu/netgpu_main.c b/drivers/misc/netgpu/netgpu_main.c
new file mode 100644
index 000000000000..54264fb46d18
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_main.c
@@ -0,0 +1,1215 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/anon_inodes.h>
+#include <linux/indirect_call_wrapper.h>
+
+#include <net/tcp.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+#include "netgpu_priv.h"
+
+static struct mutex netgpu_lock;
+static const struct file_operations netgpu_fops;
+static void netgpu_free_ctx(struct netgpu_ctx *ctx);
+
+INDIRECT_CALLABLE_DECLARE(dma_addr_t
+	netgpu_host_get_dma(struct netgpu_dmamap *map, unsigned long addr));
+INDIRECT_CALLABLE_DECLARE(int
+	netgpu_host_get_page(struct netgpu_dmamap *map, unsigned long addr,
+			     struct page **page, dma_addr_t *dma));
+INDIRECT_CALLABLE_DECLARE(int
+	netgpu_host_get_pages(struct netgpu_region *r, struct page **pages,
+			      unsigned long addr, int count));
+
+#if IS_MODULE(CONFIG_NETGPU)
+#define MAYBE_EXPORT_SYMBOL(s)
+#else
+#define MAYBE_EXPORT_SYMBOL(s)	EXPORT_SYMBOL(s)
+#endif
+
+#define NETGPU_CACHE_COUNT	63
+
+enum netgpu_match_id {
+	NETGPU_MATCH_TCP6,
+	NETGPU_MATCH_UDP6,
+	NETGPU_MATCH_TCP,
+	NETGPU_MATCH_UDP,
+};
+
+struct netgpu_sock_match {
+	u16 family;
+	u16 type;
+	u16 protocol;
+	u16 initialized;
+	struct proto *base_prot;
+	const struct proto_ops *base_ops;
+	struct proto prot;
+	struct proto_ops ops;
+};
+
+static struct netgpu_sock_match netgpu_match_tbl[] = {
+	[NETGPU_MATCH_TCP6] = {
+		.family		= AF_INET6,
+		.type		= SOCK_STREAM,
+		.protocol	= IPPROTO_TCP,
+	},
+	[NETGPU_MATCH_UDP6] = {
+		.family		= AF_INET6,
+		.type		= SOCK_DGRAM,
+		.protocol	= IPPROTO_UDP,
+	},
+	[NETGPU_MATCH_TCP] = {
+		.family		= AF_INET,
+		.type		= SOCK_STREAM,
+		.protocol	= IPPROTO_TCP,
+	},
+	[NETGPU_MATCH_UDP] = {
+		.family		= AF_INET,
+		.type		= SOCK_DGRAM,
+		.protocol	= IPPROTO_UDP,
+	},
+};
+
+static void
+__netgpu_put_page_any(struct netgpu_ifq *ifq, struct page *page)
+{
+	struct netgpu_pgcache *cache = ifq->any_cache;
+	unsigned count;
+	size_t sz;
+
+	/* unsigned: count == -1 if !cache, so the check will fail. */
+	count = ifq->any_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		cache->page[count] = page;
+		ifq->any_cache_count = count + 1;
+		return;
+	}
+
+	sz = struct_size(cache, page, NETGPU_CACHE_COUNT);
+	cache = kmalloc(sz, GFP_ATOMIC);
+	if (!cache) {
+		/* XXX fixme */
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	cache->next = ifq->any_cache;
+
+	cache->page[0] = page;
+	ifq->any_cache = cache;
+	ifq->any_cache_count = 1;
+}
+
+static void
+netgpu_put_page_any(struct netgpu_ifq *ifq, struct page *page)
+{
+	spin_lock(&ifq->pgcache_lock);
+
+	__netgpu_put_page_any(ifq, page);
+
+	spin_unlock(&ifq->pgcache_lock);
+}
+
+static void
+netgpu_put_page_napi(struct netgpu_ifq *ifq, struct page *page)
+{
+	struct netgpu_pgcache *spare;
+	unsigned count;
+	size_t sz;
+
+	count = ifq->napi_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		ifq->napi_cache->page[count] = page;
+		ifq->napi_cache_count = count + 1;
+		return;
+	}
+
+	spare = ifq->spare_cache;
+	if (spare) {
+		ifq->spare_cache = NULL;
+		goto out;
+	}
+
+	sz = struct_size(spare, page, NETGPU_CACHE_COUNT);
+	spare = kmalloc(sz, GFP_ATOMIC);
+	if (!spare) {
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	spare->next = ifq->napi_cache;
+
+out:
+	spare->page[0] = page;
+	ifq->napi_cache = spare;
+	ifq->napi_cache_count = 1;
+}
+
+void
+netgpu_put_page(struct netgpu_ifq *ifq, struct page *page, bool napi)
+{
+	if (napi)
+		netgpu_put_page_napi(ifq, page);
+	else
+		netgpu_put_page_any(ifq, page);
+}
+MAYBE_EXPORT_SYMBOL(netgpu_put_page);
+
+static int
+netgpu_swap_caches(struct netgpu_ifq *ifq, struct netgpu_pgcache **cachep)
+{
+	int count;
+
+	spin_lock(&ifq->pgcache_lock);
+
+	count = ifq->any_cache_count;
+	*cachep = ifq->any_cache;
+	ifq->any_cache = ifq->napi_cache;
+	ifq->any_cache_count = 0;
+
+	spin_unlock(&ifq->pgcache_lock);
+
+	return count;
+}
+
+static struct page *
+netgpu_get_cached_page(struct netgpu_ifq *ifq)
+{
+	struct netgpu_pgcache *cache = ifq->napi_cache;
+	struct page *page;
+	int count;
+
+	count = ifq->napi_cache_count;
+
+	if (!count) {
+		if (cache->next) {
+			kfree(ifq->spare_cache);
+			ifq->spare_cache = cache;
+			cache = cache->next;
+			count = NETGPU_CACHE_COUNT;
+			goto out;
+		}
+
+		/* lockless read of any count - if <= 0, skip */
+		count = READ_ONCE(ifq->any_cache_count);
+		if (count > 0) {
+			count = netgpu_swap_caches(ifq, &cache);
+			goto out;
+		}
+
+		return NULL;
+out:
+		ifq->napi_cache = cache;
+	}
+
+	page = cache->page[--count];
+	ifq->napi_cache_count = count;
+
+	return page;
+}
+
+/*
+ * Free cache structures.  Pages have already been released.
+ */
+static void
+netgpu_free_cache(struct netgpu_ifq *ifq)
+{
+	struct netgpu_pgcache *cache, *next;
+
+	kfree(ifq->spare_cache);
+
+	for (cache = ifq->napi_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+
+	for (cache = ifq->any_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+}
+
+/*
+ * Called from iov_iter when addr is provided for TX.
+ */
+int
+netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count)
+{
+	struct netgpu_dmamap *map;
+	struct netgpu_skq *skq;
+
+	skq = sk->sk_user_data;
+	if (!skq)
+		return -EEXIST;
+
+	map = xa_load(&skq->ctx->xa, addr >> PAGE_SHIFT);
+	if (!map)
+		return -EINVAL;
+
+	return INDIRECT_CALL_1(map->get_pages, netgpu_host_get_pages,
+			       map->r, pages, addr, count);
+}
+
+static int
+netgpu_get_fill_page(struct netgpu_ifq *ifq, dma_addr_t *dma,
+		     struct page **page)
+{
+	struct netgpu_dmamap *map;
+	u64 *addrp, addr;
+	int err;
+
+	addrp = sq_cons_peek(&ifq->fill);
+	if (!addrp)
+		return -ENOMEM;
+
+	addr = READ_ONCE(*addrp);
+
+	map = xa_load(&ifq->ctx->xa, addr >> PAGE_SHIFT);
+	if (!map)
+		return -EINVAL;
+
+	err = INDIRECT_CALL_1(map->get_page, netgpu_host_get_page,
+			      map, addr, page, dma);
+
+	if (!err)
+		sq_cons_advance(&ifq->fill);
+
+	return err;
+}
+
+dma_addr_t
+netgpu_get_dma(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_dmamap *map;
+	unsigned long addr;
+
+	addr = page_private(page);
+	map = xa_load(&ctx->xa, addr >> PAGE_SHIFT);
+
+	return INDIRECT_CALL_1(map->get_dma, netgpu_host_get_dma,
+			       map, addr);
+}
+MAYBE_EXPORT_SYMBOL(netgpu_get_dma);
+
+int
+netgpu_get_page(struct netgpu_ifq *ifq, struct page **page, dma_addr_t *dma)
+{
+	*page = netgpu_get_cached_page(ifq);
+	if (*page) {
+		get_page(*page);
+		*dma = netgpu_get_dma(ifq->ctx, *page);
+		return 0;
+	}
+
+	return netgpu_get_fill_page(ifq, dma, page);
+}
+MAYBE_EXPORT_SYMBOL(netgpu_get_page);
+
+static int
+netgpu_shared_queue_validate(struct netgpu_user_queue *u, unsigned elt_size,
+			     unsigned map_off)
+{
+	struct netgpu_queue_map *map;
+	unsigned count;
+	size_t size;
+
+	if (u->elt_sz != elt_size)
+		return -EINVAL;
+
+	count = roundup_pow_of_two(u->entries);
+	if (!count)
+		return -EINVAL;
+	u->entries = count;
+	u->mask = count - 1;
+	u->map_off = map_off;
+
+	size = struct_size(map, data, count * elt_size);
+	if (size == SIZE_MAX || size > U32_MAX)
+		return -EOVERFLOW;
+	u->map_sz = size;
+
+	return 0;
+}
+
+static void
+netgpu_shared_queue_free(struct shared_queue *q)
+{
+	free_pages((uintptr_t)q->map_ptr, get_order(q->map_sz));
+}
+
+static int
+netgpu_shared_queue_create(struct shared_queue *q, struct netgpu_user_queue *u)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+			  __GFP_COMP | __GFP_NORETRY;
+	struct netgpu_queue_map *map;
+
+	map = (void *)__get_free_pages(gfp_flags, get_order(u->map_sz));
+	if (!map)
+		return -ENOMEM;
+
+	q->map_ptr = map;
+	q->prod = &map->prod;
+	q->cons = &map->cons;
+	q->data = &map->data[0];
+	q->elt_sz = u->elt_sz;
+	q->mask = u->mask;
+	q->entries = u->entries;
+	q->map_sz = u->map_sz;
+
+	memset(&u->off, 0, sizeof(u->off));
+	u->off.prod = offsetof(struct netgpu_queue_map, prod);
+	u->off.cons = offsetof(struct netgpu_queue_map, cons);
+	u->off.data = offsetof(struct netgpu_queue_map, data);
+
+	return 0;
+}
+
+static int
+__netgpu_queue_mgmt(struct net_device *dev, struct netgpu_ifq *ifq,
+		    u32 *queue_id)
+{
+	struct netdev_bpf cmd;
+	bpf_op_t ndo_bpf;
+	int err;
+
+	ndo_bpf = dev->netdev_ops->ndo_bpf;
+	if (!ndo_bpf)
+		return -EINVAL;
+
+	cmd.command = XDP_SETUP_NETGPU;
+	cmd.netgpu.ifq = ifq;
+	cmd.netgpu.queue_id = *queue_id;
+
+	err = ndo_bpf(dev, &cmd);
+	if (!err)
+		*queue_id = cmd.netgpu.queue_id;
+
+	return err;
+}
+
+static int
+netgpu_open_queue(struct netgpu_ifq *ifq, u32 *queue_id)
+{
+	return __netgpu_queue_mgmt(ifq->ctx->dev, ifq, queue_id);
+}
+
+static int
+netgpu_close_queue(struct netgpu_ifq *ifq, u32 queue_id)
+{
+	return __netgpu_queue_mgmt(ifq->ctx->dev, NULL, &queue_id);
+}
+
+static int
+netgpu_mmap(void *priv, struct vm_area_struct *vma,
+	    void *(*validate_request)(void *priv, loff_t, size_t))
+{
+	size_t sz = vma->vm_end - vma->vm_start;
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = validate_request(priv, vma->vm_pgoff, sz);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+static void *
+netgpu_validate_ifq_mmap_request(void *priv, loff_t pgoff, size_t sz)
+{
+	struct netgpu_ifq *ifq = priv;
+	struct page *page;
+	void *ptr;
+
+	/* each returned ptr is a separate allocation. */
+	switch (pgoff << PAGE_SHIFT) {
+	case NETGPU_OFF_FILL_ID:
+		ptr = ifq->fill.map_ptr;
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > page_size(page))
+		return ERR_PTR(-EINVAL);
+
+	return ptr;
+}
+
+static int
+netgpu_ifq_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	return netgpu_mmap(file->private_data, vma,
+			   netgpu_validate_ifq_mmap_request);
+}
+
+static void
+netgpu_free_ifq(struct netgpu_ifq *ifq)
+{
+	/* assume ifq has been released from ifq list */
+	if (ifq->queue_id != -1)
+		netgpu_close_queue(ifq, ifq->queue_id);
+	netgpu_shared_queue_free(&ifq->fill);
+	netgpu_free_cache(ifq);
+	kfree(ifq);
+}
+
+static int
+netgpu_ifq_release(struct inode *inode, struct file *file)
+{
+	struct netgpu_ifq *ifq = file->private_data;
+	struct netgpu_ctx *ctx = ifq->ctx;
+
+	/* CTX LOCKING */
+	list_del(&ifq->ifq_node);
+	netgpu_free_ifq(ifq);
+
+	netgpu_free_ctx(ctx);
+	return 0;
+}
+
+#if 0
+static int
+netgpu_ifq_wakeup(struct netgpu_ifq *ifq)
+{
+	struct net_device *dev = ifq->ctx->dev;
+	int err;
+
+	rcu_read_lock();
+	err = dev->netdev_ops->ndo_xsk_wakeup(dev, ifq->queue_id, flags);
+	rcu_read_unlock();
+
+	return err;
+}
+#endif
+
+static __poll_t
+netgpu_ifq_poll(struct file *file, poll_table *wait)
+{
+	struct netgpu_ifq *ifq = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ifq->fill_wait, wait);
+
+	if (sq_prod_space(&ifq->fill))
+		mask = EPOLLOUT | EPOLLWRNORM;
+
+#if 0
+	if (driver is asleep because fq is/was empty)
+		netgpu_ifq_wakeup(ifq);
+#endif
+
+	return mask;
+}
+
+static const struct file_operations netgpu_ifq_fops = {
+	.owner =		THIS_MODULE,
+	.mmap =			netgpu_ifq_mmap,
+	.poll =			netgpu_ifq_poll,
+	.release =		netgpu_ifq_release,
+};
+
+static int
+netgpu_create_fd(struct netgpu_ifq *ifq, struct file **filep)
+{
+	struct file *file;
+	unsigned flags;
+	int fd;
+
+	flags = O_RDWR | O_CLOEXEC;
+	fd = get_unused_fd_flags(flags);
+	if (fd < 0)
+		return fd;
+
+	file = anon_inode_getfile("[netgpu]", &netgpu_ifq_fops, ifq, flags);
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		return PTR_ERR(file);
+	}
+
+	*filep = file;
+	return fd;
+}
+
+static struct netgpu_ifq *
+netgpu_alloc_ifq(void)
+{
+	struct netgpu_ifq *ifq;
+	size_t sz;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	sz = struct_size(ifq->napi_cache, page, NETGPU_CACHE_COUNT);
+	ifq->napi_cache = kmalloc(sz, GFP_KERNEL);
+	if (!ifq->napi_cache)
+		goto out;
+	ifq->napi_cache->next = NULL;
+
+	ifq->queue_id = -1;
+	ifq->any_cache_count = -1;
+	spin_lock_init(&ifq->pgcache_lock);
+
+	return ifq;
+
+out:
+	kfree(ifq->napi_cache);
+	kfree(ifq);
+
+	return NULL;
+}
+
+static int
+netgpu_bind_queue(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct netgpu_ifq_param p;
+	struct file *file = NULL;
+	struct netgpu_ifq *ifq;
+	int err;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	if (p.resv != 0)
+		return -EINVAL;
+
+	if (p.queue_id != -1) {
+	        list_for_each_entry(ifq, &ctx->ifq_list, ifq_node)
+			if (ifq->queue_id == p.queue_id)
+				return -EALREADY;
+	}
+
+	err = netgpu_shared_queue_validate(&p.fill, sizeof(u64),
+					   NETGPU_OFF_FILL_ID);
+	if (err)
+		return err;
+
+	ifq = netgpu_alloc_ifq();
+	if (!ifq)
+		return -ENOMEM;
+	ifq->ctx = ctx;
+
+	err = netgpu_shared_queue_create(&ifq->fill, &p.fill);
+	if (err)
+		goto out;
+
+	err = netgpu_open_queue(ifq, &p.queue_id);
+	if (err)
+		goto out;
+	ifq->queue_id = p.queue_id;
+
+	p.ifq_fd = netgpu_create_fd(ifq, &file);
+	if (p.ifq_fd < 0) {
+		err = p.ifq_fd;
+		goto out;
+	}
+
+	if (copy_to_user(arg, &p, sizeof(p))) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	fd_install(p.ifq_fd, file);
+	list_add(&ifq->ifq_node, &ctx->ifq_list);
+	refcount_inc(&ctx->ref);
+
+	return 0;
+
+out:
+	if (file) {
+		fput(file);
+		put_unused_fd(p.ifq_fd);
+	}
+	netgpu_free_ifq(ifq);
+
+	return err;
+}
+
+static bool
+netgpu_region_overlap(struct netgpu_ctx *ctx, struct netgpu_dmamap *map)
+{
+	unsigned long index, last;
+
+	index = map->start >> PAGE_SHIFT;
+	last = index + map->nr_pages - 1;
+
+	return xa_find(&ctx->xa, &index, last, XA_PRESENT) != NULL;
+}
+
+struct netgpu_dmamap *
+netgpu_ctx_detach_region(struct netgpu_ctx *ctx, struct netgpu_region *r)
+{
+	struct netgpu_dmamap *map;
+	unsigned long start;
+
+	start = r->start >> PAGE_SHIFT;
+	map = xa_load(&ctx->xa, start);
+	xa_store_range(&ctx->xa, start, start + r->nr_pages - 1,
+		       NULL, GFP_KERNEL);
+
+	return map;
+}
+
+static int
+netgpu_attach_region(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct netgpu_attach_param p;
+	struct netgpu_dmamap *map;
+	struct netgpu_mem *mem;
+	unsigned long start;
+	struct fd f;
+	int err;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	f = fdget(p.mem_fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (f.file->f_op != &netgpu_mem_fops) {
+		fdput(f);
+		return -EOPNOTSUPP;
+	}
+
+	mem = f.file->private_data;
+	map = netgpu_mem_attach_ctx(mem, p.mem_idx, ctx);
+	if (IS_ERR(map)) {
+		fdput(f);
+		return PTR_ERR(map);
+	}
+
+	/* XXX "should not happen", validate anyway */
+	if (netgpu_region_overlap(ctx, map)) {
+		netgpu_map_detach_ctx(map, ctx);
+		return -EEXIST;
+	}
+
+	start = map->start >> PAGE_SHIFT;
+	err = xa_err(xa_store_range(&ctx->xa, start, start + map->nr_pages - 1,
+				    map, GFP_KERNEL));
+	if (err)
+		netgpu_map_detach_ctx(map, ctx);
+
+	return err;
+}
+
+static int
+netgpu_attach_dev(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct net_device *dev;
+	int ifindex;
+	int err;
+
+	if (copy_from_user(&ifindex, arg, sizeof(ifindex)))
+		return -EFAULT;
+
+	dev = dev_get_by_index(&init_net, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	if (ctx->dev) {
+		err = dev == ctx->dev ? 0 : -EALREADY;
+		dev_put(dev);
+		return err;
+	}
+
+	ctx->dev = dev;
+
+	return 0;
+}
+
+static struct netgpu_ctx *
+netgpu_file_to_ctx(struct file *file)
+{
+	return file->private_data;
+}
+
+static long
+netgpu_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+
+	switch (cmd) {
+	case NETGPU_CTX_IOCTL_ATTACH_DEV:
+		return netgpu_attach_dev(ctx, (void __user *)arg);
+
+	case NETGPU_CTX_IOCTL_BIND_QUEUE:
+		return netgpu_bind_queue(ctx, (void __user *)arg);
+
+	case NETGPU_CTX_IOCTL_ATTACH_REGION:
+		return netgpu_attach_region(ctx, (void __user *)arg);
+	}
+	return -ENOTTY;
+}
+
+static void
+__netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	struct netgpu_dmamap *map;
+	unsigned long index;
+
+	xa_for_each(&ctx->xa, index, map) {
+		index = (map->start >> PAGE_SHIFT) + map->nr_pages - 1;
+		netgpu_map_detach_ctx(map, ctx);
+	}
+
+	xa_destroy(&ctx->xa);
+
+	if (ctx->dev)
+		dev_put(ctx->dev);
+	kfree(ctx);
+
+	module_put(THIS_MODULE);
+}
+
+static void
+netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		__netgpu_free_ctx(ctx);
+}
+
+static int
+netgpu_release(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+
+	netgpu_free_ctx(ctx);
+	return 0;
+}
+
+static struct netgpu_ctx *
+netgpu_alloc_ctx(void)
+{
+	struct netgpu_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	xa_init(&ctx->xa);
+	refcount_set(&ctx->ref, 1);
+	INIT_LIST_HEAD(&ctx->ifq_list);
+
+	return ctx;
+}
+
+static int
+netgpu_open(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx;
+
+	ctx = netgpu_alloc_ctx();
+	if (!ctx)
+		return -ENOMEM;
+
+	file->private_data = ctx;
+
+	__module_get(THIS_MODULE);
+
+	return 0;
+}
+
+static const struct file_operations netgpu_fops = {
+	.owner =		THIS_MODULE,
+	.open =			netgpu_open,
+	.unlocked_ioctl =	netgpu_ioctl,
+	.release =		netgpu_release,
+};
+
+static struct miscdevice netgpu_dev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "netgpu",
+	.fops		= &netgpu_fops,
+};
+
+/* Our version of __skb_datagram_iter */
+static int
+netgpu_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+		unsigned int offset, size_t len)
+{
+	struct netgpu_skq *skq = desc->arg.data;
+	struct sk_buff *frag_iter;
+	struct iovec *iov;
+	struct page *page;
+	unsigned start;
+	int i, used;
+	u64 addr;
+
+	if (skb_headlen(skb)) {
+		WARN_ONCE(1, "zc socket receiving non-zc data");
+		return -EFAULT;
+	}
+
+	used = 0;
+	start = 0;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag;
+		int end, off, frag_len;
+
+		frag = &skb_shinfo(skb)->frags[i];
+		frag_len = skb_frag_size(frag);
+
+		end = start + frag_len;
+		if (offset < end) {
+			off = offset - start;
+
+			iov = sq_prod_reserve(&skq->rx);
+			if (!iov)
+				break;
+
+			page = skb_frag_page(frag);
+			addr = (u64)page_private(page) + off;
+
+			iov->iov_base = (void *)(addr + skb_frag_off(frag));
+			iov->iov_len = frag_len - off;
+
+			used += (frag_len - off);
+			offset += (frag_len - off);
+
+			put_page(page);
+			__skb_frag_set_page(frag, NULL);
+		}
+		start = end;
+	}
+
+	if (used)
+		sq_prod_submit(&skq->rx);
+
+	skb_walk_frags(skb, frag_iter) {
+		int end, off, ret;
+
+		end = start + frag_iter->len;
+		if (offset < end) {
+			off = offset - start;
+			len = frag_iter->len - off;
+
+			ret = netgpu_recv_skb(desc, frag_iter, off, len);
+			if (ret < 0) {
+				if (!used)
+					used = ret;
+				goto out;
+			}
+			used += ret;
+			if (ret < len)
+				goto out;
+			offset += ret;
+		}
+		start = end;
+	}
+
+out:
+	return used;
+}
+
+static void
+netgpu_read_sock(struct sock *sk, struct netgpu_skq *skq)
+{
+	read_descriptor_t desc;
+	int used;
+
+	desc.arg.data = skq;
+	desc.count = 1;
+	used = tcp_read_sock(sk, &desc, netgpu_recv_skb);
+}
+
+static void
+netgpu_data_ready(struct sock *sk)
+{
+	struct netgpu_skq *skq = sk->sk_user_data;
+
+	if (skq->rx.entries)
+		netgpu_read_sock(sk, skq);
+
+	skq->sk_data_ready(sk);
+}
+
+static bool
+netgpu_stream_memory_read(const struct sock *sk)
+{
+	struct netgpu_skq *skq = sk->sk_user_data;
+
+	return !sq_is_empty(&skq->rx);
+}
+
+static void *
+netgpu_validate_skq_mmap_request(void *priv, loff_t pgoff, size_t sz)
+{
+	struct netgpu_skq *skq = priv;
+	struct page *page;
+	void *ptr;
+
+	/* each returned ptr is a separate allocation. */
+	switch (pgoff << PAGE_SHIFT) {
+	case NETGPU_OFF_RX_ID:
+		ptr = skq->rx.map_ptr;
+		break;
+	case NETGPU_OFF_CQ_ID:
+		ptr = skq->cq.map_ptr;
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > page_size(page))
+		return ERR_PTR(-EINVAL);
+
+	return ptr;
+}
+
+int
+netgpu_socket_mmap(struct file *file, struct socket *sock,
+		struct vm_area_struct *vma)
+{
+	struct sock *sk;
+
+	sk = sock->sk;
+	if (!sk || !sk->sk_user_data)
+		return -EINVAL;
+
+	return netgpu_mmap(sk->sk_user_data, vma,
+			   netgpu_validate_skq_mmap_request);
+}
+
+static void
+netgpu_release_sk(struct sock *sk, struct netgpu_skq *skq)
+{
+	struct netgpu_sock_match *m;
+
+	m = container_of(sk->sk_prot, struct netgpu_sock_match, prot);
+
+	sk->sk_destruct = skq->sk_destruct;
+	sk->sk_data_ready = skq->sk_data_ready;
+	sk->sk_prot = m->base_prot;
+	sk->sk_user_data = NULL;
+
+	/* XXX reclaim and recycle pending data? */
+	netgpu_shared_queue_free(&skq->rx);
+	netgpu_shared_queue_free(&skq->cq);
+	kfree(skq);
+}
+
+static void
+netgpu_skq_destruct(struct sock *sk)
+{
+	struct netgpu_skq *skq = sk->sk_user_data;
+	struct netgpu_ctx *ctx = skq->ctx;
+
+	netgpu_release_sk(sk, skq);
+
+	if (sk->sk_destruct)
+		sk->sk_destruct(sk);
+
+	netgpu_free_ctx(ctx);
+}
+
+static struct netgpu_skq *
+netgpu_create_skq(struct netgpu_socket_param *p)
+{
+	struct netgpu_skq *skq;
+	int err;
+
+	skq = kzalloc(sizeof(*skq), GFP_KERNEL);
+	if (!skq)
+		return ERR_PTR(-ENOMEM);
+
+	err = netgpu_shared_queue_create(&skq->rx, &p->rx);
+	if (err)
+		goto out;
+
+	err = netgpu_shared_queue_create(&skq->cq, &p->cq);
+	if (err)
+		goto out;
+
+	return skq;
+
+out:
+	netgpu_shared_queue_free(&skq->rx);
+	netgpu_shared_queue_free(&skq->cq);
+	kfree(skq);
+
+	return ERR_PTR(err);
+}
+
+static void
+netgpu_rebuild_match(struct netgpu_sock_match *m, struct sock *sk)
+{
+	mutex_lock(&netgpu_lock);
+
+	if (m->initialized)
+		goto out;
+
+	m->base_ops = sk->sk_socket->ops;
+	m->base_prot = sk->sk_prot;
+
+	m->ops = *m->base_ops;
+	m->prot = *m->base_prot;
+
+	/* XXX need UDP specific vector here */
+	m->prot.stream_memory_read = netgpu_stream_memory_read;
+	m->ops.mmap = netgpu_socket_mmap;
+
+	smp_wmb();
+	m->initialized = 1;
+
+out:
+	mutex_unlock(&netgpu_lock);
+}
+
+static int
+netgpu_match_socket(struct sock *sk)
+{
+	struct netgpu_sock_match *m;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(netgpu_match_tbl); i++) {
+		m = &netgpu_match_tbl[i];
+
+		if (m->family != sk->sk_family ||
+		    m->type != sk->sk_type ||
+		    m->protocol != sk->sk_protocol)
+			continue;
+
+		if (!m->initialized)
+			netgpu_rebuild_match(m, sk);
+
+		if (m->base_prot != sk->sk_prot)
+			return -EPROTO;
+
+		if (m->base_ops != sk->sk_socket->ops)
+			return -EPROTO;
+
+		return i;
+	}
+	return -EOPNOTSUPP;
+}
+
+int
+netgpu_attach_socket(struct sock *sk, void __user *arg)
+{
+	struct netgpu_socket_param p;
+	struct netgpu_ctx *ctx;
+	struct netgpu_skq *skq;
+	struct fd f;
+	int id, err;
+
+	if (sk->sk_user_data)
+		return -EALREADY;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	if (p.resv != 0)
+		return -EINVAL;
+
+	err = netgpu_shared_queue_validate(&p.rx, sizeof(struct iovec),
+					   NETGPU_OFF_RX_ID);
+	if (err)
+		return err;
+
+	err = netgpu_shared_queue_validate(&p.cq, sizeof(u64),
+					   NETGPU_OFF_CQ_ID);
+	if (err)
+		return err;
+
+	id = netgpu_match_socket(sk);
+	if (id < 0)
+		return id;
+
+	f = fdget(p.ctx_fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (f.file->f_op != &netgpu_fops) {
+		fdput(f);
+		return -EOPNOTSUPP;
+	}
+
+	skq = netgpu_create_skq(&p);
+	if (IS_ERR(skq)) {
+		fdput(f);
+		return PTR_ERR(skq);
+	}
+
+	ctx = netgpu_file_to_ctx(f.file);
+	refcount_inc(&ctx->ref);
+	skq->ctx = ctx;
+	fdput(f);
+
+	skq->sk_destruct = sk->sk_destruct;
+	skq->sk_data_ready = sk->sk_data_ready;
+
+	sk->sk_destruct = netgpu_skq_destruct;
+	sk->sk_data_ready = netgpu_data_ready;
+	sk->sk_prot = &netgpu_match_tbl[id].prot;
+	sk->sk_socket->ops = &netgpu_match_tbl[id].ops;
+
+	sk->sk_user_data = skq;
+
+	if (copy_to_user(arg, &p, sizeof(p))) {
+		netgpu_release_sk(sk, skq);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+#if IS_MODULE(CONFIG_NETGPU)
+#include "netgpu_stub.h"
+static struct netgpu_functions netgpu_fcn = {
+	.get_dma	= netgpu_get_dma,
+	.get_page	= netgpu_get_page,
+	.put_page	= netgpu_put_page,
+	.get_pages	= netgpu_get_pages,
+	.socket_mmap	= netgpu_socket_mmap,
+	.attach_socket	= netgpu_attach_socket,
+};
+#else
+#define netgpu_fcn_register(x)
+#define netgpu_fcn_unregister()
+#endif
+
+static int __init
+netgpu_init(void)
+{
+	misc_register(&netgpu_dev);
+	misc_register(&netgpu_mem_dev);
+	netgpu_fcn_register(&netgpu_fcn);
+
+	return 0;
+}
+
+static void __exit
+netgpu_fini(void)
+{
+	misc_deregister(&netgpu_dev);
+	misc_deregister(&netgpu_mem_dev);
+	netgpu_fcn_unregister();
+}
+
+module_init(netgpu_init);
+module_exit(netgpu_fini);
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/misc/netgpu/netgpu_mem.c b/drivers/misc/netgpu/netgpu_mem.c
new file mode 100644
index 000000000000..184bf77e838c
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_mem.c
@@ -0,0 +1,351 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+
+#include "netgpu_priv.h"
+
+static struct netgpu_ops *netgpu_ops[MEMTYPE_MAX] = {
+	[MEMTYPE_HOST]	= &host_ops,
+};
+static const char *netgpu_name[] = {
+	[MEMTYPE_HOST]	= "host",
+	[MEMTYPE_CUDA]	= "cuda",
+};
+static DEFINE_SPINLOCK(netgpu_lock);
+
+int
+netgpu_register(struct netgpu_ops *ops)
+{
+	int err;
+
+	if (ops->memtype >= MEMTYPE_MAX)
+		return -EBADR;
+
+	err = -EEXIST;
+	spin_lock(&netgpu_lock);
+	if (!rcu_dereference_protected(netgpu_ops[ops->memtype],
+				       lockdep_is_held(&netgpu_lock))) {
+		rcu_assign_pointer(netgpu_ops[ops->memtype], ops);
+		err = 0;
+	}
+	spin_unlock(&netgpu_lock);
+
+	return err;
+}
+EXPORT_SYMBOL(netgpu_register);
+
+void
+netgpu_unregister(int memtype)
+{
+	BUG_ON(memtype < 0 || memtype >= MEMTYPE_MAX);
+
+	spin_lock(&netgpu_lock);
+	rcu_assign_pointer(netgpu_ops[memtype], NULL);
+	spin_unlock(&netgpu_lock);
+
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netgpu_unregister);
+
+static inline struct device *
+netdev2device(struct net_device *dev)
+{
+	return dev->dev.parent;			/* from SET_NETDEV_DEV() */
+}
+
+static struct netgpu_ctx_entry *
+__netgpu_region_find_ctx(struct netgpu_region *r, struct netgpu_ctx *ctx)
+{
+	struct netgpu_ctx_entry *ce;
+
+	list_for_each_entry(ce, &r->ctx_list, ctx_node)
+		if (ce->ctx == ctx)
+			return ce;
+	return NULL;
+}
+
+void
+netgpu_map_detach_ctx(struct netgpu_dmamap *map, struct netgpu_ctx *ctx)
+{
+	struct netgpu_region *r = map->r;
+	struct netgpu_ctx_entry *ce;
+	bool unmap;
+
+	spin_lock(&r->lock);
+
+	ce = __netgpu_region_find_ctx(r, ctx);
+	list_del(&ce->ctx_node);
+
+	unmap = refcount_dec_and_test(&map->ref);
+	if (unmap)
+		list_del(&map->dma_node);
+
+	spin_unlock(&r->lock);
+
+	if (unmap) {
+		r->ops->unmap_region(map);
+		kvfree(map);
+	}
+
+	kfree(ce);
+	fput(r->mem->file);
+}
+
+static struct netgpu_dmamap *
+__netgpu_region_find_device(struct netgpu_region *r, struct device *device)
+{
+	struct netgpu_dmamap *map;
+
+	list_for_each_entry(map, &r->dma_list, dma_node)
+		if (map->device == device) {
+			refcount_inc(&map->ref);
+			return map;
+		}
+	return NULL;
+}
+
+static struct netgpu_region *
+__netgpu_mem_find_region(struct netgpu_mem *mem, int idx)
+{
+	struct netgpu_region *r;
+
+	list_for_each_entry(r, &mem->region_list, mem_node)
+		if (r->index == idx)
+			return r;
+	return NULL;
+}
+
+struct netgpu_dmamap *
+netgpu_mem_attach_ctx(struct netgpu_mem *mem, int idx, struct netgpu_ctx *ctx)
+{
+	struct netgpu_ctx_entry *ce;
+	struct netgpu_dmamap *map;
+	struct netgpu_region *r;
+	struct device *device;
+
+	rcu_read_lock();
+	r = __netgpu_mem_find_region(mem, idx);
+	rcu_read_unlock();
+
+	if (!r)
+		return ERR_PTR(-ENOENT);
+
+	spin_lock(&r->lock);
+
+	ce = __netgpu_region_find_ctx(r, ctx);
+	if (ce) {
+		map = ERR_PTR(-EEXIST);
+		goto out_unlock;
+	}
+
+	ce = kmalloc(sizeof(*ce), GFP_KERNEL);
+	if (!ce) {
+		map = ERR_PTR(-ENOMEM);
+		goto out_unlock;
+	}
+
+	device = netdev2device(ctx->dev);
+	map = __netgpu_region_find_device(r, device);
+	if (!map) {
+		map = r->ops->map_region(r, device);
+		if (IS_ERR(map)) {
+			kfree(ce);
+			goto out_unlock;
+		}
+
+		map->r = r;
+		map->start = r->start;
+		map->device = device;
+		map->nr_pages = r->nr_pages;
+		map->get_dma = r->ops->get_dma;
+		map->get_page = r->ops->get_page;
+		map->get_pages = r->ops->get_pages;
+
+		refcount_set(&map->ref, 1);
+
+		list_add(&map->dma_node, &r->dma_list);
+	}
+
+	ce->ctx = ctx;
+	list_add(&ce->ctx_node, &r->ctx_list);
+	get_file(mem->file);
+
+out_unlock:
+	spin_unlock(&r->lock);
+	return map;
+}
+
+static void
+netgpu_mem_free_region(struct netgpu_mem *mem, struct netgpu_region *r)
+{
+	struct netgpu_ops *ops = r->ops;
+
+	WARN_ONCE(!list_empty(&r->ctx_list), "context list not empty!");
+	WARN_ONCE(!list_empty(&r->dma_list), "DMA list not empty!");
+
+	/* removes page mappings, frees r */
+	ops->free_region(mem, r);
+	module_put(ops->owner);
+}
+
+/* region overlaps will fail due to PagePrivate bit */
+static int
+netgpu_mem_add_region(struct netgpu_mem *mem, void __user *arg)
+{
+	struct netgpu_region_param p;
+	struct netgpu_region *r;
+	struct netgpu_ops *ops;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	if (p.memtype < 0 || p.memtype >= MEMTYPE_MAX)
+		return -ENXIO;
+
+#ifdef CONFIG_MODULES
+	if (!rcu_access_pointer(netgpu_ops[p.memtype]))
+		request_module("netgpu_%s", netgpu_name[p.memtype]);
+#endif
+
+	rcu_read_lock();
+	ops = rcu_dereference(netgpu_ops[p.memtype]);
+	if (!ops || !try_module_get(ops->owner)) {
+		rcu_read_unlock();
+		return -ENXIO;
+	}
+	rcu_read_unlock();
+
+	r = ops->add_region(mem, &p.iov);
+	if (IS_ERR(r)) {
+		module_put(ops->owner);
+		return PTR_ERR(r);
+	}
+
+	r->ops = ops;
+
+	mutex_lock(&mem->lock);
+	r->index = ++mem->index_generator;
+	list_add_rcu(&r->mem_node, &mem->region_list);
+	mutex_unlock(&mem->lock);
+
+	return r->index;
+}
+
+/* This function is called from the nvidia callback, ick. */
+void
+netgpu_detach_region(struct netgpu_region *r)
+{
+	struct netgpu_mem *mem = r->mem;
+	struct netgpu_ctx_entry *ce, *tmp;
+	struct netgpu_dmamap *map;
+
+	mutex_lock(&mem->lock);
+	list_del(&r->mem_node);
+	mutex_unlock(&mem->lock);
+
+	spin_lock(&r->lock);
+
+	list_for_each_entry_safe(ce, tmp, &r->ctx_list, ctx_node) {
+		list_del(&ce->ctx_node);
+		map = netgpu_ctx_detach_region(ce->ctx, r);
+
+		if (refcount_dec_and_test(&map->ref)) {
+			list_del(&map->dma_node);
+			r->ops->unmap_region(map);
+			kvfree(map);
+		}
+
+		kfree(ce);
+		fput(r->mem->file);
+	}
+
+	spin_unlock(&r->lock);
+	netgpu_mem_free_region(mem, r);
+
+	/* XXX nvidia bug - keeps extra file reference?? */
+	fput(mem->file);
+}
+EXPORT_SYMBOL(netgpu_detach_region);
+
+static long
+netgpu_mem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct netgpu_mem *mem = file->private_data;
+
+	switch (cmd) {
+	case NETGPU_MEM_IOCTL_ADD_REGION:
+		return netgpu_mem_add_region(mem, (void __user *)arg);
+	}
+	return -ENOTTY;
+}
+
+static void
+__netgpu_free_mem(struct netgpu_mem *mem)
+{
+	struct netgpu_region *r, *tmp;
+
+	/* no lock needed - no refs at this point */
+	list_for_each_entry_safe(r, tmp, &mem->region_list, mem_node)
+		netgpu_mem_free_region(mem, r);
+
+	free_uid(mem->user);
+	kfree(mem);
+}
+
+static int
+netgpu_mem_release(struct inode *inode, struct file *file)
+{
+	struct netgpu_mem *mem = file->private_data;
+
+	__netgpu_free_mem(mem);
+
+	module_put(THIS_MODULE);
+
+	return 0;
+}
+
+static int
+netgpu_mem_open(struct inode *inode, struct file *file)
+{
+	struct netgpu_mem *mem;
+
+	mem = kmalloc(sizeof(*mem), GFP_KERNEL);
+	if (!mem)
+		return -ENOMEM;
+
+	mem->account_mem = !capable(CAP_IPC_LOCK);
+	mem->user = get_uid(current_user());
+	mem->file = file;
+	mem->index_generator = 0;
+	mutex_init(&mem->lock);
+	INIT_LIST_HEAD(&mem->region_list);
+
+	file->private_data = mem;
+
+	__module_get(THIS_MODULE);
+
+	return 0;
+}
+
+const struct file_operations netgpu_mem_fops = {
+	.owner =		THIS_MODULE,
+	.open =			netgpu_mem_open,
+	.unlocked_ioctl =	netgpu_mem_ioctl,
+	.release =		netgpu_mem_release,
+};
+
+struct miscdevice netgpu_mem_dev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "netgpu_mem",
+	.fops		= &netgpu_mem_fops,
+};
diff --git a/drivers/misc/netgpu/netgpu_priv.h b/drivers/misc/netgpu/netgpu_priv.h
new file mode 100644
index 000000000000..4dc9941767cb
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_priv.h
@@ -0,0 +1,88 @@
+#ifndef _NETGPU_PRIV_H
+#define _NETGPU_PRIV_H
+
+struct netgpu_queue_map {
+	unsigned prod ____cacheline_aligned_in_smp;
+	unsigned cons ____cacheline_aligned_in_smp;
+	unsigned char data[] ____cacheline_aligned_in_smp;
+};
+
+struct netgpu_dmamap {
+	struct list_head dma_node;		/* dma map of region */
+	struct netgpu_region *r;		/* owning region */
+	struct device *device;			/* device map is for */
+	refcount_t ref;				/* ctxs holding this map */
+
+	unsigned long start;			/* copies from region */
+	unsigned long nr_pages;
+	dma_addr_t
+		(*get_dma)(struct netgpu_dmamap *map, unsigned long addr);
+	int	(*get_page)(struct netgpu_dmamap *map, unsigned long addr,
+			    struct page **page, dma_addr_t *dma);
+	int	(*get_pages)(struct netgpu_region *r, struct page **pages,
+			     unsigned long addr, int count);
+};
+
+struct netgpu_ctx;
+
+struct netgpu_ctx_entry {
+	struct list_head ctx_node;
+	struct netgpu_ctx *ctx;
+};
+
+struct netgpu_region {
+	struct list_head dma_list;		/* dma mappings of region */
+	struct list_head ctx_list;		/* contexts using region */
+	struct list_head mem_node;		/* mem area owning region */
+	struct netgpu_mem *mem;
+	struct netgpu_ops *ops;
+	unsigned long start;
+	unsigned long nr_pages;
+	int index;				/* unique per mem */
+	spinlock_t lock;
+};
+
+/* assign the id on creation, just bump counter and match. */
+struct netgpu_mem {
+	struct file *file;
+	struct mutex lock;
+	struct user_struct *user;
+	int index_generator;
+	unsigned account_mem : 1;
+	struct list_head region_list;
+};
+
+struct netgpu_ops {
+	int	memtype;
+	struct module *owner;
+
+	struct netgpu_region *
+		(*add_region)(struct netgpu_mem *, const struct iovec *);
+	void	(*free_region)(struct netgpu_mem *, struct netgpu_region *);
+
+	struct netgpu_dmamap *
+		(*map_region)(struct netgpu_region *, struct device *);
+	void	(*unmap_region)(struct netgpu_dmamap *);
+
+	dma_addr_t
+		(*get_dma)(struct netgpu_dmamap *map, unsigned long addr);
+	int	(*get_page)(struct netgpu_dmamap *map, unsigned long addr,
+			    struct page **page, dma_addr_t *dma);
+	int	(*get_pages)(struct netgpu_region *r, struct page **pages,
+			     unsigned long addr, int count);
+};
+
+extern const struct file_operations netgpu_mem_fops;
+extern struct miscdevice netgpu_mem_dev;
+extern struct netgpu_ops host_ops;
+
+struct netgpu_dmamap *
+	netgpu_mem_attach_ctx(struct netgpu_mem *mem,
+			      int idx, struct netgpu_ctx *ctx);
+void netgpu_map_detach_ctx(struct netgpu_dmamap *map, struct netgpu_ctx *ctx);
+struct netgpu_dmamap *
+	netgpu_ctx_detach_region(struct netgpu_ctx *ctx,
+				 struct netgpu_region *r);
+void netgpu_detach_region(struct netgpu_region *r);
+
+#endif /* _NETGPU_PRIV_H */
diff --git a/drivers/misc/netgpu/netgpu_stub.c b/drivers/misc/netgpu/netgpu_stub.c
new file mode 100644
index 000000000000..112bca3dcd60
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_stub.c
@@ -0,0 +1,166 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+
+#include "netgpu_stub.h"
+
+static dma_addr_t
+netgpu_nop_get_dma(struct netgpu_ctx *ctx, struct page *page)
+{
+	return 0;
+}
+
+static int
+netgpu_nop_get_page(struct netgpu_ifq *ifq, struct page **page,
+		    dma_addr_t *dma)
+{
+	return -ENXIO;
+}
+
+static void
+netgpu_nop_put_page(struct netgpu_ifq *ifq, struct page *page, bool napi)
+{
+}
+
+static int
+netgpu_nop_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		     int count)
+{
+	return -ENXIO;
+}
+
+static int
+netgpu_nop_socket_mmap(struct file *file, struct socket *sock,
+		       struct vm_area_struct *vma)
+{
+	return -ENOIOCTLCMD;
+}
+
+static int
+netgpu_nop_attach_socket(struct sock *sk, void __user *arg)
+{
+	return -ENOIOCTLCMD;
+}
+
+static struct netgpu_functions netgpu_nop = {
+	.get_dma	= netgpu_nop_get_dma,
+	.get_page	= netgpu_nop_get_page,
+	.put_page	= netgpu_nop_put_page,
+	.get_pages	= netgpu_nop_get_pages,
+	.socket_mmap	= netgpu_nop_socket_mmap,
+	.attach_socket	= netgpu_nop_attach_socket,
+};
+
+static struct netgpu_functions *netgpu_fcn;
+static DEFINE_SPINLOCK(netgpu_fcn_lock);
+
+void
+netgpu_fcn_register(struct netgpu_functions *f)
+{
+	spin_lock(&netgpu_fcn_lock);
+	rcu_assign_pointer(netgpu_fcn, f);
+	spin_unlock(&netgpu_fcn_lock);
+
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netgpu_fcn_register);
+
+void
+netgpu_fcn_unregister(void)
+{
+	netgpu_fcn_register(&netgpu_nop);
+}
+EXPORT_SYMBOL(netgpu_fcn_unregister);
+
+dma_addr_t
+netgpu_get_dma(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_functions *f;
+	dma_addr_t dma;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	dma = f->get_dma(ctx, page);
+	rcu_read_unlock();
+
+	return dma;
+}
+EXPORT_SYMBOL(netgpu_get_dma);
+
+int
+netgpu_get_page(struct netgpu_ifq *ifq, struct page **page,
+		dma_addr_t *dma)
+{
+	struct netgpu_functions *f;
+	int err;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	err = f->get_page(ifq, page, dma);
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL(netgpu_get_page);
+
+void
+netgpu_put_page(struct netgpu_ifq *ifq, struct page *page, bool napi)
+{
+	struct netgpu_functions *f;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	f->put_page(ifq, page, napi);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(netgpu_put_page);
+
+int
+netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count)
+{
+	struct netgpu_functions *f;
+	int err;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	err = f->get_pages(sk, pages, addr, count);
+	rcu_read_unlock();
+
+	return err;
+}
+
+int
+netgpu_socket_mmap(struct file *file, struct socket *sock,
+		   struct vm_area_struct *vma)
+{
+	struct netgpu_functions *f;
+	int err;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	err = f->socket_mmap(file, sock, vma);
+	rcu_read_unlock();
+
+	return err;
+}
+
+int
+netgpu_attach_socket(struct sock *sk, void __user *arg)
+{
+	struct netgpu_functions *f;
+	int err;
+
+	rcu_read_lock();
+	f = rcu_dereference(netgpu_fcn);
+	err = f->attach_socket(sk, arg);
+	rcu_read_unlock();
+
+	return err;
+}
diff --git a/drivers/misc/netgpu/netgpu_stub.h b/drivers/misc/netgpu/netgpu_stub.h
new file mode 100644
index 000000000000..9b682d8ccf0c
--- /dev/null
+++ b/drivers/misc/netgpu/netgpu_stub.h
@@ -0,0 +1,19 @@
+#pragma once
+
+/* development-only support for module loading. */
+
+struct netgpu_functions {
+	dma_addr_t (*get_dma)(struct netgpu_ctx *ctx, struct page *page);
+	int (*get_page)(struct netgpu_ifq *ifq,
+			struct page **page, dma_addr_t *dma);
+	void (*put_page)(struct netgpu_ifq *, struct page *, bool);
+	int (*get_pages)(struct sock *, struct page **,
+			 unsigned long, int);
+
+	int (*socket_mmap)(struct file *file, struct socket *sock,
+			   struct vm_area_struct *vma);
+	int (*attach_socket)(struct sock *sk, void __user *arg);
+};
+
+void netgpu_fcn_register(struct netgpu_functions *f);
+void netgpu_fcn_unregister(void);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (9 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28 16:28   ` Greg KH
  2020-07-27 22:44 ` [RFC PATCH v2 12/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

netgpu pages will always have a refcount of at least one (held by
the netgpu module).  If the skb is marked as containing netgpu ZC
pages, recycle them back to netgpu.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/core/skbuff.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1422b99b7090..50dbb7ce1965 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -591,6 +591,27 @@ static void skb_free_head(struct sk_buff *skb)
 		kfree(head);
 }
 
+#if IS_ENABLED(CONFIG_NETGPU)
+static void skb_netgpu_unref(struct skb_shared_info *shinfo)
+{
+	struct netgpu_ifq *ifq = shinfo->destructor_arg;
+	struct page *page;
+	int i;
+
+	/* Pages attached to skbs for TX shouldn't come here, since those
+	 * skbs are not marked "zc_netgpu" (only RX skbs have this).
+	 * The dummy page does come here, but always has an elevated refcount.
+	 *
+	 * Undelivered zc skb's will arrive at this point.
+	 */
+	for (i = 0; i < shinfo->nr_frags; i++) {
+		page = skb_frag_page(&shinfo->frags[i]);
+		if (page && page_ref_dec_return(page) <= 2)
+			netgpu_put_page(ifq, page, false);
+	}
+}
+#endif
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
@@ -601,8 +622,15 @@ static void skb_release_data(struct sk_buff *skb)
 			      &shinfo->dataref))
 		return;
 
-	for (i = 0; i < shinfo->nr_frags; i++)
-		__skb_frag_unref(&shinfo->frags[i]);
+#if IS_ENABLED(CONFIG_NETGPU)
+	if (skb->zc_netgpu && shinfo->nr_frags) {
+		skb_netgpu_unref(shinfo);
+	} else
+#endif
+	{
+		for (i = 0; i < shinfo->nr_frags; i++)
+			__skb_frag_unref(&shinfo->frags[i]);
+	}
 
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 12/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (10 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu Jonathan Lemon
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

If a sock is marked as sending zc data, have the iterator
retrieve the correct zc pages from the netgpu module.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/uio.h |  4 ++++
 lib/iov_iter.c      | 53 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/datagram.c |  9 ++++++--
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9576fd8158d7..9d9a68e224b0 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -227,6 +227,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+struct sock;
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned maxpages, size_t *pgoff,
+			struct sock *sk);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bf538c2bec77..69457df64339 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,6 +10,9 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
+#include <net/netgpu.h>
+#include <net/sock.h>
+
 #define PIPE_PARANOIA /* for now */
 
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
@@ -1349,6 +1352,56 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
+#if IS_ENABLED(CONFIG_NETGPU)
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *pgoff,
+		struct sock *sk)
+{
+	const struct iovec *iov;
+	unsigned long addr;
+	struct iovec v;
+	size_t len;
+	unsigned n;
+	int ret;
+
+	if (!sk->sk_user_data)
+		return iov_iter_get_pages(i, pages, maxsize, maxpages, pgoff);
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+
+	if (!iter_is_iovec(i))
+		return -EFAULT;
+
+	if (iov_iter_rw(i) != WRITE)
+		return -EFAULT;
+
+	iterate_iovec(i, maxsize, v, iov, i->iov_offset, ({
+		addr = (unsigned long)v.iov_base;
+		*pgoff = addr & (PAGE_SIZE - 1);
+		len = v.iov_len + *pgoff;
+
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
+
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+
+		ret = netgpu_get_pages(sk, pages, addr, n);
+		if (ret > 0)
+			ret = (ret == n ? len : ret * PAGE_SIZE) - *pgoff;
+		return ret;
+	0;}));
+	return 0;
+}
+#else
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *pgoff,
+		struct sock *sk)
+{
+	return iov_iter_get_pages(i, pages, maxsize, maxpages, pgoff);
+}
+#endif
+
 static struct page **get_pages_array(size_t n)
 {
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 639745d4f3b9..d91f14dc56be 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -530,6 +530,10 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 			   struct iov_iter *to, int len)
 {
 	trace_skb_copy_datagram_iovec(skb, len);
+	if (skb->zc_netgpu) {
+		pr_err("skb netgpu datagram on !netgpu sk\n");
+		return -EFAULT;
+	}
 	return __skb_datagram_iter(skb, offset, to, len, false,
 			simple_copy_to_iter, NULL);
 }
@@ -631,8 +635,9 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, length,
-					    MAX_SKB_FRAGS - frag, &start);
+		copied = iov_iter_sk_get_pages(from, pages, length,
+					       MAX_SKB_FRAGS - frag, &start,
+					       sk);
 		if (copied < 0)
 			return -EFAULT;
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (11 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 12/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues Jonathan Lemon
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

The "header splitting" feature used by netgpu doesn't actually parse
the incoming packet header.  Instead, it splits the packet at a fixed
offset.  In order for this to work, the sender needs to send packets
with a fixed header size.
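
As a worked example (not part of the patch), the fixed split offset works
out as follows for IPv6 over Ethernet, with the TCP options padded out to
20 bytes; the constants match the TOTAL_HEADERS definition added to en.h
later in this series:

#include <assert.h>

#define MAC_HDR_LEN	14		/* Ethernet header */
#define IP6_HDRS_LEN	40		/* IPv6 header, no extension headers */
#define TCP_HDRS_LEN	(20 + 20)	/* base TCP header + padded options */
#define TOTAL_HEADERS	(TCP_HDRS_LEN + IP6_HDRS_LEN + MAC_HDR_LEN)

int main(void)
{
	/* every packet is split at byte 94; the payload starts there */
	assert(TOTAL_HEADERS == 94);
	return 0;
}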

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/ipv4/tcp_output.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d8f16f6a9b02..e8a74d0f7ad2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 pad_size;		/* additional nops for padding */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -562,6 +563,17 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	smc_options_write(ptr, &options);
 
 	mptcp_options_write(ptr, opts);
+
+#if IS_ENABLED(CONFIG_NETGPU)
+	/* pad out options */
+	if (opts->pad_size) {
+		int len = opts->pad_size;
+		u8 *p = (u8 *)ptr;
+
+		while (len--)
+			*p++ = TCPOPT_NOP;
+	}
+#endif
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
@@ -826,6 +838,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+#if IS_ENABLED(CONFIG_NETGPU)
+	/* force padding */
+	if (size < 20) {
+		opts->pad_size = 20 - size;
+		size += opts->pad_size;
+	}
+#endif
+
 	return size;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (12 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28  2:16   ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Netgpu delivers iovecs to userspace for incoming data, but the
destination queue must be attached to the socket.  Do this via
an ioctl call on the socket itself.
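
A rough userspace sketch of the attach step follows.  The uapi header
install path and the exact layout of struct netgpu_socket_param are
assumptions here; the field names (ctx_fd, rx, cq, elt_sz, entries) and
the expected element sizes come from the kernel side of this series:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <linux/types.h>
#include <misc/netgpu.h>	/* assumed install path of the uapi header */

/* hypothetical helper: sock_fd is a TCP socket with SOCK_ZEROCOPY set,
 * ctx_fd is the netgpu context fd with a device and region attached. */
static int attach_zc_queues(int sock_fd, int ctx_fd)
{
	struct netgpu_socket_param p;

	memset(&p, 0, sizeof(p));
	p.ctx_fd = ctx_fd;
	p.rx.elt_sz = sizeof(struct iovec);	/* RX queue delivers iovecs */
	p.rx.entries = 1024;			/* rounded to a power of two */
	p.cq.elt_sz = sizeof(__u64);		/* TX completion queue */
	p.cq.entries = 1024;

	/* SIOCPROTOPRIVATE-range command, dispatched by tcp_ioctl() */
	return ioctl(sock_fd, NETGPU_SOCK_IOCTL_ATTACH_QUEUES, &p);
}

On success the kernel fills in the ring sizes and mmap offsets in the
parameter struct, and the application maps the rx/cq rings with mmap()
on the socket fd.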

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/ipv4/tcp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 27de9380ed14..261c28ccc8f6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <net/netgpu.h>
 
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
@@ -636,6 +637,10 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 			answ = READ_ONCE(tp->write_seq) -
 			       READ_ONCE(tp->snd_nxt);
 		break;
+#if IS_ENABLED(CONFIG_NETGPU)
+	case NETGPU_SOCK_IOCTL_ATTACH_QUEUES:	/* SIOCPROTOPRIVATE */
+		return netgpu_attach_socket(sk, (void __user *)arg);
+#endif
 	default:
 		return -ENOIOCTLCMD;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg()
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (13 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 16/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

This flag indicates that the attached data is a zero-copy send,
and the pages should be retrieved from the netgpu module.  The
socket should already have been attached to a netgpu queue.
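
A minimal sketch of the send side, assuming buf points into a region that
was previously registered with netgpu and attached to this socket's
context (error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef MSG_NETDMA
#define MSG_NETDMA	0x8000000	/* value added by this patch */
#endif

/* hypothetical helper: the kernel resolves the pages backing buf from the
 * netgpu module instead of copying the data. */
static ssize_t zc_send(int sock_fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	return sendmsg(sock_fd, &msg, MSG_NETDMA);
}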

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/socket.h | 1 +
 net/ipv4/tcp.c         | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 04d2bc97f497..63816cc25dee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -310,6 +310,7 @@ struct ucred {
 					  */
 
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
+#define MSG_NETDMA  	0x8000000
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 261c28ccc8f6..340ce319edc9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1214,6 +1214,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			uarg->zerocopy = 0;
 	}
 
+	if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+		zc = sk->sk_route_caps & NETIF_F_SG;
+		if (!zc) {
+			err = -EFAULT;
+			goto out_err;
+		}
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
 	    !tp->repair) {
 		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size, uarg);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 16/21] mlx5: remove the umem parameter from mlx5e_open_channel
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (14 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 17/21] mlx5e: add header split ability Jonathan Lemon
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Instead of obtaining the umem parameter from the channel parameters
and passing it to the function, push this down into the function itself.

Move xsk open logic into its own function, in preparation for the
upcoming netgpu commit.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 37 +++++++++++++------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9d5d8b28bcd8..3762d4527afe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -393,7 +393,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->xdpsq   = &c->rq_xdpsq;
 	rq->umem    = umem;
 
-	if (rq->umem)
+	if (xsk)
 		rq->stats = &c->priv->channel_stats[c->ix].xskrq;
 	else
 		rq->stats = &c->priv->channel_stats[c->ix].rq;
@@ -1946,15 +1946,33 @@ static u8 mlx5e_enumerate_lag_port(struct mlx5_core_dev *mdev, int ix)
 	return (ix + port_aff_bias) % mlx5e_get_num_lag_ports(mdev);
 }
 
+static int
+mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
+			struct mlx5e_params *params,
+			struct mlx5e_channel_param *cparam,
+			struct mlx5e_channel *c)
+{
+	struct mlx5e_xsk_param xsk;
+	struct xdp_umem *umem;
+	int err = 0;
+
+	umem = mlx5e_xsk_get_umem(params, params->xsk, ix);
+
+	if (umem) {
+		mlx5e_build_xsk_param(umem, &xsk);
+		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	}
+
+	return err;
+}
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
-			      struct xdp_umem *umem,
 			      struct mlx5e_channel **cp)
 {
 	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
 	struct net_device *netdev = priv->netdev;
-	struct mlx5e_xsk_param xsk;
 	struct mlx5e_channel *c;
 	unsigned int irq;
 	int err;
@@ -1988,9 +2006,9 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	if (unlikely(err))
 		goto err_napi_del;
 
-	if (umem) {
-		mlx5e_build_xsk_param(umem, &xsk);
-		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	/* This opens a second set of shadow queues for xsk */
+	if (params->xdp_prog) {
+		err = mlx5e_xsk_optional_open(priv, ix, params, cparam, c);
 		if (unlikely(err))
 			goto err_close_queues;
 	}
@@ -2351,12 +2369,7 @@ int mlx5e_open_channels(struct mlx5e_priv *priv,
 
 	mlx5e_build_channel_param(priv, &chs->params, cparam);
 	for (i = 0; i < chs->num; i++) {
-		struct xdp_umem *umem = NULL;
-
-		if (chs->params.xdp_prog)
-			umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
-
-		err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
+		err = mlx5e_open_channel(priv, i, &chs->params, cparam, &chs->c[i]);
 		if (err)
 			goto err_close_channels;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 17/21] mlx5e: add header split ability
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (15 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 16/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 18/21] mlx5e: add netgpu entries to mlx5 structures Jonathan Lemon
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Header split may be requested for a specific rq via a flag in the
xsk parameter.  If splitting is enabled (the fixed split offset assumes
IPv6 headers by default), set the wq_type to MLX5_WQ_TYPE_CYCLIC.
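
As an illustration (assuming 4KB pages and the header-size constants from
en.h), the receive fragment layout built by mlx5e_build_rq_frags_info()
with header split enabled looks roughly like:

    frag[0]:    frag_size = TOTAL_HEADERS (94 bytes), frag_stride = 4096
                -> protocol headers, filled from regular host pages
    frag[1..n]: frag_size up to HD_SPLIT_DEFAULT_FRAG_SIZE (4096),
                frag_stride = 4096, frag_source = 1
                -> payload, eligible to be sourced from netgpu buffers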

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  6 +++
 .../ethernet/mellanox/mlx5/core/en/params.c   |  3 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |  1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 47 ++++++++++++++-----
 5 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c44669102626..24d88e8952ed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -58,6 +58,11 @@
 
 extern const struct net_device_ops mlx5e_netdev_ops;
 struct page_pool;
+#define TCP_HDRS_LEN (20 + 20)  /* headers + options */
+#define IP6_HDRS_LEN (40)
+#define MAC_HDR_LEN (14)
+#define TOTAL_HEADERS (TCP_HDRS_LEN + IP6_HDRS_LEN + MAC_HDR_LEN)
+#define HD_SPLIT_DEFAULT_FRAG_SIZE (4096)
 
 #define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
 #define MLX5E_METADATA_ETHER_LEN 8
@@ -538,6 +543,7 @@ enum mlx5e_rq_flag {
 struct mlx5e_rq_frag_info {
 	int frag_size;
 	int frag_stride;
+	int frag_source;
 };
 
 struct mlx5e_rq_frags_info {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 38e4f19d69f8..a83a7d4d2551 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -146,7 +146,8 @@ u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 			  struct mlx5e_params *params,
 			  struct mlx5e_xsk_param *xsk)
 {
-	bool is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
+	bool is_linear_skb = (xsk && xsk->hd_split) ? false :
+		(params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
 		mlx5e_rx_is_linear_skb(params, xsk) :
 		mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index a87273e801b2..eb2d05a7c5b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -9,6 +9,7 @@
 struct mlx5e_xsk_param {
 	u16 headroom;
 	u16 chunk_size;
+	bool hd_split;
 };
 
 struct mlx5e_cq_param {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
index 331ca2b0f8a4..8ecfbcc3c826 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
@@ -72,6 +72,7 @@ void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk)
 {
 	xsk->headroom = xsk_umem_get_headroom(umem);
 	xsk->chunk_size = xsk_umem_get_chunk_size(umem);
+	xsk->hd_split = false;
 }
 
 static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 3762d4527afe..5a0b181f92f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,6 +62,7 @@
 #include "en/xsk/setup.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
+#include "en/netgpu/setup.h"
 #include "en/hv_vhca_stats.h"
 #include "en/devlink.h"
 #include "lib/mlx5.h"
@@ -373,6 +374,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
+	bool hd_split = xsk && xsk->hd_split;
 	u32 rq_xdp_ix;
 	u32 pool_size;
 	int wq_sz;
@@ -381,7 +383,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rqp->wq.db_numa_node = cpu_to_node(c->cpu);
 
-	rq->wq_type = params->rq_wq_type;
+	rq->wq_type = hd_split ? MLX5_WQ_TYPE_CYCLIC : params->rq_wq_type;
 	rq->pdev    = c->pdev;
 	rq->netdev  = c->netdev;
 	rq->tstamp  = c->tstamp;
@@ -508,15 +510,16 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			goto err_free;
 		}
 
-		rq->wqe.skb_from_cqe = xsk ?
-			mlx5e_xsk_skb_from_cqe_linear :
+		rq->wqe.skb_from_cqe =
+			hd_split ? mlx5e_skb_from_cqe_nonlinear :
+			xsk ? mlx5e_xsk_skb_from_cqe_linear :
 			mlx5e_rx_is_linear_skb(params, NULL) ?
 				mlx5e_skb_from_cqe_linear :
 				mlx5e_skb_from_cqe_nonlinear;
 		rq->mkey_be = c->mkey_be;
 	}
 
-	if (xsk) {
+	if (xsk && !hd_split) {
 		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
 						 MEM_TYPE_XSK_BUFF_POOL, NULL);
 		xsk_buff_set_rxq_info(rq->umem, &rq->xdp_rxq);
@@ -2074,16 +2077,20 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 				      struct mlx5e_rq_frags_info *info)
 {
 	u32 byte_count = MLX5E_SW2HW_MTU(params, params->sw_mtu);
-	int frag_size_max = DEFAULT_FRAG_SIZE;
+	bool hd_split = xsk && xsk->hd_split;
+	int frag_size_max;
 	u32 buf_size = 0;
 	int i;
 
+	frag_size_max = hd_split ? HD_SPLIT_DEFAULT_FRAG_SIZE :
+			DEFAULT_FRAG_SIZE;
+
 #ifdef CONFIG_MLX5_EN_IPSEC
 	if (MLX5_IPSEC_DEV(mdev))
 		byte_count += MLX5E_METADATA_ETHER_LEN;
 #endif
 
-	if (mlx5e_rx_is_linear_skb(params, xsk)) {
+	if (!hd_split && mlx5e_rx_is_linear_skb(params, xsk)) {
 		int frag_stride;
 
 		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
@@ -2101,6 +2108,16 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 		frag_size_max = PAGE_SIZE;
 
 	i = 0;
+
+	if (hd_split) {
+		// Start with one fragment for all headers (implementing HDS)
+		info->arr[0].frag_size = TOTAL_HEADERS;
+		info->arr[0].frag_stride = roundup_pow_of_two(PAGE_SIZE);
+		buf_size += TOTAL_HEADERS;
+		// Now, continue with the payload frags.
+		i = 1;
+	}
+
 	while (buf_size < byte_count) {
 		int frag_size = byte_count - buf_size;
 
@@ -2108,8 +2125,10 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 			frag_size = min(frag_size, frag_size_max);
 
 		info->arr[i].frag_size = frag_size;
-		info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
-
+		info->arr[i].frag_stride = roundup_pow_of_two(hd_split ?
+							      PAGE_SIZE :
+							      frag_size);
+		info->arr[i].frag_source = hd_split;
 		buf_size += frag_size;
 		i++;
 	}
@@ -2152,9 +2171,11 @@ void mlx5e_build_rq_param(struct mlx5e_priv *priv,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *rqc = param->rqc;
 	void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
+	bool hd_split = xsk && xsk->hd_split;
+	u8 wq_type = hd_split ? MLX5_WQ_TYPE_CYCLIC : params->rq_wq_type;
 	int ndsegs = 1;
 
-	switch (params->rq_wq_type) {
+	switch (wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		MLX5_SET(wq, wq, log_wqe_num_of_strides,
 			 mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk) -
@@ -2170,10 +2191,10 @@ void mlx5e_build_rq_param(struct mlx5e_priv *priv,
 		ndsegs = param->frags_info.num_frags;
 	}
 
-	MLX5_SET(wq, wq, wq_type,          params->rq_wq_type);
+	MLX5_SET(wq, wq, wq_type,          wq_type);
 	MLX5_SET(wq, wq, end_padding_mode, MLX5_WQ_END_PAD_MODE_ALIGN);
 	MLX5_SET(wq, wq, log_wq_stride,
-		 mlx5e_get_rqwq_log_stride(params->rq_wq_type, ndsegs));
+		 mlx5e_get_rqwq_log_stride(wq_type, ndsegs));
 	MLX5_SET(wq, wq, pd,               mdev->mlx5e_res.pdn);
 	MLX5_SET(rqc, rqc, counter_set_id, priv->q_counter);
 	MLX5_SET(rqc, rqc, vsd,            params->vlan_strip_disable);
@@ -2243,9 +2264,11 @@ void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *cqc = param->cqc;
+	bool hd_split = xsk && xsk->hd_split;
+	u8 wq_type = hd_split ? MLX5_WQ_TYPE_CYCLIC : params->rq_wq_type;
 	u8 log_cq_size;
 
-	switch (params->rq_wq_type) {
+	switch (wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
 			mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 18/21] mlx5e: add netgpu entries to mlx5 structures
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (16 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 17/21] mlx5e: add header split ability Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions Jonathan Lemon
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Modify mlx5e structures in order to add support for netgpu, which
shares some of the same structures as AF_XDP.  Add logic to make sure
they are not both in use.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h         | 12 ++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c    |  3 +++
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h    |  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c    |  3 ++-
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 24d88e8952ed..ae555c6be847 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -365,6 +365,7 @@ struct mlx5e_dma_info {
 		struct page *page;
 		struct xdp_buff *xsk;
 	};
+	bool netgpu_source;
 };
 
 /* XDP packets can be transmitted in different ways. On completion, we need to
@@ -553,6 +554,7 @@ struct mlx5e_rq_frags_info {
 	u8 wqe_bulk;
 };
 
+struct netgpu_ifq;
 struct mlx5e_rq {
 	/* data path */
 	union {
@@ -608,8 +610,9 @@ struct mlx5e_rq {
 	DECLARE_BITMAP(flags, 8);
 	struct page_pool      *page_pool;
 
-	/* AF_XDP zero-copy */
+	/* AF_XDP or NETGPU zero-copy */
 	struct xdp_umem       *umem;
+	struct netgpu_ifq     *netgpu;
 
 	struct work_struct     recover_work;
 
@@ -627,6 +630,7 @@ struct mlx5e_rq {
 
 enum mlx5e_channel_state {
 	MLX5E_CHANNEL_STATE_XSK,
+	MLX5E_CHANNEL_STATE_NETGPU,
 	MLX5E_CHANNEL_NUM_STATES
 };
 
@@ -737,9 +741,13 @@ struct mlx5e_xsk {
 	 * but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
 	 * so rely on our mechanism.
 	 */
-	struct xdp_umem **umems;
+	union {
+		struct xdp_umem **umems;
+		struct netgpu_ifq **ifq_tbl;
+	};
 	u16 refcnt;
 	bool ever_used;
+	bool is_umem;
 };
 
 /* Temporary storage for variables that are allocated when struct mlx5e_priv is
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
index 8ecfbcc3c826..1fad8dbbf59d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
@@ -27,7 +27,10 @@ static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
 				     sizeof(*xsk->umems), GFP_KERNEL);
 		if (unlikely(!xsk->umems))
 			return -ENOMEM;
+		xsk->is_umem = true;
 	}
+	if (!xsk->is_umem)
+		return -EINVAL;
 
 	xsk->refcnt++;
 	xsk->ever_used = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
index bada94973586..13ef03446571 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
@@ -15,6 +15,9 @@ static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
 	if (unlikely(ix >= params->num_channels))
 		return NULL;
 
+	if (unlikely(!xsk->is_umem))
+		return NULL;
+
 	return xsk->umems[ix];
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a0b181f92f7..d75f22471357 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,7 +62,6 @@
 #include "en/xsk/setup.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
-#include "en/netgpu/setup.h"
 #include "en/hv_vhca_stats.h"
 #include "en/devlink.h"
 #include "lib/mlx5.h"
@@ -324,6 +323,8 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
 				if (prev)
 					prev->last_in_page = true;
 			}
+			next_frag.di->netgpu_source =
+						!!frag_info[f].frag_source;
 			*frag = next_frag;
 
 			/* prepare next */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (17 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 18/21] mlx5e: add netgpu entries to mlx5 structures Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28 16:27   ` Greg KH
  2020-07-27 22:44 ` [RFC PATCH v2 20/21] mlx5e: hook up the netgpu functions Jonathan Lemon
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Add the netgpu queue setup/teardown functions, and the interface into
the main netgpu core code.  These will be hooked up to the mlx5 driver
in the next commit.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   1 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   1 +
 .../mellanox/mlx5/core/en/netgpu/setup.c      | 340 ++++++++++++++++++
 .../mellanox/mlx5/core/en/netgpu/setup.h      |  96 +++++
 .../ethernet/mellanox/mlx5/core/en/params.h   |   8 +
 5 files changed, 446 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 99f1ec3b2575..ceedc443666b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -33,6 +33,7 @@ config MLX5_FPGA
 config MLX5_CORE_EN
 	bool "Mellanox 5th generation network adapters (ConnectX series) Ethernet support"
 	depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE
+	depends on NETGPU || !NETGPU
 	select PAGE_POOL
 	select DIMLIB
 	default n
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 10e6886c96ba..5a5966bb3cb5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -41,6 +41,7 @@ mlx5_core-$(CONFIG_MLX5_CLS_ACT)     += en_tc.o en/rep/tc.o en/rep/neigh.o \
 					en/tc_tun_vxlan.o en/tc_tun_gre.o en/tc_tun_geneve.o \
 					en/tc_tun_mplsoudp.o diag/en_tc_tracepoint.o
 mlx5_core-$(CONFIG_MLX5_TC_CT)	     += en/tc_ct.o
+mlx5_core-$(CONFIG_NETGPU)	     += en/netgpu/setup.o
 
 #
 # Core extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
new file mode 100644
index 000000000000..6ece4ad0aed6
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
@@ -0,0 +1,340 @@
+#include "en.h"
+#include "en/xdp.h"
+#include "en/params.h"
+#include "en/netgpu/setup.h"
+
+struct netgpu_ifq *
+mlx5e_netgpu_get_ifq(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+		     u16 ix)
+{
+	if (!xsk || !xsk->ifq_tbl)
+		return NULL;
+
+	if (unlikely(ix >= params->num_channels))
+		return NULL;
+
+	if (unlikely(xsk->is_umem))
+		return NULL;
+
+	return xsk->ifq_tbl[ix];
+}
+
+static int mlx5e_netgpu_get_tbl(struct mlx5e_xsk *xsk)
+{
+	if (!xsk->ifq_tbl) {
+		xsk->ifq_tbl = kcalloc(MLX5E_MAX_NUM_CHANNELS,
+				       sizeof(*xsk->ifq_tbl), GFP_KERNEL);
+		if (unlikely(!xsk->ifq_tbl))
+			return -ENOMEM;
+		xsk->is_umem = false;
+	}
+	if (xsk->is_umem)
+		return -EINVAL;
+
+	xsk->refcnt++;
+	xsk->ever_used = true;
+
+	return 0;
+}
+
+static void mlx5e_netgpu_put_tbl(struct mlx5e_xsk *xsk)
+{
+	if (!--xsk->refcnt) {
+		kfree(xsk->ifq_tbl);
+		xsk->ifq_tbl = NULL;
+	}
+}
+
+static void mlx5e_netgpu_remove_ifq(struct mlx5e_xsk *xsk, u16 ix)
+{
+	xsk->ifq_tbl[ix] = NULL;
+
+	mlx5e_netgpu_put_tbl(xsk);
+}
+
+static int mlx5e_netgpu_add_ifq(struct mlx5e_xsk *xsk, struct netgpu_ifq *ifq,
+				u16 ix)
+{
+	int err;
+
+	err = mlx5e_netgpu_get_tbl(xsk);
+	if (unlikely(err))
+		return err;
+
+	xsk->ifq_tbl[ix] = ifq;
+
+	return 0;
+}
+
+static u16
+mlx5e_netgpu_find_unused_ifq(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params)
+{
+	u16 ix;
+
+	for (ix = 0; ix < params->num_channels; ix++) {
+		if (!mlx5e_netgpu_get_ifq(params, &priv->xsk, ix))
+			break;
+	}
+	return ix;
+}
+
+static int
+mlx5e_redirect_netgpu_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
+{
+	struct mlx5e_redirect_rqt_param direct_rrp = {
+		.is_rss = false,
+		{
+			.rqn = rqn,
+		},
+	};
+
+	u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
+
+	return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
+}
+
+static int
+mlx5e_netgpu_redirect_rqt_to_channel(struct mlx5e_priv *priv,
+				     struct mlx5e_channel *c)
+{
+	return mlx5e_redirect_netgpu_rqt(priv, c->ix, c->xskrq.rqn);
+}
+
+static int
+mlx5e_netgpu_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
+{
+	return mlx5e_redirect_netgpu_rqt(priv, ix, priv->drop_rq.rqn);
+}
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+					   struct mlx5e_channels *chs)
+{
+	int err, i;
+
+	for (i = 0; i < chs->num; i++) {
+		struct mlx5e_channel *c = chs->c[i];
+
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+			continue;
+
+		err = mlx5e_netgpu_redirect_rqt_to_channel(priv, c);
+		if (unlikely(err))
+			goto err_stop;
+	}
+
+	return 0;
+
+err_stop:
+	for (i--; i >= 0; i--) {
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+			continue;
+
+		mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+	}
+
+	return err;
+}
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+					struct mlx5e_channels *chs)
+{
+	int i;
+
+	for (i = 0; i < chs->num; i++) {
+		if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+			continue;
+
+		mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+	}
+}
+
+static void mlx5e_activate_netgpu(struct mlx5e_channel *c)
+{
+	set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+
+	spin_lock(&c->async_icosq_lock);
+	mlx5e_trigger_irq(&c->async_icosq);
+	spin_unlock(&c->async_icosq_lock);
+}
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c)
+{
+	mlx5e_deactivate_rq(&c->xskrq);
+}
+
+static int mlx5e_netgpu_enable_locked(struct mlx5e_priv *priv,
+				      struct netgpu_ifq *ifq, u16 *qid)
+{
+	struct mlx5e_params *params = &priv->channels.params;
+	struct mlx5e_channel *c;
+	int err;
+	u16 ix;
+
+	if (*qid == (u16)-1) {
+		ix = mlx5e_netgpu_find_unused_ifq(priv, params);
+		if (ix >= params->num_channels)
+			return -EBUSY;
+
+		mlx5e_get_qid_for_ch_in_group(params, qid, ix,
+					      MLX5E_RQ_GROUP_XSK);
+	} else {
+		if (!mlx5e_qid_get_ch_if_in_group(params, *qid,
+						  MLX5E_RQ_GROUP_XSK, &ix))
+			return -EINVAL;
+
+		if (unlikely(mlx5e_netgpu_get_ifq(params, &priv->xsk, ix)))
+			return -EBUSY;
+	}
+
+	err = mlx5e_netgpu_add_ifq(&priv->xsk, ifq, ix);
+	if (unlikely(err))
+		return err;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		/* XSK objects will be created on open. */
+		goto validate_closed;
+	}
+
+	c = priv->channels.c[ix];
+
+	err = mlx5e_open_netgpu(priv, params, ifq, c);
+	if (unlikely(err))
+		goto err_remove_ifq;
+
+	mlx5e_activate_netgpu(c);
+
+	/* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
+	 * any Fill Ring entries at the setup stage.
+	 */
+
+	err = mlx5e_netgpu_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
+	if (unlikely(err))
+		goto err_deactivate;
+
+	return 0;
+
+err_deactivate:
+	mlx5e_deactivate_netgpu(c);
+	mlx5e_close_netgpu(c);
+
+err_remove_ifq:
+	mlx5e_netgpu_remove_ifq(&priv->xsk, ix);
+
+	return err;
+
+validate_closed:
+	return 0;
+}
+
+static int mlx5e_netgpu_disable_locked(struct mlx5e_priv *priv, u16 *qid)
+{
+	struct mlx5e_params *params = &priv->channels.params;
+	struct mlx5e_channel *c;
+	struct netgpu_ifq *ifq;
+	u16 ix;
+
+	if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, *qid,
+						   MLX5E_RQ_GROUP_XSK, &ix)))
+		return -EINVAL;
+
+	ifq = mlx5e_netgpu_get_ifq(params, &priv->xsk, ix);
+
+	if (unlikely(!ifq))
+		return -EINVAL;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+		goto remove_ifq;
+
+	c = priv->channels.c[ix];
+	mlx5e_netgpu_redirect_rqt_to_drop(priv, ix);
+	mlx5e_deactivate_netgpu(c);
+	mlx5e_close_netgpu(c);
+
+remove_ifq:
+	mlx5e_netgpu_remove_ifq(&priv->xsk, ix);
+
+	return 0;
+}
+
+static int mlx5e_netgpu_enable_ifq(struct mlx5e_priv *priv,
+				   struct netgpu_ifq *ifq, u16 *qid)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_netgpu_enable_locked(priv, ifq, qid);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+static int mlx5e_netgpu_disable_ifq(struct mlx5e_priv *priv, u16 *qid)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_netgpu_disable_locked(priv, qid);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+int
+mlx5e_netgpu_setup_ifq(struct net_device *dev, struct netgpu_ifq *ifq, u16 *qid)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+
+	return ifq ? mlx5e_netgpu_enable_ifq(priv, ifq, qid) :
+		     mlx5e_netgpu_disable_ifq(priv, qid);
+}
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+		      struct netgpu_ifq *ifq, struct mlx5e_channel *c)
+{
+	struct mlx5e_channel_param *cparam;
+	struct mlx5e_xsk_param xsk = { .hd_split = true };
+	int err;
+
+	cparam = kvzalloc(sizeof(*cparam), GFP_KERNEL);
+	if (!cparam)
+		return -ENOMEM;
+
+	mlx5e_build_rq_param(priv, params, &xsk, &cparam->rq);
+
+	err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rq.cqp,
+			    &c->xskrq.cq);
+	if (unlikely(err))
+		goto err_free_cparam;
+
+	err = mlx5e_open_rq(c, params, &cparam->rq, &xsk, NULL, &c->xskrq);
+	if (unlikely(err))
+		goto err_close_rx_cq;
+	c->xskrq.netgpu = ifq;
+
+	kvfree(cparam);
+
+	set_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+
+	return 0;
+
+err_close_rx_cq:
+	mlx5e_close_cq(&c->xskrq.cq);
+
+err_free_cparam:
+	kvfree(cparam);
+
+	return err;
+}
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c)
+{
+	clear_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+	napi_synchronize(&c->napi);
+	synchronize_rcu(); /* Sync with the XSK wakeup. */
+
+	mlx5e_close_rq(&c->xskrq);
+	mlx5e_close_cq(&c->xskrq.cq);
+
+	memset(&c->xskrq, 0, sizeof(c->xskrq));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
new file mode 100644
index 000000000000..5a199fb1873b
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
@@ -0,0 +1,96 @@
+#ifndef _MLX5_EN_NETGPU_SETUP_H
+#define _MLX5_EN_NETGPU_SETUP_H
+
+#include <net/netgpu.h>
+
+#if IS_ENABLED(CONFIG_NETGPU)
+
+static inline dma_addr_t
+mlx5e_netgpu_get_dma(struct sk_buff *skb, skb_frag_t *frag)
+{
+	struct netgpu_skq *skq = skb_shinfo(skb)->destructor_arg;
+
+	return netgpu_get_dma(skq->ctx, skb_frag_page(frag));
+}
+
+static inline int
+mlx5e_netgpu_get_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
+{
+	struct netgpu_ifq *ifq = rq->netgpu;
+
+	return netgpu_get_page(ifq, &dma_info->page, &dma_info->addr);
+}
+
+static inline void
+mlx5e_netgpu_put_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+		      bool recycle)
+{
+	struct netgpu_ifq *ifq = rq->netgpu;
+	struct page *page = dma_info->page;
+
+	if (page) {
+		put_page(page);
+		netgpu_put_page(ifq, page, recycle);
+	}
+}
+
+static inline bool
+mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count)
+{
+	struct netgpu_ifq *ifq = rq->netgpu;
+
+	/* XXX
+	 * napi_cache_count is not a total count, and this also
+	 * doesn't consider any_cache_count.
+	 */
+	return ifq->napi_cache_count >= count ||
+		sq_cons_avail(&ifq->fill, count - ifq->napi_cache_count);
+}
+
+static inline void
+mlx5e_netgpu_taken(struct mlx5e_rq *rq)
+{
+	struct netgpu_ifq *ifq = rq->netgpu;
+
+	sq_cons_complete(&ifq->fill);
+}
+
+struct netgpu_ifq *
+mlx5e_netgpu_get_ifq(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+                     u16 ix);
+
+int
+mlx5e_netgpu_setup_ifq(struct net_device *dev, struct netgpu_ifq *ifq,
+		       u16 *qid);
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+		      struct netgpu_ifq *ifq, struct mlx5e_channel *c);
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c);
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c);
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+					    struct mlx5e_channels *chs);
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+					struct mlx5e_channels *chs);
+
+#else
+
+#define mlx5e_netgpu_get_dma(skb, frag)				0
+#define mlx5e_netgpu_get_page(rq, dma_info)			0
+#define mlx5e_netgpu_put_page(rq, dma_info, recycle)
+#define mlx5e_netgpu_avail(rq, u8)				false
+#define mlx5e_netgpu_taken(rq)
+#define mlx5e_netgpu_get_ifq(params, xsk, ix)			NULL
+#define mlx5e_netgpu_setup_ifq(dev, ifq, qid)			-EINVAL
+#define mlx5e_open_netgpu(priv, params, ifq, c)			-EINVAL
+#define mlx5e_close_netgpu(c)
+#define mlx5e_deactivate_netgpu(c)
+#define mlx5e_netgpu_redirect_rqts_to_channels(priv, chs)	/* ignored */
+#define mlx5e_netgpu_redirect_rqts_to_drop(priv, chs)
+
+#endif /* IS_ENABLED(CONFIG_NETGPU) */
+
+#endif /* _MLX5_EN_NETGPU_SETUP_H */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index eb2d05a7c5b9..9700a984f5c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -67,6 +67,14 @@ static inline void mlx5e_qid_get_ch_and_group(struct mlx5e_params *params,
 	*group = qid / nch;
 }
 
+static inline void mlx5e_get_qid_for_ch_in_group(struct mlx5e_params *params,
+						 u16 *qid,
+						 u16 ix,
+						 enum mlx5e_rq_group group)
+{
+	*qid = params->num_channels * group + ix;
+}
+
 static inline bool mlx5e_qid_validate(const struct mlx5e_profile *profile,
 				      struct mlx5e_params *params, u64 qid)
 {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 20/21] mlx5e: hook up the netgpu functions
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (18 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
  2020-07-28 19:55 ` [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Stephen Hemminger
  21 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

Hook up all the netgpu functions to the mlx5e driver.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 36 ++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 ++++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 19 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c | 16 ++++-
 6 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ae555c6be847..f6d63e99a6b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -297,7 +297,8 @@ struct mlx5e_cq_decomp {
 
 enum mlx5e_dma_map_type {
 	MLX5E_DMA_MAP_SINGLE,
-	MLX5E_DMA_MAP_PAGE
+	MLX5E_DMA_MAP_PAGE,
+	MLX5E_DMA_MAP_FIXED
 };
 
 struct mlx5e_sq_dma {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index cf425a60cddc..eb5dbcbc0f58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -253,6 +253,9 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
 	case MLX5E_DMA_MAP_PAGE:
 		dma_unmap_page(pdev, dma->addr, dma->size, DMA_TO_DEVICE);
 		break;
+	case MLX5E_DMA_MAP_FIXED:
+		/* DMA mappings are fixed, or managed elsewhere. */
+		break;
 	default:
 		WARN_ONCE(true, "mlx5e_tx_dma_unmap unknown DMA type!\n");
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d75f22471357..36afe73faa0e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,6 +62,7 @@
 #include "en/xsk/setup.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
+#include "en/netgpu/setup.h"
 #include "en/hv_vhca_stats.h"
 #include "en/devlink.h"
 #include "lib/mlx5.h"
@@ -1970,6 +1971,24 @@ mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
 	return err;
 }
 
+static int
+mlx5e_netgpu_optional_open(struct mlx5e_priv *priv, int ix,
+			   struct mlx5e_params *params,
+			   struct mlx5e_channel_param *cparam,
+			   struct mlx5e_channel *c)
+{
+	struct netgpu_ifq *ifq;
+	int err = 0;
+
+	ifq = mlx5e_netgpu_get_ifq(params, params->xsk, ix);
+
+	if (ifq)
+		err = mlx5e_open_netgpu(priv, params, ifq, c);
+
+	return err;
+}
+
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
@@ -2017,6 +2036,11 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			goto err_close_queues;
 	}
 
+	/* This opens a second set of shadow queues for netgpu */
+	err = mlx5e_netgpu_optional_open(priv, ix, params, cparam, c);
+	if (unlikely(err))
+		goto err_close_queues;
+
 	*cp = c;
 
 	return 0;
@@ -2053,6 +2077,9 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
 		mlx5e_deactivate_xsk(c);
 
+	if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+		mlx5e_deactivate_netgpu(c);
+
 	mlx5e_deactivate_rq(&c->rq);
 	mlx5e_deactivate_icosq(&c->async_icosq);
 	mlx5e_deactivate_icosq(&c->icosq);
@@ -2064,6 +2091,10 @@ static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
 	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
 		mlx5e_close_xsk(c);
+
+	if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+		mlx5e_close_netgpu(c);
+
 	mlx5e_close_queues(c);
 	netif_napi_del(&c->napi);
 
@@ -3042,11 +3073,13 @@ void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 	mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
 
 	mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
+	mlx5e_netgpu_redirect_rqts_to_channels(priv, &priv->channels);
 }
 
 void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 {
 	mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
+	mlx5e_netgpu_redirect_rqts_to_drop(priv, &priv->channels);
 
 	mlx5e_redirect_rqts_to_drop(priv);
 
@@ -4581,6 +4614,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_XSK_UMEM:
 		return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
 					    xdp->xsk.queue_id);
+	case XDP_SETUP_NETGPU:
+		return mlx5e_netgpu_setup_ifq(dev, xdp->netgpu.ifq,
+					      &xdp->netgpu.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 74860f3827b1..746fbb417c3a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -50,6 +50,7 @@
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
 #include "en/health.h"
+#include "en/netgpu/setup.h"
 
 static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
 {
@@ -266,8 +267,11 @@ static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
 {
 	if (rq->umem)
 		return mlx5e_xsk_page_alloc_umem(rq, dma_info);
-	else
-		return mlx5e_page_alloc_pool(rq, dma_info);
+
+	if (dma_info->netgpu_source)
+		return mlx5e_netgpu_get_page(rq, dma_info);
+
+	return mlx5e_page_alloc_pool(rq, dma_info);
 }
 
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
@@ -279,6 +283,9 @@ void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
 				struct mlx5e_dma_info *dma_info,
 				bool recycle)
 {
+	if (dma_info->netgpu_source)
+		return mlx5e_netgpu_put_page(rq, dma_info, recycle);
+
 	if (likely(recycle)) {
 		if (mlx5e_rx_cache_put(rq, dma_info))
 			return;
@@ -394,6 +401,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
 			return -ENOMEM;
 	}
 
+	if (rq->netgpu && !mlx5e_netgpu_avail(rq, wqe_bulk))
+		return -ENOMEM;
+
 	for (i = 0; i < wqe_bulk; i++) {
 		struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
 
@@ -402,6 +412,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
 			goto free_wqes;
 	}
 
+	if (rq->netgpu)
+		mlx5e_netgpu_taken(rq);
+
 	return 0;
 
 free_wqes:
@@ -416,12 +429,18 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
 		   struct mlx5e_dma_info *di, u32 frag_offset, u32 len,
 		   unsigned int truesize)
 {
-	dma_sync_single_for_cpu(rq->pdev,
-				di->addr + frag_offset,
-				len, DMA_FROM_DEVICE);
-	page_ref_inc(di->page);
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 			di->page, frag_offset, len, truesize);
+
+	if (skb->zc_netgpu) {
+		di->page = NULL;
+	} else {
+		page_ref_inc(di->page);
+
+		dma_sync_single_for_cpu(rq->pdev,
+					di->addr + frag_offset,
+					len, DMA_FROM_DEVICE);
+	}
 }
 
 static inline void
@@ -1152,16 +1171,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
-	u16 headlen      = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
+	bool hd_split	 = rq->netgpu;
+	u16 header_len	 = hd_split ? TOTAL_HEADERS : MLX5E_RX_MAX_HEAD;
+	u16 headlen      = min_t(u32, header_len, cqe_bcnt);
 	u16 frag_headlen = headlen;
 	u16 byte_cnt     = cqe_bcnt - headlen;
 	struct sk_buff *skb;
 
+	/* RST packets may have short headers (74) and no payload */
+	if (hd_split && headlen != TOTAL_HEADERS && byte_cnt) {
+		/* XXX add drop counter */
+		pr_warn_once("BAD hd_split: headlen %d != %d\n",
+			     headlen, TOTAL_HEADERS);
+		return NULL;
+	}
+
 	/* XDP is not supported in this configuration, as incoming packets
 	 * might spread among multiple pages.
 	 */
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(header_len, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1169,6 +1198,19 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 	prefetchw(skb->data);
 
+	if (hd_split) {
+		/* first frag is only headers, should skip this frag and
+		 * assume that all of the headers already copied to the skb
+		 * inline data.
+		 */
+		frag_info++;
+		frag_headlen = 0;
+		wi++;
+
+		skb->zc_netgpu = 1;
+		skb_shinfo(skb)->destructor_arg = rq->netgpu;
+	}
+
 	while (byte_cnt) {
 		u16 frag_consumed_bytes =
 			min_t(u16, frag_info->frag_size - frag_headlen, byte_cnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index da596de3abba..4a5f884771e4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -39,6 +39,7 @@
 #include "ipoib/ipoib.h"
 #include "en_accel/en_accel.h"
 #include "lib/clock.h"
+#include "en/netgpu/setup.h"
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -207,6 +208,24 @@ mlx5e_txwqe_build_dsegs(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		dseg++;
 	}
 
+	if (skb_netdma(skb)) {
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			int fsz = skb_frag_size(frag);
+
+			dma_addr = mlx5e_netgpu_get_dma(skb, frag);
+
+			dseg->addr       = cpu_to_be64(dma_addr);
+			dseg->lkey       = sq->mkey_be;
+			dseg->byte_count = cpu_to_be32(fsz);
+
+			mlx5e_dma_push(sq, dma_addr, fsz, MLX5E_DMA_MAP_FIXED);
+			num_dma++;
+			dseg++;
+		}
+		return num_dma;
+	}
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		int fsz = skb_frag_size(frag);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index e3dbab2a294c..383289e85b01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -122,6 +122,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	struct mlx5e_rq *xskrq = &c->xskrq;
 	struct mlx5e_rq *rq = &c->rq;
 	bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+	bool netgpu_open = test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
 	bool aff_change = false;
 	bool busy_xsk = false;
 	bool busy = false;
@@ -139,7 +140,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
 
 	if (likely(budget)) { /* budget=0 means: don't poll rx rings */
-		if (xsk_open)
+		if (xsk_open || netgpu_open)
 			work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
 
 		if (likely(budget - work_done))
@@ -159,6 +160,14 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 				mlx5e_post_rx_mpwqes,
 				mlx5e_post_rx_wqes,
 				rq);
+
+	if (netgpu_open) {
+		busy_xsk |= INDIRECT_CALL_2(xskrq->post_wqes,
+					    mlx5e_post_rx_mpwqes,
+					    mlx5e_post_rx_wqes,
+					    xskrq);
+	}
+
 	if (xsk_open) {
 		busy |= mlx5e_poll_xdpsq_cq(&xsksq->cq);
 		busy_xsk |= mlx5e_napi_xsk_post(xsksq, xskrq);
@@ -192,6 +201,11 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	mlx5e_cq_arm(&c->async_icosq.cq);
 	mlx5e_cq_arm(&c->xdpsq.cq);
 
+	if (netgpu_open) {
+		mlx5e_handle_rx_dim(xskrq);
+		mlx5e_cq_arm(&xskrq->cq);
+	}
+
 	if (xsk_open) {
 		mlx5e_handle_rx_dim(xskrq);
 		mlx5e_cq_arm(&xsksq->cq);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (19 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 20/21] mlx5e: hook up the netgpu functions Jonathan Lemon
@ 2020-07-27 22:44 ` Jonathan Lemon
  2020-07-28 16:31   ` Greg KH
  2020-07-28 19:55 ` [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Stephen Hemminger
  21 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

This provides the interface between the netgpu core module and the
nvidia kernel driver.  This should be built as an external module,
pointing to the nvidia build.  For example:

export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
make -C ${kdir} M=`pwd` O=obj $*

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/misc/netgpu/nvidia/Kbuild        |   9 +
 drivers/misc/netgpu/nvidia/Kconfig       |  10 +
 drivers/misc/netgpu/nvidia/netgpu_cuda.c | 416 +++++++++++++++++++++++
 3 files changed, 435 insertions(+)
 create mode 100644 drivers/misc/netgpu/nvidia/Kbuild
 create mode 100644 drivers/misc/netgpu/nvidia/Kconfig
 create mode 100644 drivers/misc/netgpu/nvidia/netgpu_cuda.c

diff --git a/drivers/misc/netgpu/nvidia/Kbuild b/drivers/misc/netgpu/nvidia/Kbuild
new file mode 100644
index 000000000000..10a3b3156f30
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/Kbuild
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+nv_dir = $(NV_PACKAGE_DIR)/kernel
+
+KBUILD_EXTRA_SYMBOLS = $(nv_dir)/Module.symvers
+
+obj-m := netgpu_cuda.o
+
+ccflags-y += -I$(nv_dir)
diff --git a/drivers/misc/netgpu/nvidia/Kconfig b/drivers/misc/netgpu/nvidia/Kconfig
new file mode 100644
index 000000000000..6bb8be158943
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# NetGPU framework
+#
+
+config NETGPU_CUDA
+	tristate "Network/GPU driver for Nvidia"
+	depends on NETGPU && m
+	help
+	  Experimental Network / GPU driver for Nvidia
diff --git a/drivers/misc/netgpu/nvidia/netgpu_cuda.c b/drivers/misc/netgpu/nvidia/netgpu_cuda.c
new file mode 100644
index 000000000000..2cd93dab52ad
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia/netgpu_cuda.c
@@ -0,0 +1,416 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/memory.h>
+#include <linux/interval_tree.h>
+
+#include <net/netgpu.h>
+#include "../netgpu_priv.h"
+
+#include "nvidia/nv-p2p.h"
+
+/* nvidia GPU uses 64K pages */
+#define GPU_PAGE_SHIFT		16
+#define GPU_PAGE_SIZE		(1UL << GPU_PAGE_SHIFT)
+#define GPU_PAGE_MASK		(GPU_PAGE_SIZE - 1)
+
+struct netgpu_cuda_region {
+	struct netgpu_region r;				/* must be first */
+	struct rb_root_cached root;
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+};
+
+struct netgpu_cuda_dmamap {
+	struct netgpu_dmamap map;			/* must be first */
+	unsigned pg_shift;
+	unsigned long pg_mask;
+	u64 *dma;
+	struct nvidia_p2p_dma_mapping *gpu_map;
+};
+
+/* page_range represents one contiguous GPU PA region */
+struct netgpu_page_range {
+	unsigned long pfn;
+	struct resource *res;
+	struct interval_tree_node va_node;
+};
+
+static int nvidia_pg_shift[] = {
+	[NVIDIA_P2P_PAGE_SIZE_4KB]   = 12,
+	[NVIDIA_P2P_PAGE_SIZE_64KB]  = 16,
+	[NVIDIA_P2P_PAGE_SIZE_128KB] = 17,
+};
+
+#define node2page_range(itn) \
+	container_of(itn, struct netgpu_page_range, va_node)
+
+#define region_remove_each(root, first, last, itn)			\
+	while ((itn = interval_tree_iter_first(root, first, last)) &&	\
+	       (interval_tree_remove(itn, root), 1))
+
+#define cuda_region_remove_each(r, itn)					\
+	region_remove_each(&cuda_region(r)->root, r->start,		\
+			   r->start + (r->nr_pages << PAGE_SHIFT) - 1,	\
+			   itn)
+
+static inline struct netgpu_cuda_region *
+cuda_region(struct netgpu_region *r)
+{
+	return (struct netgpu_cuda_region *)r;
+}
+
+static inline struct netgpu_cuda_dmamap *
+cuda_map(struct netgpu_dmamap *map)
+{
+	return (struct netgpu_cuda_dmamap *)map;
+}
+
+static inline struct netgpu_page_range *
+region_find(struct netgpu_region *r, unsigned long start, int count)
+{
+	struct interval_tree_node *itn;
+	unsigned long last;
+
+	last = start + count * PAGE_SIZE - 1;
+
+	itn = interval_tree_iter_first(&cuda_region(r)->root, start, last);
+	return itn ? node2page_range(itn) : 0;
+}
+
+static dma_addr_t
+netgpu_cuda_get_dma(struct netgpu_dmamap *map, unsigned long addr)
+{
+	unsigned long base, idx;
+
+	base = addr - map->start;
+	idx = base >> cuda_map(map)->pg_shift;
+	return cuda_map(map)->dma[idx] + (base & cuda_map(map)->pg_mask);
+}
+
+static int
+netgpu_cuda_get_page(struct netgpu_dmamap *map, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx;
+
+	pr = region_find(map->r, addr, 1);
+	if (!pr)
+		return -EFAULT;
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+
+	*page = pfn_to_page(pr->pfn + idx);
+	get_page(*page);
+	*dma = netgpu_cuda_get_dma(map, addr);
+
+	return 0;
+}
+
+static void
+region_get_pages(struct page **pages, unsigned long pfn, int n)
+{
+	struct page *p;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		p = pfn_to_page(pfn + i);
+		get_page(p);
+		pages[i] = p;
+	}
+}
+
+static int
+netgpu_cuda_get_pages(struct netgpu_region *r, struct page **pages,
+		      unsigned long addr, int count)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx, end;
+	int n;
+
+	pr = region_find(r, addr, count);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+	end = (pr->va_node.last - pr->va_node.start) >> PAGE_SHIFT;
+	n = end - idx + 1;
+	n = min(count, n);
+
+	region_get_pages(pages, pr->pfn + idx, n);
+
+	return n;
+}
+
+static void
+netgpu_cuda_unmap_region(struct netgpu_dmamap *map)
+{
+	struct pci_dev *pdev;
+	int err;
+
+	pdev = cuda_map(map)->gpu_map->pci_dev;
+
+	err = nvidia_p2p_dma_unmap_pages(pdev, cuda_region(map->r)->gpu_pgtbl,
+					 cuda_map(map)->gpu_map);
+	if (err)
+		pr_err("nvidia_p2p_dma_unmap failed: %d\n", err);
+}
+
+static struct netgpu_dmamap *
+netgpu_cuda_map_region(struct netgpu_region *r, struct device *device)
+{
+	struct netgpu_cuda_region *cr = cuda_region(r);
+	struct nvidia_p2p_dma_mapping *gpu_map;
+	struct netgpu_dmamap *map;
+	struct pci_dev *pdev;
+	int err;
+
+	map = kmalloc(sizeof(struct netgpu_cuda_dmamap), GFP_KERNEL);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	pdev = to_pci_dev(device);
+
+	/*
+	 * takes PA from pgtbl, performs mapping, saves mapping
+	 * dma_mapping holds dma mapped addresses, and pdev.
+	 * mem_info contains pgtbl and mapping list.  mapping is added to list.
+	 * rm_p2p_dma_map_pages() does the work.
+	 */
+	err = nvidia_p2p_dma_map_pages(pdev, cr->gpu_pgtbl, &gpu_map);
+	if (err) {
+		kfree(map);
+		return ERR_PTR(err);
+	}
+
+	cuda_map(map)->gpu_map = gpu_map;
+	cuda_map(map)->dma = gpu_map->dma_addresses;
+	cuda_map(map)->pg_shift = nvidia_pg_shift[gpu_map->page_size_type];
+	cuda_map(map)->pg_mask = (1UL << cuda_map(map)->pg_shift) - 1;
+
+	return map;
+}
+
+static struct resource *
+netgpu_add_pages(int nid, u64 start, u64 end)
+{
+	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+
+	return add_memory_pages(nid, start, end - start, &params);
+}
+
+static void
+netgpu_free_pages(struct resource *res)
+{
+	release_memory_pages(res);
+}
+
+static void
+netgpu_free_page_range(struct netgpu_page_range *pr)
+{
+	unsigned long pfn, pfn_end;
+	struct page *page;
+
+	pfn_end = pr->pfn +
+		  ((pr->va_node.last + 1 - pr->va_node.start) >> PAGE_SHIFT);
+
+	/* XXX verify page count is 2! */
+	for (pfn = pr->pfn; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		set_page_count(page, 0);
+	}
+	netgpu_free_pages(pr->res);
+	kfree(pr);
+}
+
+static void
+netgpu_cuda_release_pages(struct netgpu_region *r)
+{
+	struct interval_tree_node *va_node;
+
+	cuda_region_remove_each(r, va_node)
+		netgpu_free_page_range(node2page_range(va_node));
+}
+
+static void
+netgpu_init_pages(u64 va, unsigned long pfn_start, unsigned long pfn_end)
+{
+	unsigned long pfn;
+	struct page *page;
+
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		mm_zero_struct_page(page);
+
+		set_page_count(page, 2);	/* matches host logic */
+		page->page_type = 7;		/* XXX differential flag */
+		__SetPageReserved(page);
+
+		SetPagePrivate(page);
+		set_page_private(page, va);
+		va += PAGE_SIZE;
+	}
+}
+
+static int
+netgpu_add_page_range(struct netgpu_region *r, u64 va, u64 start, u64 end)
+{
+	struct netgpu_page_range *pr;
+	struct resource *res;
+
+	pr = kmalloc(sizeof(*pr), GFP_KERNEL);
+	if (!pr)
+		return -ENOMEM;
+
+	res = netgpu_add_pages(numa_mem_id(), start, end);
+	if (IS_ERR(res)) {
+		kfree(pr);
+		return PTR_ERR(res);
+	}
+
+	pr->pfn = PHYS_PFN(start);
+	pr->va_node.start = va;
+	pr->va_node.last = va + (end - start) - 1;
+	pr->res = res;
+
+	netgpu_init_pages(va, PHYS_PFN(start), PHYS_PFN(end));
+
+	interval_tree_insert(&pr->va_node, &cuda_region(r)->root);
+
+	return 0;
+}
+
+static void
+netgpu_cuda_pgtbl_cb(void *data)
+{
+	struct netgpu_region *r = data;
+
+	/* This is required - nvidia gets unhappy if the page table is
+	 * freed from the page table callback.
+	 */
+	cuda_region(r)->gpu_pgtbl = NULL;
+	netgpu_detach_region(r);
+}
+
+static struct netgpu_region *
+netgpu_cuda_add_region(struct netgpu_mem *mem, const struct iovec *iov)
+{
+	struct nvidia_p2p_page_table *gpu_pgtbl = NULL;
+	u64 va, pa, len, start, end;
+	struct netgpu_region *r;
+	int err, i, gpu_pgsize;
+
+	err = -ENOMEM;
+	r = kzalloc(sizeof(struct netgpu_cuda_region), GFP_KERNEL);
+	if (!r)
+		return ERR_PTR(err);
+
+	start = (u64)iov->iov_base;
+	r->start = round_down(start, GPU_PAGE_SIZE);
+	len = round_up(start - r->start + iov->iov_len, GPU_PAGE_SIZE);
+	r->nr_pages = len >> PAGE_SHIFT;
+
+	r->mem = mem;
+	INIT_LIST_HEAD(&r->ctx_list);
+	INIT_LIST_HEAD(&r->dma_list);
+	spin_lock_init(&r->lock);
+
+	/*
+	 * allocates page table, sets gpu_uuid to owning gpu.
+	 * allocates page array, set PA for each page.
+	 * sets page_size (64K here)
+	 * rm_p2p_get_pages() does the actual work.
+	 */
+	err = nvidia_p2p_get_pages(0, 0, r->start, len, &gpu_pgtbl,
+				   netgpu_cuda_pgtbl_cb, r);
+	if (err)
+		goto out;
+
+	/* gpu pgtbl owns r, will free via netgpu_cuda_pgtbl_cb */
+	cuda_region(r)->gpu_pgtbl = gpu_pgtbl;
+
+	if (!NVIDIA_P2P_PAGE_TABLE_VERSION_COMPATIBLE(gpu_pgtbl)) {
+		pr_err("incompatible page table\n");
+		err = -EINVAL;
+		goto out;
+	}
+
+	gpu_pgsize = 1UL << nvidia_pg_shift[gpu_pgtbl->page_size];
+	if (r->nr_pages != gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE) {
+		pr_err("GPU page count %ld != host page count %ld\n",
+		       gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE,
+		       r->nr_pages);
+		err = -EINVAL;
+		goto out;
+	}
+
+	start = U64_MAX;
+	end = 0;
+
+	for (i = 0; i < gpu_pgtbl->entries; i++) {
+		pa = gpu_pgtbl->pages[i]->physical_address;
+		if (pa != end) {
+			if (end) {
+				err = netgpu_add_page_range(r, va, start, end);
+				if (err)
+					goto out;
+			}
+			start = pa;
+			va = r->start + i * gpu_pgsize;
+		}
+		end = pa + gpu_pgsize;
+	}
+	err = netgpu_add_page_range(r, va, start, end);
+	if (err)
+		goto out;
+
+	return r;
+
+out:
+	netgpu_cuda_release_pages(r);
+	if (gpu_pgtbl)
+		nvidia_p2p_put_pages(0, 0, r->start, gpu_pgtbl);
+	kfree(r);
+
+	return ERR_PTR(err);
+}
+
+static void
+netgpu_cuda_free_region(struct netgpu_mem *mem, struct netgpu_region *r)
+{
+	netgpu_cuda_release_pages(r);
+	if (cuda_region(r)->gpu_pgtbl)
+		nvidia_p2p_put_pages(0, 0, r->start, cuda_region(r)->gpu_pgtbl);
+	kfree(r);
+}
+
+struct netgpu_ops cuda_ops = {
+	.owner		= THIS_MODULE,
+	.memtype	= NETGPU_MEMTYPE_CUDA,
+	.add_region	= netgpu_cuda_add_region,
+	.free_region	= netgpu_cuda_free_region,
+	.map_region	= netgpu_cuda_map_region,
+	.unmap_region	= netgpu_cuda_unmap_region,
+	.get_dma	= netgpu_cuda_get_dma,
+	.get_page	= netgpu_cuda_get_page,
+	.get_pages	= netgpu_cuda_get_pages,
+};
+
+static int __init
+netgpu_cuda_init(void)
+{
+	return netgpu_register(&cuda_ops);
+}
+
+static void __exit
+netgpu_cuda_fini(void)
+{
+	netgpu_unregister(cuda_ops.memtype);
+}
+
+module_init(netgpu_cuda_init);
+module_exit(netgpu_cuda_fini);
+MODULE_LICENSE("GPL v2");
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues
  2020-07-27 22:44 ` [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues Jonathan Lemon
@ 2020-07-28  2:16   ` Jonathan Lemon
  0 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-28  2:16 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

On Mon, Jul 27, 2020 at 03:44:37PM -0700, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>
> 
> Netgpu delivers iovecs to userspace for incoming data, but the
> destination queue must be attached to the socket.  Do this via
> and ioctl call on the socket itself.
> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/ipv4/tcp.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 27de9380ed14..261c28ccc8f6 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -279,6 +279,7 @@
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
>  #include <net/busy_poll.h>
> +#include <net/netgpu.h>
>  
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
> @@ -636,6 +637,10 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>  			answ = READ_ONCE(tp->write_seq) -
>  			       READ_ONCE(tp->snd_nxt);
>  		break;
> +#if IS_ENABLED(CONFIG_NETGPU)
> +	case NETGPU_SOCK_IOCTL_ATTACH_QUEUES:	/* SIOCPROTOPRIVATE */
> +		return netgpu_attach_socket(sk, (void __user *)arg);
> +#endif
>  	default:
>  		return -ENOIOCTLCMD;
>  	}

Actually, this is just ugly, so I'm going to rip it out and have it done
the other way around: (ctx -> sk) instead of (sk -> ctx), so ignore this.
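
For reference, the reverse direction would look roughly like the sketch
below: the netgpu context ioctl takes the socket fd and does the lookup
itself, so tcp_ioctl() stays untouched.  This is only a sketch -- the
netgpu_ctx argument and the reworked netgpu_attach_socket() signature
are placeholders, not necessarily what the next version will use.

static int netgpu_ctx_attach_sock(struct netgpu_ctx *ctx, int sock_fd)
{
	struct socket *sock;
	int err;

	sock = sockfd_lookup(sock_fd, &err);
	if (!sock)
		return err;

	/* set SOCK_ZEROCOPY and hang the per-socket zc queue off the sk */
	err = netgpu_attach_socket(ctx, sock->sk);

	sockfd_put(sock);
	return err;
}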
-- 
Jonathan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module
  2020-07-27 22:44 ` [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module Jonathan Lemon
@ 2020-07-28 16:26   ` Greg KH
  2020-07-28 17:41     ` Jonathan Lemon
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-07-28 16:26 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, Jul 27, 2020 at 03:44:33PM -0700, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>
> 
> Netgpu provides a data path for zero-copy sends and receives
> without having the host CPU touch the data.  Protocol processing
> is done on the host CPU, while data is DMA'd to and from DMA
> mapped memory areas.  The initial code provides transfers between
> (mlx5 / host memory) and (mlx5 / nvidia GPU memory).
> 
> The use case for this module are GPUs used for machine learning,
> which are located near the NICs, and have a high bandwidth PCI
> connection between the GPU/NIC.

Do we have such a GPU driver in the kernel today?  We can't add new
apis/interfaces for no in-kernel users, as you well know.

There's lots of craziness in this patch, but this is just really odd:

> +#if IS_MODULE(CONFIG_NETGPU)
> +#define MAYBE_EXPORT_SYMBOL(s)
> +#else
> +#define MAYBE_EXPORT_SYMBOL(s)	EXPORT_SYMBOL(s)
> +#endif

Why is that needed at all?  Why does no one else in the kernel need such
a thing?

And why EXPORT_SYMBOL() and not EXPORT_SYMBOL_GPL() (I have to ask).

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions
  2020-07-27 22:44 ` [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions Jonathan Lemon
@ 2020-07-28 16:27   ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2020-07-28 16:27 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, Jul 27, 2020 at 03:44:42PM -0700, Jonathan Lemon wrote:
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
> @@ -0,0 +1,340 @@
> +#include "en.h"
> +#include "en/xdp.h"
> +#include "en/params.h"
> +#include "en/netgpu/setup.h"

<snip>

Always run scripts/checkpatch.pl on your patches do you do not get
grumpy driver maintainers telling you to run scripts/checkpatch.pl on
your patches.

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages
  2020-07-27 22:44 ` [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages Jonathan Lemon
@ 2020-07-28 16:28   ` Greg KH
  2020-07-28 18:00     ` Jonathan Lemon
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-07-28 16:28 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, Jul 27, 2020 at 03:44:34PM -0700, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>
> 
> netgpu pages will always have a refcount of at least one (held by
> the netgpu module).  If the skb is marked as containing netgpu ZC
> pages, recycle them back to netgpu.

What???

> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/core/skbuff.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1422b99b7090..50dbb7ce1965 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -591,6 +591,27 @@ static void skb_free_head(struct sk_buff *skb)
>  		kfree(head);
>  }
>  
> +#if IS_ENABLED(CONFIG_NETGPU)
> +static void skb_netgpu_unref(struct skb_shared_info *shinfo)
> +{
> +	struct netgpu_ifq *ifq = shinfo->destructor_arg;
> +	struct page *page;
> +	int i;
> +
> +	/* pages attached for skbs for TX shouldn't come here, since
> +	 * the skb is not marked as "zc_netgpu". (only RX skbs have this).
> +	 * dummy page does come here, but always has elevated refc.
> +	 *
> +	 * Undelivered zc skb's will arrive at this point.
> +	 */
> +	for (i = 0; i < shinfo->nr_frags; i++) {
> +		page = skb_frag_page(&shinfo->frags[i]);
> +		if (page && page_ref_dec_return(page) <= 2)
> +			netgpu_put_page(ifq, page, false);
> +	}
> +}
> +#endif

Besides the basic "no #if in C files" issue here, why is this correct?

> +
>  static void skb_release_data(struct sk_buff *skb)
>  {
>  	struct skb_shared_info *shinfo = skb_shinfo(skb);
> @@ -601,8 +622,15 @@ static void skb_release_data(struct sk_buff *skb)
>  			      &shinfo->dataref))
>  		return;
>  
> -	for (i = 0; i < shinfo->nr_frags; i++)
> -		__skb_frag_unref(&shinfo->frags[i]);
> +#if IS_ENABLED(CONFIG_NETGPU)
> +	if (skb->zc_netgpu && shinfo->nr_frags) {
> +		skb_netgpu_unref(shinfo);
> +	} else
> +#endif
> +	{
> +		for (i = 0; i < shinfo->nr_frags; i++)
> +			__skb_frag_unref(&shinfo->frags[i]);
> +	}

Again, no #if in C code.  But even then, this feels really really wrong.

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
@ 2020-07-28 16:31   ` Greg KH
  2020-07-28 17:18     ` Chris Mason
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-07-28 16:31 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>
> 
> This provides the interface between the netgpu core module and the
> nvidia kernel driver.  This should be built as an external module,
> pointing to the nvidia build.  For example:
> 
> export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
> make -C ${kdir} M=`pwd` O=obj $*

Ok, now you are just trolling us.

Nice job, I shouldn't have read the previous patches.

Please, go get a lawyer to sign-off on this patch, with their corporate
email address on it.  That's the only way we could possibly consider
something like this.

Oh, and we need you to use your corporate email address too, as you are
not putting copyright notices on this code, we will need to know who to
come after in the future.

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 16:31   ` Greg KH
@ 2020-07-28 17:18     ` Chris Mason
  2020-07-28 17:27       ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Mason @ 2020-07-28 17:18 UTC (permalink / raw)
  To: Greg KH; +Cc: Jonathan Lemon, netdev, kernel-team

On 28 Jul 2020, at 12:31, Greg KH wrote:

> On Mon, Jul 27, 2020 at 03:44:44PM -0700, Jonathan Lemon wrote:
>> From: Jonathan Lemon <bsd@fb.com>
>>
>> This provides the interface between the netgpu core module and the
>> nvidia kernel driver.  This should be built as an external module,
>> pointing to the nvidia build.  For example:
>>
>> export NV_PACKAGE_DIR=/w/nvidia/NVIDIA-Linux-x86_64-440.64
>> make -C ${kdir} M=`pwd` O=obj $*
>
> Ok, now you are just trolling us.
>
> Nice job, I shouldn't have read the previous patches.
>
> Please, go get a lawyer to sign-off on this patch, with their 
> corporate
> email address on it.  That's the only way we could possibly consider
> something like this.
>
> Oh, and we need you to use your corporate email address too, as you 
> are
> not putting copyright notices on this code, we will need to know who 
> to
> come after in the future.

Jonathan, I think we need to do a better job talking about patches that 
are just meant to enable possible users vs patches that we actually hope 
the upstream kernel to take.  Obviously code that only supports out of 
tree drivers isn’t a good fit for the upstream kernel.  From the point 
of view of experimenting with these patches, GPUs benefit a lot from 
this functionality so I think it does make sense to have the enabling 
patches somewhere, just not in this series.

We’re finding it more common to have pcie switch hops between a [ GPU, 
NIC ] pair and the CPU, which gives a huge advantage to out of tree 
drivers or extensions that can DMA directly between the GPU/NIC without 
having to copy through the CPU.  I’d love to have an alternative built 
on TCP because that’s where we invest the vast majority of our tuning, 
security and interoperability testing.  It’s just more predictable 
overall.

This isn’t a new story, but if we can layer on APIs that enable this 
cleanly for in-tree drivers, we can work with the vendors to use better 
supported APIs and have a more stable kernel.  Obviously this is an RFC 
and there’s a long road ahead, but as long as the upstream kernel 
doesn’t provide an answer, out of tree drivers are going to fill in 
the weak spots.

Other possible use cases would include also include other GPUs or my 
favorite:

NVME <-> filesystem <-> NIC with io_uring driving the IO and without 
copies.

-chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 17:18     ` Chris Mason
@ 2020-07-28 17:27       ` Christoph Hellwig
  2020-07-28 18:47         ` Chris Mason
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2020-07-28 17:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: Greg KH, Jonathan Lemon, netdev, kernel-team

On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:
> > come after in the future.
> 
> Jonathan, I think we need to do a better job talking about patches that are
> just meant to enable possible users vs patches that we actually hope the
> upstream kernel to take.  Obviously code that only supports out of tree
> drivers isn't a good fit for the upstream kernel.  From the point of view
> of experimenting with these patches, GPUs benefit a lot from this
> functionality so I think it does make sense to have the enabling patches
> somewhere, just not in this series.

Sorry, but his crap is built only for this use case, and that is what
really pissed people off as it very much looks intentional.

> We're finding it more common to have pcie switch hops between a [ GPU, NIC
> ] pair and the CPU, which gives a huge advantage to out of tree drivers or
> extensions that can DMA directly between the GPU/NIC without having to copy
> through the CPU.  I'd love to have an alternative built on TCP because
> that's where we invest the vast majority of our tuning, security and
> interoperability testing.  It's just more predictable overall.
> 
> This isn't a new story, but if we can layer on APIs that enable this
> cleanly for in-tree drivers, we can work with the vendors to use better
> supported APIs and have a more stable kernel.  Obviously this is an RFC and
> there's a long road ahead, but as long as the upstream kernel doesn't
> provide an answer, out of tree drivers are going to fill in the weak spots.
> 
> Other possible use cases would include also include other GPUs or my
> favorite:
> 
> NVME <-> filesystem <-> NIC with io_uring driving the IO and without copies.

And we have all that working with the existing p2pdma infrastructure (at
least if you're using RDMA instead of badly reinventing it, but it could
be added to other users easily).

That infrastructure is EXPORT_SYMBOL_GPL as it should be for
infrastructure like that, and a lot of his crap just seems to be because
he's working around that.

So I really agree with Greg, this very much looks like a deliberate
trolling attempt.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module
  2020-07-28 16:26   ` Greg KH
@ 2020-07-28 17:41     ` Jonathan Lemon
  0 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-28 17:41 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, kernel-team

On Tue, Jul 28, 2020 at 06:26:08PM +0200, Greg KH wrote:
> On Mon, Jul 27, 2020 at 03:44:33PM -0700, Jonathan Lemon wrote:
> > From: Jonathan Lemon <bsd@fb.com>
> > 
> > Netgpu provides a data path for zero-copy sends and receives
> > without having the host CPU touch the data.  Protocol processing
> > is done on the host CPU, while data is DMA'd to and from DMA
> > mapped memory areas.  The initial code provides transfers between
> > (mlx5 / host memory) and (mlx5 / nvidia GPU memory).
> > 
> > The use case for this module are GPUs used for machine learning,
> > which are located near the NICs, and have a high bandwidth PCI
> > connection between the GPU/NIC.
> 
> Do we have such a GPU driver in the kernel today?  We can't add new
> apis/interfaces for no in-kernel users, as you well know.

No, that's what I'm trying to create.  But Jens pointed out that the
main sticking point here seems to be Nvidia, so I'll look into seeing
whether there are some AMD or Intel GPUs I can use.


> There's lots of crazyness in this patch, but this is just really odd:
> 
> > +#if IS_MODULE(CONFIG_NETGPU)
> > +#define MAYBE_EXPORT_SYMBOL(s)
> > +#else
> > +#define MAYBE_EXPORT_SYMBOL(s)	EXPORT_SYMBOL(s)
> > +#endif
> 
> Why is that needed at all?  Why does no one else in the kernel need such
> a thing?

Really, this is just development code, allowing the netgpu to be built
as a loadable module.  I'll rip it out.


> And why EXPORT_SYMBOL() and not EXPORT_SYMBOL_GPL() (I have to ask).

Shorter typing, didn't think to add _GPL, I'll do that.
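
For illustration, a minimal sketch of the conventional pattern being asked
for here: an unconditional GPL-only export next to the symbol's definition,
with Kconfig deciding whether netgpu is built in or modular.  The function
name and body are hypothetical:

    #include <linux/export.h>
    #include <linux/mm.h>

    /* Hypothetical netgpu helper: exported unconditionally, GPL-only,
     * so built-in and modular configurations share the same code path. */
    void netgpu_put_page(struct page *page)
    {
            put_page(page);         /* placeholder body */
    }
    EXPORT_SYMBOL_GPL(netgpu_put_page);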
-- 
Jonathan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages
  2020-07-28 16:28   ` Greg KH
@ 2020-07-28 18:00     ` Jonathan Lemon
  2020-07-28 18:26       ` Greg KH
  0 siblings, 1 reply; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-28 18:00 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, kernel-team

On Tue, Jul 28, 2020 at 06:28:25PM +0200, Greg KH wrote:
> On Mon, Jul 27, 2020 at 03:44:34PM -0700, Jonathan Lemon wrote:
> > From: Jonathan Lemon <bsd@fb.com>
> > 
> > netgpu pages will always have a refcount of at least one (held by
> > the netgpu module).  If the skb is marked as containing netgpu ZC
> > pages, recycle them back to netgpu.
> 
> What???

Yes, this is page refcount elevation.  ZONE_DEVICE pages do this also,
which is hidden in put_devmap_managed_page().  I would really like to
find a generic solution for this, as it has come up in other cases as
well (page recycling for page_pool, for example).  Some way to say "this
page is different", and a separate routine to release refcounts.

Can we have a discussion on this possibility?
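
To make the idea concrete, here is a rough sketch of what such a generic
hook could look like (page_is_managed() and page_managed_put() are invented
names, loosely modeled on the existing put_devmap_managed_page() special
case):

    /* Rough sketch, invented names: let put_page() hand "managed" pages
     * (devmap, page_pool, netgpu, ...) back to their owner for recycling
     * instead of freeing them outright. */
    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            if (page_is_managed(page)) {
                    page_managed_put(page);   /* owner-specific release */
                    return;
            }

            if (put_page_testzero(page))
                    __put_page(page);
    }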
-- 
Jonathan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages
  2020-07-28 18:00     ` Jonathan Lemon
@ 2020-07-28 18:26       ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2020-07-28 18:26 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Tue, Jul 28, 2020 at 11:00:40AM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 06:28:25PM +0200, Greg KH wrote:
> > On Mon, Jul 27, 2020 at 03:44:34PM -0700, Jonathan Lemon wrote:
> > > From: Jonathan Lemon <bsd@fb.com>
> > > 
> > > netgpu pages will always have a refcount of at least one (held by
> > > the netgpu module).  If the skb is marked as containing netgpu ZC
> > > pages, recycle them back to netgpu.
> > 
> > What???
> 
> Yes, this is page refcount elevation.  ZONE_DEVICE pages do this also,
> which is hidden in put_devmap_managed_page().  I would really like to
> find a generic solution for this, as it has come up in other cases as
> well (page recycling for page_pool, for example).  Some way to say "this
> page is different", and a separate routine to release refcounts.
> 
> Can we have a discussion on this possibility?

Then propose a generic solution, not a "solution" like this.

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
  2020-07-28 17:27       ` Christoph Hellwig
@ 2020-07-28 18:47         ` Chris Mason
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Mason @ 2020-07-28 18:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Greg KH, Jonathan Lemon, netdev, kernel-team

On 28 Jul 2020, at 13:27, Christoph Hellwig wrote:

> On Tue, Jul 28, 2020 at 01:18:48PM -0400, Chris Mason wrote:
>>> come after in the future.
>>
>> Jonathan, I think we need to do a better job talking about patches that
>> are just meant to enable possible users vs patches that we actually hope
>> the upstream kernel will take.  Obviously code that only supports
>> out-of-tree drivers isn't a good fit for the upstream kernel.  From the
>> point of view of experimenting with these patches, GPUs benefit a lot
>> from this functionality, so I think it does make sense to have the
>> enabling patches somewhere, just not in this series.
>
> Sorry, but his crap is built only for this use case, and that is what
> really pissed people off as it very much looks intentional.

No, we've had workloads asking for better zero copy solutions for ages.
The goal is to address both this specialized workload and the general
case zero copy tx/rx.

-chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU.
  2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
                   ` (20 preceding siblings ...)
  2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
@ 2020-07-28 19:55 ` Stephen Hemminger
  2020-07-28 20:43   ` Jonathan Lemon
  21 siblings, 1 reply; 35+ messages in thread
From: Stephen Hemminger @ 2020-07-28 19:55 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: netdev, kernel-team

On Mon, 27 Jul 2020 15:44:23 -0700
Jonathan Lemon <jonathan.lemon@gmail.com> wrote:

> Current limitations:
>   - mlx5 only, header splitting is at a fixed offset.
>   - currently only TCP protocol delivery is performed.
>   - TX completion notification is planned, but not in this patchset.
>   - not compatible with xsk (re-uses same datastructures)
>   - not compatible with bpf payload inspection

This is a good summary of why TCP Offload is not a mainstream solution.
Look back in the archives and you will find lots of presentations about
why TOE sucks.

You also forgot no VRF, no namespaces, no firewall, no containers, no encapsulation.
It acts as proof that if you cut out everything you can build something faster.
But it is not suitable for upstream or production.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU.
  2020-07-28 19:55 ` [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Stephen Hemminger
@ 2020-07-28 20:43   ` Jonathan Lemon
  0 siblings, 0 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-28 20:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, kernel-team

On 28 Jul 2020, at 12:55, Stephen Hemminger wrote:

> On Mon, 27 Jul 2020 15:44:23 -0700
> Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
>> Current limitations:
>>   - mlx5 only, header splitting is at a fixed offset.
>>   - currently only TCP protocol delivery is performed.
>>   - TX completion notification is planned, but not in this patchset.
>>   - not compatible with xsk (re-uses same datastructures)
>>   - not compatible with bpf payload inspection
>
> This is a good summary of why TCP Offload is not a mainstream solution.
> Look back in the archives and you will find lots of presentations about
> why TOE sucks.

I.. agree with this?  But I'm failing to see what TCP offload
(or any HW offload) has to do with the change.  I'm trying to do
the opposite of HW offload here - keeping the protocol in the kernel.
Although obviously what I'm doing is not suitable for all use cases.

>
> You also forgot no VRF, no namespaces, no firewall, no containers, no
> encapsulation.
> It acts as proof that if you cut out everything you can build something
> faster.
> But it is not suitable for upstream or production.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-07-28 20:43 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 02/21] mm/memory_hotplug: add {add|release}_memory_pages Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 04/21] kernel/user: export free_uid Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 05/21] uapi/misc: add shqueue.h for shared queues Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 06/21] include: add netgpu UAPI and kernel definitions Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 07/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 09/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module Jonathan Lemon
2020-07-28 16:26   ` Greg KH
2020-07-28 17:41     ` Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages Jonathan Lemon
2020-07-28 16:28   ` Greg KH
2020-07-28 18:00     ` Jonathan Lemon
2020-07-28 18:26       ` Greg KH
2020-07-27 22:44 ` [RFC PATCH v2 12/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues Jonathan Lemon
2020-07-28  2:16   ` Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 16/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 17/21] mlx5e: add header split ability Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 18/21] mlx5e: add netgpu entries to mlx5 structures Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions Jonathan Lemon
2020-07-28 16:27   ` Greg KH
2020-07-27 22:44 ` [RFC PATCH v2 20/21] mlx5e: hook up the netgpu functions Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
2020-07-28 16:31   ` Greg KH
2020-07-28 17:18     ` Chris Mason
2020-07-28 17:27       ` Christoph Hellwig
2020-07-28 18:47         ` Chris Mason
2020-07-28 19:55 ` [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Stephen Hemminger
2020-07-28 20:43   ` Jonathan Lemon
