netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU.
@ 2020-07-27 22:44 Jonathan Lemon
  2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
                   ` (21 more replies)
  0 siblings, 22 replies; 35+ messages in thread
From: Jonathan Lemon @ 2020-07-27 22:44 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

From: Jonathan Lemon <bsd@fb.com>

[ RESENDING, as apparently the initial submission was lost ]

This series is a working RFC proof-of-concept that implements DMA
zero-copy between the NIC and a GPU device for the data path, while
keeping the protocol processing on the host CPU.

This also works for zero-copy send/recv to host (CPU) memory.

Current limitations:
  - mlx5 only, header splitting is at a fixed offset.
  - currently only TCP protocol delivery is performed.
  - TX completion notification is planned, but not in this patchset.
  - not compatible with xsk (re-uses same datastructures)
  - not compatible with bpf payload inspection

Changes since v1:
  - user api restructured to provide more flexibility.
  - iommu issues resolved
  - performance issues fixed.
  - nvidia bits are built as an external module.

The next section provides a brief overview of how things work.

A transport context (aka RSS object) is created for a device, which acts
as a container for a group of queues.  Queues are opened on a context,
these correcpond to hardware queues on the NIC.  There exists the
ability to request a sepcific numbered queue, or "any" queue.  Only RX
queues are needed, the standard TX queues are used for packet
transmission.

Memory regions which participate in zero-copy transmission (either send
or recieve) are registered with a memory object, and then a specific
region can be attached to a context for usage.  This way, multiple
contexts can share the same memory regions.  These areas can be used as
either RX packet buffers or TX data areas (or both).  The memory can
come from either malloc/mmap or cudaMalloc().  The latter call provides
a handle to the userspace application, but the memory region is only
accessible to the GPU.

A socket is created and registered with the context, which sets
SOCK_ZEROCOPY, also creates a per-socket queue for recieving zero-copy
data.

Asymmetrical data paths are possible (zc TX, normal RX), and vice versa,
but the curreent PoC sets things up for symmetrical transport.  The
application needs to provide the RX buffers to the ifq fill queue,
similar to AF_XDP.

Once things are set up, data is sent to the network with sendmsg().  The
iovecs provided contain an address in the region previously registered.
The normal protocol stack processing constructs the packet, but the data
is not touched by the stack.  In this phase, the application is not
notified when the protocol processing is complete and the data area is
safe to modify again.

For RX, packets undergo the usual protocol processing and are delivered
up to the socket receive queue.  At this point, the skb data fragments
are delivered to the application as iovecs through an AF_XDP style queue
which belongs to the socket.  The application can poll for readability,
but does not use read() to receive the data.

The initial application used is iperf3, a modified version with the
userspace library corresponding to this patch is available at:
    https://github.com/jlemon/iperf
    https://github.com/jlemon/netgpu

Running "iperf3 -s -z --dport 8888" (host memory) on a 12Gbps link:
    11.4 Gbit/sec receive
    10.8 Gbit/sec transmit

Running "iperf3 -s -z --dport 8888 --gpu" on a 25Gbps link:
    23.4 Gbit/sec receive
    23.8 Gbit/sec transmit

For the GPU runs, the Intel PCI monitoring tools were used to confirm
that the host PCI bus was mostly idle. 

Patch series:
  1,4    : cleanup & extension
  2,3    : extend mm, allowing allocation of specific memory regions
  5,6,7  : add include support files
  8,9,11 : skbuff core changes for handling zc frags
  10,21  : netgpu and nvidia code
  12-15  : changes to connect up netgpu code
  16-20  : mlx5 driver changes

Comments eagerly solicited.
--
Jonathan


Jonathan Lemon (21):
  linux/log2.h: enclose macro arg in parens
  mm/memory_hotplug: add {add|release}_memory_pages
  mm: Allow DMA mapping of pages which are not online
  kernel/user: export free_uid
  uapi/misc: add shqueue.h for shared queues
  include: add netgpu UAPI and kernel definitions
  netdevice: add SETUP_NETGPU to the netdev_bpf structure
  skbuff: add a zc_netgpu bitflag
  core/skbuff: use skb_zdata for testing whether skb is zerocopy
  netgpu: add network/gpu/host dma module
  core/skbuff: add page recycling logic for netgpu pages
  lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
  net/tcp: Pad TCP options out to a fixed size for netgpu
  net/tcp: add netgpu ioctl setting up zero copy RX queues
  net/tcp: add MSG_NETDMA flag for sendmsg()
  mlx5: remove the umem parameter from mlx5e_open_channel
  mlx5e: add header split ability
  mlx5e: add netgpu entries to mlx5 structures
  mlx5e: add the netgpu driver functions
  mlx5e: hook up the netgpu functions
  netgpu/nvidia: add Nvidia plugin for netgpu

 drivers/misc/Kconfig                          |    1 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/netgpu/Kconfig                   |   14 +
 drivers/misc/netgpu/Makefile                  |    6 +
 drivers/misc/netgpu/netgpu_host.c             |  284 ++++
 drivers/misc/netgpu/netgpu_main.c             | 1215 +++++++++++++++++
 drivers/misc/netgpu/netgpu_mem.c              |  351 +++++
 drivers/misc/netgpu/netgpu_priv.h             |   88 ++
 drivers/misc/netgpu/netgpu_stub.c             |  166 +++
 drivers/misc/netgpu/netgpu_stub.h             |   19 +
 drivers/misc/netgpu/nvidia/Kbuild             |    9 +
 drivers/misc/netgpu/nvidia/Kconfig            |   10 +
 drivers/misc/netgpu/nvidia/netgpu_cuda.c      |  416 ++++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |    1 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   21 +-
 .../mellanox/mlx5/core/en/netgpu/setup.c      |  340 +++++
 .../mellanox/mlx5/core/en/netgpu/setup.h      |   96 ++
 .../ethernet/mellanox/mlx5/core/en/params.c   |    3 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    9 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |    3 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |    4 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |    3 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  121 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   58 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   19 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   16 +-
 include/linux/dma-mapping.h                   |    4 +-
 include/linux/log2.h                          |    2 +-
 include/linux/memory_hotplug.h                |    4 +
 include/linux/mmzone.h                        |    7 +
 include/linux/netdevice.h                     |    6 +
 include/linux/skbuff.h                        |   27 +-
 include/linux/socket.h                        |    1 +
 include/linux/uio.h                           |    4 +
 include/net/netgpu.h                          |   66 +
 include/uapi/misc/netgpu.h                    |   69 +
 include/uapi/misc/shqueue.h                   |  200 +++
 kernel/user.c                                 |    1 +
 lib/iov_iter.c                                |   53 +
 mm/memory_hotplug.c                           |   65 +-
 net/core/datagram.c                           |    9 +-
 net/core/skbuff.c                             |   50 +-
 net/ipv4/tcp.c                                |   13 +
 net/ipv4/tcp_output.c                         |   20 +
 45 files changed, 3827 insertions(+), 49 deletions(-)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/netgpu_host.c
 create mode 100644 drivers/misc/netgpu/netgpu_main.c
 create mode 100644 drivers/misc/netgpu/netgpu_mem.c
 create mode 100644 drivers/misc/netgpu/netgpu_priv.h
 create mode 100644 drivers/misc/netgpu/netgpu_stub.c
 create mode 100644 drivers/misc/netgpu/netgpu_stub.h
 create mode 100644 drivers/misc/netgpu/nvidia/Kbuild
 create mode 100644 drivers/misc/netgpu/nvidia/Kconfig
 create mode 100644 drivers/misc/netgpu/nvidia/netgpu_cuda.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
 create mode 100644 include/net/netgpu.h
 create mode 100644 include/uapi/misc/netgpu.h
 create mode 100644 include/uapi/misc/shqueue.h

-- 
2.24.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-07-28 20:43 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 22:44 [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 01/21] linux/log2.h: enclose macro arg in parens Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 02/21] mm/memory_hotplug: add {add|release}_memory_pages Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 03/21] mm: Allow DMA mapping of pages which are not online Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 04/21] kernel/user: export free_uid Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 05/21] uapi/misc: add shqueue.h for shared queues Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 06/21] include: add netgpu UAPI and kernel definitions Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 07/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 08/21] skbuff: add a zc_netgpu bitflag Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 09/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 10/21] netgpu: add network/gpu/host dma module Jonathan Lemon
2020-07-28 16:26   ` Greg KH
2020-07-28 17:41     ` Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 11/21] core/skbuff: add page recycling logic for netgpu pages Jonathan Lemon
2020-07-28 16:28   ` Greg KH
2020-07-28 18:00     ` Jonathan Lemon
2020-07-28 18:26       ` Greg KH
2020-07-27 22:44 ` [RFC PATCH v2 12/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 13/21] net/tcp: Pad TCP options out to a fixed size for netgpu Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 14/21] net/tcp: add netgpu ioctl setting up zero copy RX queues Jonathan Lemon
2020-07-28  2:16   ` Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 15/21] net/tcp: add MSG_NETDMA flag for sendmsg() Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 16/21] mlx5: remove the umem parameter from mlx5e_open_channel Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 17/21] mlx5e: add header split ability Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 18/21] mlx5e: add netgpu entries to mlx5 structures Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 19/21] mlx5e: add the netgpu driver functions Jonathan Lemon
2020-07-28 16:27   ` Greg KH
2020-07-27 22:44 ` [RFC PATCH v2 20/21] mlx5e: hook up the netgpu functions Jonathan Lemon
2020-07-27 22:44 ` [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu Jonathan Lemon
2020-07-28 16:31   ` Greg KH
2020-07-28 17:18     ` Chris Mason
2020-07-28 17:27       ` Christoph Hellwig
2020-07-28 18:47         ` Chris Mason
2020-07-28 19:55 ` [RFC PATCH v2 00/21] netgpu: networking between NIC and GPU/CPU Stephen Hemminger
2020-07-28 20:43   ` Jonathan Lemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).