Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu

From: Jason Gunthorpe <jgg@nvidia.com>
To: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>, <netdev@vger.kernel.org>,
	<kernel-team@fb.com>, <robin.murphy@arm.com>,
	<akpm@linux-foundation.org>, <davem@davemloft.net>,
	<kuba@kernel.org>, <willemb@google.com>, <edumazet@google.com>,
	<steffen.klassert@secunet.com>, <saeedm@mellanox.com>,
	<maximmi@mellanox.com>, <bjorn.topel@intel.com>,
	<magnus.karlsson@intel.com>, <borisp@mellanox.com>,
	<david@redhat.com>
Subject: Re: [RFC PATCH v2 21/21] netgpu/nvidia: add Nvidia plugin for netgpu
Date: Tue, 28 Jul 2020 20:38:06 -0300	[thread overview]
Message-ID: <20200728233806.GC16789@nvidia.com> (raw)
In-Reply-To: <20200728210116.56potw45eyptmlc7@bsd-mbp.dhcp.thefacebook.com>

On Tue, Jul 28, 2020 at 02:01:16PM -0700, Jonathan Lemon wrote:
> On Tue, Jul 28, 2020 at 03:19:04PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 27, 2020 at 06:48:12PM -0700, Jonathan Lemon wrote:
> > 
> > > While the current GPU utilized is nvidia, there's nothing in the rest of
> > > the patches specific to Nvidia - an Intel or AMD GPU interface could be
> > > equally workable.
> > 
> > I think that is very misleading.
> > 
> > It looks like this patch, and all the ugly MM stuff, is done the way
> > it is *specifically* to match the clunky nv_p2p interface that only
> > the NVIDIA driver exposes.
> 
> For /this/ patch [21], this is quite true.  I'm forced to use the nv_p2p
> API if I want to use the hardware that I have.  What's being overlooked
> is that the host mem driver does not do this, nor would another GPU
> if it used p2p_dma.  I'm just providing get_page, put_page, get_dma.

Not really, the design copied the nv_p2p api design directly into
struct netgpu_functions and then aligned the rest of the parts to use
it too. Yes, other GPU drivers could also be squeezed into this API,
but if you'd never looked at the NVIDIA driver you'd never pick such a
design. It is inherently disconnected from the MM.

> > Any approach done in tree, where we can actually modify the GPU
> > driver, would do sane things like have the GPU driver itself create
> > the MEMORY_DEVICE_PCI_P2PDMA pages, use the P2P DMA API framework, use
> > dmabuf for the cross-driver attachment, etc, etc.
> 
> So why doesn't Nvidia implement the above in the driver?
> Actually a serious question, not trolling here.

A kernel mailing list is not appropriate place to discuss a
proprietary out of tree driver, take questions like that with your
support channel.

> > If you are serious about advancing this then the initial patches in a
> > long road must be focused on building up the core kernel
> > infrastructure for P2P DMA to a point where netdev could consume
> > it. There has been a lot of different ideas thrown about on how to do
> > this over the years.
> 
> Yes, I'm serious about doing this work, and may not have seen or
> remember all the various ideas I've seen over time.  The netstack
> operates on pages - are you advocating replacing them with sglists?

So far, the general expectation is that any pages would be ZONE_DEVICE
MEMORY_DEVICE_PCI_P2PDMA pages created by the PCI device's driver to
cover the device's BAR. These are __iomem pages so they can't be
intermixed in the kernel with system memory pages. That detail has
a been a large stumbling block in most cases.

Resolving this design issue removes most of the MM hackery in the
netgpu. Though, I have no idea if you can intermix ZONE_DEVICE pages
into skb's in the net stack.

From there, I'd expect the pages are mmaped into userspace in a VMA or
passed into a userspace dmabuf FD.

At this point, consumers, like the net stack should rely on some core
APIs to extract the pages and DMA maps from the user objects. Either
some new pin_user_pages() variant for VMAs or via the work that was
started on the dma_buf_map_attachment() for dmabuf.

From there it needs to handle everything carefully and then call over
to the pcip2p code to validate and dma map them. There are many
missing little details along this path.

Overall there should be nothing like 'netgpu'. This is all just some
special case of an existing user memory flow where the user pages
being touched are allowed to be P2P pages or system memory and the
flow knows how to deal with the difference.

More or less. All of these touch points need beefing up and additional
features.

> > > I think this is a better patch than all the various implementations of
> > > the protocol stack in the form of RDMA, driver code and device firmware.
> > 
> > Oh? You mean "better" in the sense the header split offload in the NIC
> > is better liked than a full protocol running in the NIC?
> 
> Yes.  The NIC firmware should become simpler, not more complicated.

Do you have any application benchmarks? The typical AI communication
pattern is very challenging and a state of the art RDMA implementation
gets incredible performance.

Jason