From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
To: Benjamin Herrenschmidt
Cc: Dan Williams, Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
    Sagi Grimberg, "James E.J. Bottomley", "Martin K. Petersen",
    Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
    Jerome Glisse, linux-pci@vger.kernel.org, linux-scsi,
    linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
    linux-nvdimm, linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Date: Tue, 18 Apr 2017 10:45:57 -0600
Message-ID: <20170418164557.GA7181@obsidianresearch.com>
In-Reply-To: <1492381396.25766.43.camel@kernel.crashing.org>
References: <1492381396.25766.43.camel@kernel.crashing.org>

On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:

> Thanks :-) There's a reason why I'm insisting on this. We have constant
> requests for this today. We have hacks in the GPU drivers to do it for
> GPUs behind a switch, but those are just that, ad-hoc hacks in the
> drivers. We have similar grossness around the corner with some CAPI
> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack NVMe devices.

A lot of people feel this way in the RDMA community too. We have had
vendors shipping out-of-tree code to enable P2P for RDMA with GPUs for
years and years now. :(

Attempts to get things into mainline have always run into the same sort
of road blocks you've identified in this thread.

FWIW, I read this discussion and it sounds closer to an agreement than
I've ever seen in the past.

From Ben's comments, I would think that the 'first class' support that
is needed here is simply a function to return the 'struct device'
backing a CPU address range.

This is the minimal required information for the arch or IOMMU code
under the dma ops to figure out the fabric source/dest, compute the
traffic path, determine if P2P is even possible, what translation
hardware is crossed, and what DMA address should be used.
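To make that concrete, I'm picturing something with roughly this shape.
This is only a sketch - the helper name and the way it is used below are
invented for illustration, nothing like it exists in the tree today:

    #include <linux/device.h>
    #include <linux/pci.h>
    #include <linux/types.h>

    /*
     * Invented for illustration -- not an existing kernel API.  Given a
     * CPU physical address range, return the struct device whose BAR
     * (or other memory) backs it, or NULL if it is ordinary system RAM.
     */
    struct device *dev_backing_address_range(phys_addr_t start, size_t len);

    /*
     * The arch/IOMMU code under the dma ops could then use the answer
     * to decide whether a mapping request is P2P at all (illustrative):
     */
    static bool transfer_is_p2p(phys_addr_t paddr, size_t len)
    {
            struct device *target = dev_backing_address_range(paddr, len);

            /* NULL means plain host memory, so nothing special to do */
            return target && dev_is_pci(target);
    }

Once the backing device is known, everything else (path, translation,
bus address) can be derived by the arch code from the topology it
already has.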
If there is going to be more core support for this stuff, I think it
will be under the topic of more robustly describing the fabric to the
core, and core helpers to extract data from that description: e.g.
compute the path, check if the path crosses translation, etc.

But that isn't really related to P2P, and is probably better left to
the arch authors to figure out where they need to enhance the existing
topology data.
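The sort of core helpers I mean, purely as a shape and with invented
names - questions the dma ops code could ask of a richer fabric
description instead of open-coding bus walks everywhere:

    #include <linux/device.h>

    /* Invented names, illustration only. */

    /* Closest common upstream point (e.g. shared PCI switch) of two devices */
    struct device *fabric_common_upstream(struct device *a, struct device *b);

    /* Does traffic from src to dst pass through an IOMMU/translation unit? */
    bool fabric_path_crosses_translation(struct device *src, struct device *dst);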
I think the key agreement to get out of Logan's series is that P2P DMA
means:

- The BAR will be backed by struct pages
- Passing the CPU __iomem address of the BAR to the DMA API is valid
  and, long term, dma ops providers are expected to fail or return the
  right DMA address
- Mapping BAR memory into userspace and back to the kernel via
  get_user_pages works transparently, and with the DMA API above
- The dma ops provider must be able to tell if source memory is BAR
  mapped and recover the PCI device backing the mapping

At least this is what we'd like in RDMA. :)

FWIW, RDMA probably wouldn't want to use a p2mem device either; we
already have APIs that map BAR memory to user space, and would like to
keep using them. An 'enable P2P for bar' helper function sounds better
to me.
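Roughly the shape I have in mind, with all names invented just to
illustrate - one opt-in call for the driver that owns the BAR, plus the
inverse lookups the dma ops provider needs for the last bullet above:

    #include <linux/pci.h>
    #include <linux/mm_types.h>

    /*
     * All invented for illustration -- none of these exist today.
     *
     * A driver that already mmaps its BAR to userspace would call this
     * once so the BAR gets struct pages behind it, and the normal
     * get_user_pages()/DMA API path just works on those mappings:
     */
    int pci_enable_p2p_bar(struct pci_dev *pdev, int bar);

    /*
     * The dma ops provider then needs the inverse: given a page handed
     * in through the DMA API, tell whether it is BAR backed and which
     * PCI device owns the BAR, so it can work out the bus address (or
     * fail if the fabric can't do the P2P transfer):
     */
    bool page_is_p2p_bar(struct page *page);
    struct pci_dev *page_to_p2p_provider(struct page *page);

I.e. no separate p2mem device, just an opt-in on the BAR plus enough
information for the mapping side to do the right thing.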
Jason