From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1492207643.25766.18.camel@kernel.crashing.org>
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sat, 15 Apr 2017 08:07:23 +1000
To: Bjorn Helgaas, Logan Gunthorpe
Cc: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
 "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe, Steve Wise,
 Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch, Jerome Glisse,
 linux-pci@vger.kernel.org, linux-scsi@vger.kernel.org,
 linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
 linux-nvdimm@ml01.01.org, linux-kernel@vger.kernel.org
In-Reply-To: <20170414190452.GA15679@bhelgaas-glaptop.roam.corp.google.com>
References: <1490911959-5146-1-git-send-email-logang@deltatee.com>
 <1491974532.7236.43.camel@kernel.crashing.org>
 <5ac22496-56ec-025d-f153-140001d2a7f9@deltatee.com>
 <1492034124.7236.77.camel@kernel.crashing.org>
 <81888a1e-eb0d-cbbc-dc66-0a09c32e4ea2@deltatee.com>
 <20170413232631.GB24910@bhelgaas-glaptop.roam.corp.google.com>
 <20170414041656.GA30694@obsidianresearch.com>
 <1492169849.25766.3.camel@kernel.crashing.org>
 <630c1c63-ff17-1116-e069-2b8f93e50fa2@deltatee.com>
 <20170414190452.GA15679@bhelgaas-glaptop.roam.corp.google.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit

On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote:
> I'm a little hesitant about excluding offset support, so I'd like to
> hear more about this.
>
> Is the issue related to PCI BARs that are not completely addressable
> by the CPU?  If so, that sounds like a first-class issue that should
> be resolved up front, because I don't think the PCI core in general
> would deal well with that.
>
> If all the PCI memory of interest is in fact addressable by the CPU,
> I would think it would be pretty straightforward to support offsets --
> everywhere you currently use a PCI bus address, you just use the
> corresponding CPU physical address instead.

It's not *that* easy, sadly. The reason is that the normal dma_map APIs
assume the "target" of the DMA is system memory; there is no way to
pass them "another device" in a way that lets them apply the offsets
where needed.

That said, dma_map will already be problematic when doing p2p behind
the same bridge, because the iommu is not in the path, so you can't
use the arch-standard ops there.

So I assume the p2p code provides a way to address that too, via
special dma_ops, or wrappers?

Basically there are two very different ways you can do p2p: either
behind the same host bridge, or across two host bridges:

 - Behind the same host bridge, you don't go through the iommu, which
means that even if your target looks like a struct page, you can't
just use dma_map_* on it, because what you'll get back from that is an
iommu token, not a sibling BAR offset. Additionally, if you use the
target's struct resource address, you will need to offset it to get
back to the actual BAR value on the PCIe bus (see the sketch further
down).

 - Behind different host bridges, you go through the iommu and the
host remapping. I.e. that's actually the easy case. You can probably
just use the normal iommu path and normal dma mapping ops; you just
need to have your struct page represent the target device BAR *in
CPU space* this time. So no offsetting is required either.

The problem is that the latter, while seemingly easier, is also slower
and not supported by all platforms and architectures (for example,
POWER currently won't allow it, or rather only allows a store-only
subset of it under special circumstances).

So what people practically want to do is to have two devices behind a
switch DMA'ing to/from each other.

But that brings the two problems above.
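To make the offsetting point concrete, here is a minimal sketch
(illustrative only, not something taken from this thread or from the
RFC) of deriving the address a sibling device's DMA engine would have
to be programmed with, assuming only the stock PCI core helpers;
peer_bar_bus_addr(), "peer", "bar" and "offset" are invented names:

    #include <linux/pci.h>

    /*
     * Sketch: the bus address a sibling's DMA engine must use to
     * reach a peer BAR when both devices sit behind the same host
     * bridge and the IOMMU is not in the path.
     */
    static dma_addr_t peer_bar_bus_addr(struct pci_dev *peer, int bar,
                                        resource_size_t offset)
    {
            /*
             * pci_resource_start(peer, bar) is the CPU physical
             * address (the struct resource view); pci_bus_address()
             * returns the same BAR as seen on the PCI bus.  Where the
             * host bridge translates between the two, a DMA engine
             * needs the bus-side value, hence this conversion rather
             * than a dma_map_*() call.
             */
            return pci_bus_address(peer, bar) + offset;
    }

The point being that the value written into the initiating device's
descriptors would be a bus address computed from the peer's BAR, not
the dma_addr_t that an iommu-backed dma_map_page() would return.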
I don't fully understand how p2pmem "solves" that by creating struct
pages. The offset problem is one issue, but there's the iommu issue as
well: the driver cannot just use the normal dma_map ops.

I haven't had a chance to look at the details of the patches, but it's
not clear from the description in patch 0 how that is solved.
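As a purely hypothetical illustration of the kind of wrapper being
asked about earlier (not the RFC's actual mechanism), the mapping path
could branch on whether the target page is peer BAR memory behind the
same bridge; page_is_peer_bar() and page_to_peer_bus_addr() are
invented names that such code would have to provide:

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    /* Invented helpers; a real implementation would supply these. */
    bool page_is_peer_bar(struct page *page);
    dma_addr_t page_to_peer_bus_addr(struct page *page);

    /*
     * Hypothetical wrapper: same-bridge peer-to-peer pages bypass the
     * IOMMU/arch dma ops and get a direct bus-address translation;
     * ordinary system memory goes through the normal DMA API.
     */
    static dma_addr_t p2p_aware_map_page(struct device *dev,
                                         struct page *page,
                                         unsigned long offset,
                                         size_t size,
                                         enum dma_data_direction dir)
    {
            if (page_is_peer_bar(page))
                    return page_to_peer_bus_addr(page) + offset;

            return dma_map_page(dev, page, offset, size, dir);
    }

Whether the patches do something along these lines, swap in special
dma_ops per device, or handle it elsewhere is exactly what the
description in patch 0 leaves unclear.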
Cheers,
Ben.