From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1492034124.7236.77.camel@kernel.crashing.org>
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
    "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe, Steve Wise,
    Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
Cc: linux-pci@vger.kernel.org, linux-scsi@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
    linux-nvdimm@ml01.01.org, linux-kernel@vger.kernel.org, Jerome Glisse
Date: Thu, 13 Apr 2017 07:55:24 +1000
In-Reply-To: <5ac22496-56ec-025d-f153-140001d2a7f9@deltatee.com>
References: <1490911959-5146-1-git-send-email-logang@deltatee.com>
    <1491974532.7236.43.camel@kernel.crashing.org>
    <5ac22496-56ec-025d-f153-140001d2a7f9@deltatee.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailer: Evolution 3.22.6 (3.22.6-1.fc25)

On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote:
> 
> > Do you handle funky address translation too ? IE. the fact that the PCI
> > addresses aren't the same as the CPU physical addresses for a BAR ?
> 
> No, we use the CPU physical address of the BAR. If it's not mapped that
> way we can't use it.

Ok, you need to fix that or a bunch of architectures won't work.

Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
will perform the conversion between the struct resource content (CPU
physical address) and the actual PCI bus side address.

When behind the same switch you need to use PCI addresses. If one tries
later to do P2P between host bridges (via the CPU fabric) things get
more complex and one will have to use either CPU addresses or something
else altogether (probably would have to teach the arch DMA mapping
routines to work with those struct pages you create and return the
right thing).
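A rough sketch of that conversion, for illustration only: the pcibios_*
helpers and struct pci_bus_region are the existing kernel interfaces,
while the wrapper function, its name and its use of pdev->resource[bar]
are made up for the example.

#include <linux/pci.h>

/*
 * Sketch: translate a BAR's CPU physical address (what its struct
 * resource holds) into the address that appears on the PCI bus side,
 * i.e. what a peer behind the same host bridge must DMA to.
 */
static pci_bus_addr_t example_bar_bus_addr(struct pci_dev *pdev, int bar)
{
        struct pci_bus_region region;

        pcibios_resource_to_bus(pdev->bus, &region, &pdev->resource[bar]);

        return region.start;
}

(pcibios_bus_to_resource() goes the other way, filling in a struct
resource from a struct pci_bus_region.)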
> > > This will mean many setups that could likely
> > > work well will not be supported so that we can be more confident it
> > > will work and not place any responsibility on the user to understand
> > > their topology. (We've chosen to go this route based on feedback we
> > > received at LSF).
> > > 
> > > In order to enable this functionality we introduce a new p2pmem device
> > > which can be instantiated by PCI drivers. The device will register some
> > > PCI memory as ZONE_DEVICE and provide a genalloc based allocator for
> > > users of these devices to get buffers.
> > 
> > I don't completely understand this. This is actual memory on the PCI
> > bus ? Where does it come from ? Or are you just trying to create struct
> > pages that cover your PCIe DMA target ?
> 
> Yes, the memory is on the PCI bus in a BAR. For now we have a special
> PCI card for this, but in the future it would likely be the CMB in an
> NVMe card. These patches create struct pages to map these BAR addresses
> using ZONE_DEVICE.

Ok.

So ideally we'd want things like dma_map_* to be able to be fed those
struct pages and do the right thing which is ... tricky, especially
with the address translation I mentioned since the address will be
different whether the initiator is on the same host bridge as the
target or not.
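To make that concrete, this is the sort of call an initiator driver
would end up making, assuming 'page' is one of the ZONE_DEVICE pages
created over the exporter's BAR. dma_map_page()/dma_mapping_error()
are the normal DMA API; the function and names around them are
invented for the sketch.

#include <linux/dma-mapping.h>

/*
 * Sketch: map one BAR-backed ZONE_DEVICE page for DMA by 'initiator'.
 * The existing arch implementations treat the page like ordinary RAM
 * and start from its CPU physical address, which is only right where
 * bus and CPU addresses for the BAR happen to coincide -- exactly the
 * translation problem discussed above.
 */
static dma_addr_t example_map_p2p_page(struct device *initiator,
                                       struct page *page, size_t len)
{
        dma_addr_t addr;

        addr = dma_map_page(initiator, page, 0, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(initiator, addr))
                return 0;       /* error handling elided in this sketch */

        return addr;
}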
> > So correct me if I'm wrong, you are trying to create struct page's that
> > map a PCIe BAR right ? I'm trying to understand how that interacts with
> > what Jerome is doing for HMM.
> 
> Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
> is. These patches use the existing API with no modifications. As I
> understand it, HMM was using ZONE_DEVICE in a way that was quite
> different to how it was originally designed.

Sort-of. I don't see why there would be a conflict with the struct
pages use though. Jerome can you chime in ? Jerome: It looks like they
are just laying out struct page over a BAR which is the same thing I
think you should do when the BAR is "large enough" for the GPU memory.

The case where HMM uses "holes" in the address space for its struct
page is somewhat orthogonal but I also see no conflict here.
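For reference, "laying out struct page over a BAR" boils down to
something like the sketch below, assuming the four-argument
devm_memremap_pages(dev, res, ref, altmap) form of the interface.
devm_memremap_pages(), percpu_ref_init() and percpu_ref_exit() are the
real interfaces; the function names, the BAR choice and the glossed-over
alignment/lifetime handling are invented for illustration.

#include <linux/pci.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/err.h>

static void example_p2pmem_ref_release(struct percpu_ref *ref)
{
        /* last reference dropped; nothing to clean up in this sketch */
}

/*
 * Sketch: create ZONE_DEVICE struct pages covering a BAR.  The BAR's
 * struct resource already holds its CPU physical range.
 */
static int example_p2pmem_register(struct pci_dev *pdev, int bar,
                                   struct percpu_ref *ref)
{
        void *addr;
        int rc;

        rc = percpu_ref_init(ref, example_p2pmem_ref_release, 0, GFP_KERNEL);
        if (rc)
                return rc;

        addr = devm_memremap_pages(&pdev->dev, &pdev->resource[bar],
                                   ref, NULL);
        if (IS_ERR(addr)) {
                percpu_ref_exit(ref);
                return PTR_ERR(addr);
        }

        return 0;
}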
> > The reason is that the HMM currently creates the struct pages with
> > "fake" PFNs pointing to a hole in the address space rather than
> > covering the actual PCIe memory of the GPU. He does that to deal with
> > the fact that some GPUs have a smaller aperture on PCIe than their
> > total memory.
> 
> I'm aware of what HMM is trying to do and although I'm not familiar with
> the intimate details, I saw it as fairly orthogonal to what we are
> attempting to do.

Right.

> > However, I have asked him to only apply that policy if the aperture is
> > indeed smaller, and if not, create struct pages that directly cover the
> > PCIe BAR of the GPU instead, which will work better on systems or
> > architectures that don't have a "pinhole" window limitation.
> > However he was under the impression that this was going to collide with
> > what you guys are doing, so I'm trying to understand how.
> 
> I'm not sure I understand how either. However, I suspect if you collide
> with these patches then you'd also be breaking dax too.

Possibly but as I said, I don't see why so I'll let Jerome chime in
since he was under the impression that there was a conflict here :-)

Cheers,
Ben.