* [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 17:47 [RFC PATCH 0/5] Device peer to peer (p2p) through vma jglisse
@ 2019-01-29 17:47 ` jglisse
2019-01-29 18:24 ` Logan Gunthorpe
2019-01-29 19:56 ` Alex Deucher
2019-01-29 17:47 ` [RFC PATCH 2/5] drivers/base: " jglisse
` (3 subsequent siblings)
4 siblings, 2 replies; 95+ messages in thread
From: jglisse @ 2019-01-29 17:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Jérôme Glisse, Logan Gunthorpe,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, Jason Gunthorpe, linux-pci,
dri-devel, Christoph Hellwig, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
From: Jérôme Glisse <jglisse@redhat.com>
pci_test_p2p() returns true if two devices can peer to peer to
each other. We add a generic function as different interconnects
can support peer to peer and we want to generically test this no
matter what the interconnect might be. However this version only
supports PCIe for now.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
drivers/pci/p2pdma.c | 27 +++++++++++++++++++++++++++
include/linux/pci-p2pdma.h | 6 ++++++
2 files changed, 33 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index c52298d76e64..620ac60babb5 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -797,3 +797,30 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
return sprintf(page, "%s\n", pci_name(p2p_dev));
}
EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+bool pci_test_p2p(struct device *devA, struct device *devB)
+{
+ struct pci_dev *pciA, *pciB;
+ bool ret;
+ int tmp;
+
+ /*
+ * For now we only support PCIe peer to peer but other interconnects
+ * can be added.
+ */
+ pciA = find_parent_pci_dev(devA);
+ pciB = find_parent_pci_dev(devB);
+ if (pciA == NULL || pciB == NULL) {
+ ret = false;
+ goto out;
+ }
+
+ tmp = upstream_bridge_distance(pciA, pciB, NULL);
+ ret = tmp < 0 ? false : true;
+
+out:
+ pci_dev_put(pciB);
+ pci_dev_put(pciA);
+ return false;
+}
+EXPORT_SYMBOL_GPL(pci_test_p2p);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index bca9bc3e5be7..7671cc499a08 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,7 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
+bool pci_test_p2p(struct device *devA, struct device *devB);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -97,6 +98,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
+
+static inline bool pci_test_p2p(struct device *devA, struct device *devB)
+{
+ return false;
+}
#endif /* CONFIG_PCI_P2PDMA */
--
2.17.2
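As a usage illustration only (the driver, function, and device names
below are hypothetical, not part of this series), an importing driver
could gate its p2p setup on this helper and fall back to main memory
otherwise:

	/* hypothetical importer: probe for a p2p path before using it */
	static int mydrv_attach_peer(struct device *importer,
				     struct device *exporter)
	{
		if (!pci_test_p2p(importer, exporter))
			return -ENODEV;	/* no p2p path; use main memory */

		/* a usable PCIe path exists between the two devices, so
		 * the driver can go on to set up its p2p mapping here
		 */
		return 0;
	}

Note that, as posted, the helper always returns false (see the review
below), so such a caller would always take the fallback path.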
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 17:47 ` [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability jglisse
@ 2019-01-29 18:24 ` Logan Gunthorpe
2019-01-29 19:44 ` Greg Kroah-Hartman
2019-01-29 19:56 ` Alex Deucher
1 sibling, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 18:24 UTC (permalink / raw)
To: jglisse, linux-mm
Cc: linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> +bool pci_test_p2p(struct device *devA, struct device *devB)
> +{
> + struct pci_dev *pciA, *pciB;
> + bool ret;
> + int tmp;
> +
> + /*
> + * For now we only support PCIe peer to peer but other interconnects
> + * can be added.
> + */
> + pciA = find_parent_pci_dev(devA);
> + pciB = find_parent_pci_dev(devB);
> + if (pciA == NULL || pciB == NULL) {
> + ret = false;
> + goto out;
> + }
> +
> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> + ret = tmp < 0 ? false : true;
> +
> +out:
> + pci_dev_put(pciB);
> + pci_dev_put(pciA);
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_test_p2p);
This function only ever returns false....
Logan
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 18:24 ` Logan Gunthorpe
@ 2019-01-29 19:44 ` Greg Kroah-Hartman
2019-01-29 19:53 ` Jerome Glisse
2019-01-29 20:44 ` Logan Gunthorpe
0 siblings, 2 replies; 95+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-29 19:44 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Joerg Roedel, Rafael J . Wysocki, linux-pci, Felix Kuehling,
linux-kernel, dri-devel, Christoph Hellwig, linux-mm, jglisse,
Jason Gunthorpe, Bjorn Helgaas, iommu, Robin Murphy,
Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > +bool pci_test_p2p(struct device *devA, struct device *devB)
> > +{
> > + struct pci_dev *pciA, *pciB;
> > + bool ret;
> > + int tmp;
> > +
> > + /*
> > + * For now we only support PCIe peer to peer but other interconnects
> > + * can be added.
> > + */
> > + pciA = find_parent_pci_dev(devA);
> > + pciB = find_parent_pci_dev(devB);
> > + if (pciA == NULL || pciB == NULL) {
> > + ret = false;
> > + goto out;
> > + }
> > +
> > + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> > + ret = tmp < 0 ? false : true;
> > +
> > +out:
> > + pci_dev_put(pciB);
> > + pci_dev_put(pciA);
> > + return false;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_test_p2p);
>
> This function only ever returns false....
I guess it was never actually tested :(
I feel really worried about passing random 'struct device' pointers into
the PCI layer. Are we _sure_ it can handle this properly?
thanks,
greg k-h
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 19:44 ` Greg Kroah-Hartman
@ 2019-01-29 19:53 ` Jerome Glisse
2019-01-29 20:44 ` Logan Gunthorpe
1 sibling, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:53 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 08:44:26PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > > +bool pci_test_p2p(struct device *devA, struct device *devB)
> > > +{
> > > + struct pci_dev *pciA, *pciB;
> > > + bool ret;
> > > + int tmp;
> > > +
> > > + /*
> > > + * For now we only support PCIe peer to peer but other interconnects
> > > + * can be added.
> > > + */
> > > + pciA = find_parent_pci_dev(devA);
> > > + pciB = find_parent_pci_dev(devB);
> > > + if (pciA == NULL || pciB == NULL) {
> > > + ret = false;
> > > + goto out;
> > > + }
> > > +
> > > + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> > > + ret = tmp < 0 ? false : true;
> > > +
> > > +out:
> > > + pci_dev_put(pciB);
> > > + pci_dev_put(pciA);
> > > + return false;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_test_p2p);
> >
> > This function only ever returns false....
>
> I guess it was never actually tested :(
>
> I feel really worried about passing random 'struct device' pointers into
> the PCI layer. Are we _sure_ it can handle this properly?
>
Oh yes, I fixed it on the test rig and forgot to patch
my local git tree. My bad.
Cheers,
Jérôme
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 19:44 ` Greg Kroah-Hartman
2019-01-29 19:53 ` Jerome Glisse
@ 2019-01-29 20:44 ` Logan Gunthorpe
2019-01-29 21:00 ` Jerome Glisse
1 sibling, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 20:44 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: jglisse, linux-mm, linux-kernel, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
>>> +bool pci_test_p2p(struct device *devA, struct device *devB)
>>> +{
>>> + struct pci_dev *pciA, *pciB;
>>> + bool ret;
>>> + int tmp;
>>> +
>>> + /*
>>> + * For now we only support PCIe peer to peer but other interconnects
>>> + * can be added.
>>> + */
>>> + pciA = find_parent_pci_dev(devA);
>>> + pciB = find_parent_pci_dev(devB);
>>> + if (pciA == NULL || pciB == NULL) {
>>> + ret = false;
>>> + goto out;
>>> + }
>>> +
>>> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
>>> + ret = tmp < 0 ? false : true;
>>> +
>>> +out:
>>> + pci_dev_put(pciB);
>>> + pci_dev_put(pciA);
>>> + return false;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pci_test_p2p);
>>
>> This function only ever returns false....
>
> I guess it was never actually tested :(
>
> I feel really worried about passing random 'struct device' pointers into
> the PCI layer. Are we _sure_ it can handle this properly?
Yes, there are a couple of pci_p2pdma functions that take struct devices
directly simply because it's way more convenient for the caller. That's
what find_parent_pci_dev() takes care of (it returns NULL if the device
is not a PCI device). Whether that's appropriate here is hard to say,
seeing as we haven't seen any caller code.
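For reference, the existing helper walks up the device parent chain
until it finds a PCI device; roughly (paraphrased from
drivers/pci/p2pdma.c, not from this patch):

	static struct pci_dev *find_parent_pci_dev(struct device *dev)
	{
		struct device *parent;

		dev = get_device(dev);
		while (dev) {
			/* stop at the first PCI ancestor */
			if (dev_is_pci(dev))
				return to_pci_dev(dev);
			parent = get_device(dev->parent);
			put_device(dev);
			dev = parent;
		}
		return NULL;	/* no PCI device in the ancestry */
	}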
Logan
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 20:44 ` Logan Gunthorpe
@ 2019-01-29 21:00 ` Jerome Glisse
0 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 21:00 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, dri-devel, Christoph Hellwig,
linux-mm, iommu, Jason Gunthorpe, linux-pci, Bjorn Helgaas,
Robin Murphy, Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 01:44:09PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
> > On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> >>> +bool pci_test_p2p(struct device *devA, struct device *devB)
> >>> +{
> >>> + struct pci_dev *pciA, *pciB;
> >>> + bool ret;
> >>> + int tmp;
> >>> +
> >>> + /*
> >>> + * For now we only support PCIe peer to peer but other interconnects
> >>> + * can be added.
> >>> + */
> >>> + pciA = find_parent_pci_dev(devA);
> >>> + pciB = find_parent_pci_dev(devB);
> >>> + if (pciA == NULL || pciB == NULL) {
> >>> + ret = false;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> >>> + ret = tmp < 0 ? false : true;
> >>> +
> >>> +out:
> >>> + pci_dev_put(pciB);
> >>> + pci_dev_put(pciA);
> >>> + return false;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(pci_test_p2p);
> >>
> >> This function only ever returns false....
> >
> > I guess it was never actually tested :(
> >
> > I feel really worried about passing random 'struct device' pointers into
> > the PCI layer. Are we _sure_ it can handle this properly?
>
> Yes, there are a couple of pci_p2pdma functions that take struct devices
> directly simply because it's way more convenient for the caller. That's
> what find_parent_pci_dev() takes care of (it returns NULL if the device
> is not a PCI device). Whether that's appropriate here is hard to say,
> seeing as we haven't seen any caller code.
Caller code as a reference (I already gave this link in another part of
the thread, but just so that people don't have to follow all the branches):
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a567696eafb1d4faf7054ab0d7c3a16a5ef06
Cheers,
Jérôme
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 17:47 ` [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability jglisse
2019-01-29 18:24 ` Logan Gunthorpe
@ 2019-01-29 19:56 ` Alex Deucher
2019-01-29 20:00 ` Jerome Glisse
2019-01-29 20:24 ` Logan Gunthorpe
1 sibling, 2 replies; 95+ messages in thread
From: Alex Deucher @ 2019-01-29 19:56 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, Joerg Roedel, Rafael J . Wysocki,
Greg Kroah-Hartman, Felix Kuehling, LKML,
Maling list - DRI developers, Christian Koenig, linux-mm, iommu,
Jason Gunthorpe, Linux PCI, Bjorn Helgaas, Robin Murphy,
Christoph Hellwig, Marek Szyprowski
On Tue, Jan 29, 2019 at 12:47 PM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> pci_test_p2p() returns true if two devices can peer to peer to
> each other. We add a generic function as different interconnects
> can support peer to peer and we want to generically test this no
> matter what the interconnect might be. However this version only
> supports PCIe for now.
>
What about something like these patches:
https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
They are a bit more thorough.
Alex
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: iommu@lists.linux-foundation.org
> ---
> drivers/pci/p2pdma.c | 27 +++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 6 ++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index c52298d76e64..620ac60babb5 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -797,3 +797,30 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> return sprintf(page, "%s\n", pci_name(p2p_dev));
> }
> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> +
> +bool pci_test_p2p(struct device *devA, struct device *devB)
> +{
> + struct pci_dev *pciA, *pciB;
> + bool ret;
> + int tmp;
> +
> + /*
> + * For now we only support PCIe peer to peer but other interconnects
> + * can be added.
> + */
> + pciA = find_parent_pci_dev(devA);
> + pciB = find_parent_pci_dev(devB);
> + if (pciA == NULL || pciB == NULL) {
> + ret = false;
> + goto out;
> + }
> +
> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> + ret = tmp < 0 ? false : true;
> +
> +out:
> + pci_dev_put(pciB);
> + pci_dev_put(pciA);
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_test_p2p);
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index bca9bc3e5be7..7671cc499a08 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -36,6 +36,7 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
> bool *use_p2pdma);
> ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> bool use_p2pdma);
> +bool pci_test_p2p(struct device *devA, struct device *devB);
> #else /* CONFIG_PCI_P2PDMA */
> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
> size_t size, u64 offset)
> @@ -97,6 +98,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
> {
> return sprintf(page, "none\n");
> }
> +
> +static inline bool pci_test_p2p(struct device *devA, struct device *devB)
> +{
> + return false;
> +}
> #endif /* CONFIG_PCI_P2PDMA */
>
>
> --
> 2.17.2
>
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 19:56 ` Alex Deucher
@ 2019-01-29 20:00 ` Jerome Glisse
2019-01-29 20:24 ` Logan Gunthorpe
1 sibling, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 20:00 UTC (permalink / raw)
To: Alex Deucher
Cc: linux-mm, Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, LKML, Maling list - DRI developers,
Christoph Hellwig, iommu, Jason Gunthorpe, Linux PCI,
Bjorn Helgaas, Robin Murphy, Logan Gunthorpe, Christian Koenig,
Marek Szyprowski
On Tue, Jan 29, 2019 at 02:56:38PM -0500, Alex Deucher wrote:
> On Tue, Jan 29, 2019 at 12:47 PM <jglisse@redhat.com> wrote:
> >
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > pci_test_p2p() returns true if two devices can peer to peer to
> > each other. We add a generic function as different interconnects
> > can support peer to peer and we want to generically test this no
> > matter what the interconnect might be. However this version only
> > supports PCIe for now.
> >
>
> What about something like these patches:
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
> They are a bit more thorough.
Yes, that would be better; I forgot about those. I can include them
next time I post. Thank you for reminding me about them :)
Cheers,
Jérôme
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 19:56 ` Alex Deucher
2019-01-29 20:00 ` Jerome Glisse
@ 2019-01-29 20:24 ` Logan Gunthorpe
2019-01-29 21:28 ` Alex Deucher
2019-01-30 10:25 ` Christian König
1 sibling, 2 replies; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 20:24 UTC (permalink / raw)
To: Alex Deucher, Jerome Glisse
Cc: linux-mm, Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, LKML, Maling list - DRI developers,
Christoph Hellwig, iommu, Jason Gunthorpe, Linux PCI,
Bjorn Helgaas, Robin Murphy, Christian Koenig, Marek Szyprowski
On 2019-01-29 12:56 p.m., Alex Deucher wrote:
> On Tue, Jan 29, 2019 at 12:47 PM <jglisse@redhat.com> wrote:
>>
>> From: Jérôme Glisse <jglisse@redhat.com>
>>
>> pci_test_p2p() returns true if two devices can peer to peer to
>> each other. We add a generic function as different interconnects
>> can support peer to peer and we want to generically test this no
>> matter what the interconnect might be. However this version only
>> supports PCIe for now.
>>
>
> What about something like these patches:
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
> They are a bit more thorough.
Those new functions seem to have a lot of overlap with the code that is
already upstream in p2pdma.... Perhaps you should be improving the
p2pdma functions if they aren't suitable for what you want already
instead of creating new ones.
Logan
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 20:24 ` Logan Gunthorpe
@ 2019-01-29 21:28 ` Alex Deucher
2019-01-30 10:25 ` Christian König
1 sibling, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2019-01-29 21:28 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jerome Glisse, linux-mm, Joerg Roedel, Rafael J . Wysocki,
Greg Kroah-Hartman, Felix Kuehling, LKML,
Maling list - DRI developers, Christoph Hellwig, iommu,
Jason Gunthorpe, Linux PCI, Bjorn Helgaas, Robin Murphy,
Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 3:25 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2019-01-29 12:56 p.m., Alex Deucher wrote:
> > On Tue, Jan 29, 2019 at 12:47 PM <jglisse@redhat.com> wrote:
> >>
> >> From: Jérôme Glisse <jglisse@redhat.com>
> >>
> >> pci_test_p2p() returns true if two devices can peer to peer to
> >> each other. We add a generic function as different interconnects
> >> can support peer to peer and we want to generically test this no
> >> matter what the interconnect might be. However this version only
> >> supports PCIe for now.
> >>
> >
> > What about something like these patches:
> > https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
> > https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
> > They are a bit more thorough.
>
> Those new functions seem to have a lot of overlap with the code that is
> already upstream in p2pdma.... Perhaps you should be improving the
> p2pdma functions if they aren't suitable for what you want already
> instead of creating new ones.
Could be. Those patches are pretty old. They probably need to be
rebased on the latest upstream p2p stuff.
Alex
* Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability
2019-01-29 20:24 ` Logan Gunthorpe
2019-01-29 21:28 ` Alex Deucher
@ 2019-01-30 10:25 ` Christian König
1 sibling, 0 replies; 95+ messages in thread
From: Christian König @ 2019-01-30 10:25 UTC (permalink / raw)
To: Logan Gunthorpe, Alex Deucher, Jerome Glisse
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, LKML, Maling list - DRI developers,
Christian Koenig, linux-mm, iommu, Jason Gunthorpe, Linux PCI,
Bjorn Helgaas, Robin Murphy, Christoph Hellwig, Marek Szyprowski
On 29.01.19 at 21:24, Logan Gunthorpe wrote:
>
> On 2019-01-29 12:56 p.m., Alex Deucher wrote:
>> On Tue, Jan 29, 2019 at 12:47 PM <jglisse@redhat.com> wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> pci_test_p2p() returns true if two devices can peer to peer to
>>> each other. We add a generic function as different interconnects
>>> can support peer to peer and we want to generically test this no
>>> matter what the interconnect might be. However this version only
>>> supports PCIe for now.
>>>
>> What about something like these patches:
>> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
>> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
>> They are a bit more thorough.
> Those new functions seem to have a lot of overlap with the code that is
> already upstream in p2pdma.... Perhaps you should be improving the
> p2pdma functions if they aren't suitable for what you want already
> instead of creating new ones.
Yeah, well, that's what I was suggesting from the very beginning :)
But I completely agree the existing functions should be improved instead
of adding new ones,
Christian.
>
> Logan
* [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability
2019-01-29 17:47 [RFC PATCH 0/5] Device peer to peer (p2p) through vma jglisse
2019-01-29 17:47 ` [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability jglisse
@ 2019-01-29 17:47 ` jglisse
2019-01-29 18:26 ` Logan Gunthorpe
2019-01-29 19:46 ` Greg Kroah-Hartman
2019-01-29 17:47 ` [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma jglisse
` (2 subsequent siblings)
4 siblings, 2 replies; 95+ messages in thread
From: jglisse @ 2019-01-29 17:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Jérôme Glisse, Logan Gunthorpe,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, Jason Gunthorpe, linux-pci,
dri-devel, Christoph Hellwig, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
From: Jérôme Glisse <jglisse@redhat.com>
device_test_p2p() returns true if two devices can peer to peer to
each other. We add a generic function as different interconnects
can support peer to peer and we want to generically test this no
matter what the interconnect might be. However this version only
supports PCIe for now.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
drivers/base/core.c | 20 ++++++++++++++++++++
include/linux/device.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0073b09bb99f..56023b00e108 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -26,6 +26,7 @@
#include <linux/netdevice.h>
#include <linux/sched/signal.h>
#include <linux/sysfs.h>
+#include <linux/pci-p2pdma.h>
#include "base.h"
#include "power/power.h"
@@ -3167,3 +3168,22 @@ void device_set_of_node_from_dev(struct device *dev, const struct device *dev2)
dev->of_node_reused = true;
}
EXPORT_SYMBOL_GPL(device_set_of_node_from_dev);
+
+/**
+ * device_test_p2p - test if two devices can peer to peer to each other
+ * @devA: device A
+ * @devB: device B
+ * Returns: true if the devices can peer to peer to each other, false otherwise
+ */
+bool device_test_p2p(struct device *devA, struct device *devB)
+{
+ /*
+ * For now we only support PCIe peer to peer but other interconnects
+ * can be added.
+ */
+ if (pci_test_p2p(devA, devB))
+ return true;
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(device_test_p2p);
diff --git a/include/linux/device.h b/include/linux/device.h
index 6cb4640b6160..0d532d7f0779 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1250,6 +1250,7 @@ extern int device_online(struct device *dev);
extern void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
extern void set_secondary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
void device_set_of_node_from_dev(struct device *dev, const struct device *dev2);
+bool device_test_p2p(struct device *devA, struct device *devB);
static inline int dev_num_vf(struct device *dev)
{
--
2.17.2
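As an illustration of the intended extensibility (the second
interconnect and its helper are hypothetical, not part of this series),
another bus type would slot into the same dispatcher as an additional
probe:

	bool device_test_p2p(struct device *devA, struct device *devB)
	{
		if (pci_test_p2p(devA, devB))
			return true;

		/* a hypothetical future interconnect, e.g. a coherent
		 * fabric, would add its own test here:
		 *
		 *	if (xbus_test_p2p(devA, devB))
		 *		return true;
		 */
		return false;
	}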
* Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability
2019-01-29 17:47 ` [RFC PATCH 2/5] drivers/base: " jglisse
@ 2019-01-29 18:26 ` Logan Gunthorpe
2019-01-29 19:54 ` Jerome Glisse
2019-01-29 19:46 ` Greg Kroah-Hartman
1 sibling, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 18:26 UTC (permalink / raw)
To: jglisse, linux-mm
Cc: linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
>
> device_test_p2p() returns true if two devices can peer to peer to
> each other. We add a generic function as different interconnects
> can support peer to peer and we want to generically test this no
> matter what the interconnect might be. However this version only
> supports PCIe for now.
This doesn't appear to be used in any of the further patches, so it's
very confusing.
I'm not sure a struct device wrapper is really necessary...
Logan
* Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability
2019-01-29 18:26 ` Logan Gunthorpe
@ 2019-01-29 19:54 ` Jerome Glisse
0 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:54 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 11:26:01AM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > device_test_p2p() returns true if two devices can peer to peer to
> > each other. We add a generic function as different interconnects
> > can support peer to peer and we want to generically test this no
> > matter what the interconnect might be. However this version only
> > supports PCIe for now.
>
> This doesn't appear to be used in any of the further patches, so it's
> very confusing.
>
> I'm not sure a struct device wrapper is really necessary...
I wanted to allow other non-PCI devices to join in the fun, but yes,
right now I have only been doing this on PCI devices.
Jérôme
* Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability
2019-01-29 17:47 ` [RFC PATCH 2/5] drivers/base: " jglisse
2019-01-29 18:26 ` Logan Gunthorpe
@ 2019-01-29 19:46 ` Greg Kroah-Hartman
2019-01-29 19:56 ` Jerome Glisse
1 sibling, 1 reply; 95+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-29 19:46 UTC (permalink / raw)
To: jglisse
Cc: linux-mm, linux-kernel, Logan Gunthorpe, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 12:47:25PM -0500, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
>
> device_test_p2p() returns true if two devices can peer to peer to
> each other. We add a generic function as different interconnects
> can support peer to peer and we want to generically test this no
> matter what the interconnect might be. However this version only
> supports PCIe for now.
There is no definition of "peer to peer" in the driver/device model, so
why should this be in the driver core at all?
Especially as you only do this for PCI, why not just keep it in the PCI
layer, that way you _know_ you are dealing with the right pointer types
and there is no need to mess around with the driver core at all.
thanks,
greg k-h
* Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability
2019-01-29 19:46 ` Greg Kroah-Hartman
@ 2019-01-29 19:56 ` Jerome Glisse
0 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:56 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: linux-mm, linux-kernel, Logan Gunthorpe, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 08:46:05PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 12:47:25PM -0500, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > device_test_p2p() returns true if two devices can peer to peer to
> > each other. We add a generic function as different interconnects
> > can support peer to peer and we want to generically test this no
> > matter what the interconnect might be. However this version only
> > supports PCIe for now.
>
> There is no definition of "peer to peer" in the driver/device model, so
> why should this be in the driver core at all?
>
> Especially as you only do this for PCI, why not just keep it in the PCI
> layer, that way you _know_ you are dealing with the right pointer types
> and there is no need to mess around with the driver core at all.
OK, I will drop the core device change. I wanted to allow other non-PCI
devices to join later on (i.e. allow a PCI device to export to a non-PCI
device), but if that ever happens then we can update the PCI exporter at
the same time we introduce a non-PCI importer.
Cheers,
Jérôme
* [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 17:47 [RFC PATCH 0/5] Device peer to peer (p2p) through vma jglisse
2019-01-29 17:47 ` [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability jglisse
2019-01-29 17:47 ` [RFC PATCH 2/5] drivers/base: " jglisse
@ 2019-01-29 17:47 ` jglisse
2019-01-29 18:36 ` Logan Gunthorpe
2019-01-29 17:47 ` [RFC PATCH 4/5] mm/hmm: add support for peer to peer to HMM device memory jglisse
2019-01-29 17:47 ` [RFC PATCH 5/5] mm/hmm: add support for peer to peer to special device vma jglisse
4 siblings, 1 reply; 95+ messages in thread
From: jglisse @ 2019-01-29 17:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Jérôme Glisse, Logan Gunthorpe,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, Jason Gunthorpe, linux-pci,
dri-devel, Christoph Hellwig, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
From: Jérôme Glisse <jglisse@redhat.com>
Allow mmap of a device file to export device memory to peer to peer
devices. This will allow, for instance, a network device to access
GPU memory or to access a storage device queue directly.
The common case will be a vma created by a userspace device driver
that is then shared with another userspace device driver, which calls
into its kernel device driver to map that vma.
The vma does not need to have any valid CPU mapping, so that only
peer to peer devices might access its content. Or it could have a
valid CPU mapping too; in that case it should point to the same memory
for consistency.
Note that peer to peer mapping is highly platform and device
dependent and it might not work in all cases. However we do expect
support for this to grow on more hardware platforms.
This patch only adds the new callbacks to vm_operations_struct; the
bulk of the code lies within the common bus driver (like pci) and the
device drivers (both the exporting and the importing device).
The current design mandates that the importer must obey mmu_notifier
and invalidate any peer to peer mapping anytime a notification of
invalidation happens for a range that has been peer to peer mapped.
This allows the exporting device to easily invalidate mappings for
any importer device.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
include/linux/mm.h | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..1bd60a90e575 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -429,6 +429,44 @@ struct vm_operations_struct {
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);
+ /*
+ * Optional for device drivers that want to allow peer to peer (p2p)
+ * mapping of their vma (which can be backed by some device memory) to
+ * another device.
+ *
+ * Note that the exporting device driver might not have mapped anything
+ * inside the vma for the CPU but might still want to allow a peer
+ * device to access the range of memory corresponding to a range in
+ * that vma.
+ *
+ * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE FOR A
+ * DEVICE, THEN FURTHER MAPPING OF THE SAME, IF THE VMA IS STILL VALID,
+ * SHOULD ALSO BE SUCCESSFUL. Following this rule allows the importing
+ * device to map once during setup and report any failure at that time
+ * to userspace. Further mapping of the same range might happen
+ * after mmu notifier invalidation over the range. The exporting device
+ * can use this to move things around (defrag BAR space for instance)
+ * or do other similar tasks.
+ *
+ * IMPORTERS MUST OBEY mmu_notifier NOTIFICATIONS AND CALL p2p_unmap()
+ * WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN HAPPEN AT ANY
+ * POINT IN TIME WITH NO LOCK HELD.
+ *
+ * In the functions below, the device argument is the importing device;
+ * the exporting device is the device to which the vma belongs.
+ */
+ long (*p2p_map)(struct vm_area_struct *vma,
+ struct device *device,
+ unsigned long start,
+ unsigned long end,
+ dma_addr_t *pa,
+ bool write);
+ long (*p2p_unmap)(struct vm_area_struct *vma,
+ struct device *device,
+ unsigned long start,
+ unsigned long end,
+ dma_addr_t *pa);
+
/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);
--
2.17.2
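To make the exporter side concrete, a minimal sketch of a hypothetical
driver wiring these callbacks into its mmap path (all names and the
empty policy check are illustrative assumptions; see the showcase link
later in the thread for the author's actual example):

	static long mydrv_p2p_map(struct vm_area_struct *vma,
				  struct device *importer,
				  unsigned long start, unsigned long end,
				  dma_addr_t *pa, bool write)
	{
		/* apply driver policy, then fill pa[] with bus addresses
		 * covering [start, end] that the importer can program
		 * into its DMA engine
		 */
		return -EINVAL;	/* placeholder for the real mapping code */
	}

	static long mydrv_p2p_unmap(struct vm_area_struct *vma,
				    struct device *importer,
				    unsigned long start, unsigned long end,
				    dma_addr_t *pa)
	{
		/* tear down whatever mydrv_p2p_map() set up */
		return 0;
	}

	static const struct vm_operations_struct mydrv_vm_ops = {
		.p2p_map	= mydrv_p2p_map,
		.p2p_unmap	= mydrv_p2p_unmap,
	};

	static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* no CPU mapping is required; only peers may access */
		vma->vm_ops = &mydrv_vm_ops;
		return 0;
	}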
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 17:47 ` [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma jglisse
@ 2019-01-29 18:36 ` Logan Gunthorpe
2019-01-29 19:11 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 18:36 UTC (permalink / raw)
To: jglisse, linux-mm
Cc: linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> + /*
> + * Optional for device drivers that want to allow peer to peer (p2p)
> + * mapping of their vma (which can be backed by some device memory) to
> + * another device.
> + *
> + * Note that the exporting device driver might not have mapped anything
> + * inside the vma for the CPU but might still want to allow a peer
> + * device to access the range of memory corresponding to a range in
> + * that vma.
> + *
> + * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE FOR A
> + * DEVICE, THEN FURTHER MAPPING OF THE SAME, IF THE VMA IS STILL VALID,
> + * SHOULD ALSO BE SUCCESSFUL. Following this rule allows the importing
> + * device to map once during setup and report any failure at that time
> + * to userspace. Further mapping of the same range might happen
> + * after mmu notifier invalidation over the range. The exporting device
> + * can use this to move things around (defrag BAR space for instance)
> + * or do other similar tasks.
> + *
> + * IMPORTERS MUST OBEY mmu_notifier NOTIFICATIONS AND CALL p2p_unmap()
> + * WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN HAPPEN AT ANY
> + * POINT IN TIME WITH NO LOCK HELD.
> + *
> + * In the functions below, the device argument is the importing device;
> + * the exporting device is the device to which the vma belongs.
> + */
> + long (*p2p_map)(struct vm_area_struct *vma,
> + struct device *device,
> + unsigned long start,
> + unsigned long end,
> + dma_addr_t *pa,
> + bool write);
> + long (*p2p_unmap)(struct vm_area_struct *vma,
> + struct device *device,
> + unsigned long start,
> + unsigned long end,
> + dma_addr_t *pa);
I don't understand why we need new p2p_[un]map function pointers for
this. In subsequent patches, they never appear to be set anywhere and
are only called by the HMM code. I'd have expected it to be called by
some core VMA code and set by HMM as that's what vm_operations_struct is
for.
But the code is all very confusing, hard to follow, and seems to be
missing significant chunks. So I'm not really sure what is going on.
Logan
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 18:36 ` Logan Gunthorpe
@ 2019-01-29 19:11 ` Jerome Glisse
2019-01-29 19:24 ` Logan Gunthorpe
2019-01-29 19:32 ` Jason Gunthorpe
0 siblings, 2 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:11 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
>
> > + /*
> > + * Optional for device drivers that want to allow peer to peer (p2p)
> > + * mapping of their vma (which can be backed by some device memory) to
> > + * another device.
> > + *
> > + * Note that the exporting device driver might not have mapped anything
> > + * inside the vma for the CPU but might still want to allow a peer
> > + * device to access the range of memory corresponding to a range in
> > + * that vma.
> > + *
> > + * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE FOR A
> > + * DEVICE, THEN FURTHER MAPPING OF THE SAME, IF THE VMA IS STILL VALID,
> > + * SHOULD ALSO BE SUCCESSFUL. Following this rule allows the importing
> > + * device to map once during setup and report any failure at that time
> > + * to userspace. Further mapping of the same range might happen
> > + * after mmu notifier invalidation over the range. The exporting device
> > + * can use this to move things around (defrag BAR space for instance)
> > + * or do other similar tasks.
> > + *
> > + * IMPORTERS MUST OBEY mmu_notifier NOTIFICATIONS AND CALL p2p_unmap()
> > + * WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN HAPPEN AT ANY
> > + * POINT IN TIME WITH NO LOCK HELD.
> > + *
> > + * In the functions below, the device argument is the importing device;
> > + * the exporting device is the device to which the vma belongs.
> > + */
> > + long (*p2p_map)(struct vm_area_struct *vma,
> > + struct device *device,
> > + unsigned long start,
> > + unsigned long end,
> > + dma_addr_t *pa,
> > + bool write);
> > + long (*p2p_unmap)(struct vm_area_struct *vma,
> > + struct device *device,
> > + unsigned long start,
> > + unsigned long end,
> > + dma_addr_t *pa);
>
> I don't understand why we need new p2p_[un]map function pointers for
> this. In subsequent patches, they never appear to be set anywhere and
> are only called by the HMM code. I'd have expected it to be called by
> some core VMA code and set by HMM as that's what vm_operations_struct is
> for.
>
> But the code is all very confusing, hard to follow, and seems to be
> missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd), where fd
comes from open("/dev/somedevicefile"). So it is set by the device
driver. HMM has nothing to do with this. It must be set by the device
driver's mmap callback (the mmap callback of struct file_operations).
For this patch you can completely ignore all the HMM patches. Maybe
posting this as 2 separate patchsets would make it clearer.
For instance, see [1] for how a non-HMM driver can export its memory
just by setting those callbacks. Note that a proper implementation of
this should also include some kind of driver policy on what to allow
to map and what not to allow ... All this is driver specific in any
case.
Cheers,
Jérôme
[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=wip-p2p-showcase&id=964214dcd4df96f296e2214042e8cfce135ae3d4
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:11 ` Jerome Glisse
@ 2019-01-29 19:24 ` Logan Gunthorpe
2019-01-29 19:44 ` Jerome Glisse
2019-01-29 19:32 ` Jason Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 19:24 UTC (permalink / raw)
To: Jerome Glisse
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
>>
>>> + /*
>>> + * Optional for device drivers that want to allow peer to peer (p2p)
>>> + * mapping of their vma (which can be backed by some device memory) to
>>> + * another device.
>>> + *
>>> + * Note that the exporting device driver might not have mapped anything
>>> + * inside the vma for the CPU but might still want to allow a peer
>>> + * device to access the range of memory corresponding to a range in
>>> + * that vma.
>>> + *
>>> + * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE FOR A
>>> + * DEVICE, THEN FURTHER MAPPING OF THE SAME, IF THE VMA IS STILL VALID,
>>> + * SHOULD ALSO BE SUCCESSFUL. Following this rule allows the importing
>>> + * device to map once during setup and report any failure at that time
>>> + * to userspace. Further mapping of the same range might happen
>>> + * after mmu notifier invalidation over the range. The exporting device
>>> + * can use this to move things around (defrag BAR space for instance)
>>> + * or do other similar tasks.
>>> + *
>>> + * IMPORTERS MUST OBEY mmu_notifier NOTIFICATIONS AND CALL p2p_unmap()
>>> + * WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN HAPPEN AT ANY
>>> + * POINT IN TIME WITH NO LOCK HELD.
>>> + *
>>> + * In the functions below, the device argument is the importing device;
>>> + * the exporting device is the device to which the vma belongs.
>>> + */
>>> + long (*p2p_map)(struct vm_area_struct *vma,
>>> + struct device *device,
>>> + unsigned long start,
>>> + unsigned long end,
>>> + dma_addr_t *pa,
>>> + bool write);
>>> + long (*p2p_unmap)(struct vm_area_struct *vma,
>>> + struct device *device,
>>> + unsigned long start,
>>> + unsigned long end,
>>> + dma_addr_t *pa);
>>
>> I don't understand why we need new p2p_[un]map function pointers for
>> this. In subsequent patches, they never appear to be set anywhere and
>> are only called by the HMM code. I'd have expected it to be called by
>> some core VMA code and set by HMM as that's what vm_operations_struct is
>> for.
>>
>> But the code is all very confusing, hard to follow, and seems to be
>> missing significant chunks. So I'm not really sure what is going on.
>
> It is set by the device driver when userspace does mmap(fd), where fd
> comes from open("/dev/somedevicefile"). So it is set by the device
> driver. HMM has nothing to do with this. It must be set by the device
> driver's mmap callback (the mmap callback of struct file_operations).
> For this patch you can completely ignore all the HMM patches. Maybe
> posting this as 2 separate patchsets would make it clearer.
>
> For instance, see [1] for how a non-HMM driver can export its memory
> just by setting those callbacks. Note that a proper implementation of
> this should also include some kind of driver policy on what to allow
> to map and what not to allow ... All this is driver specific in any
> case.
I'd suggest [1] should be a part of the patchset so we can actually see
a user of the stuff you're adding.
But it still doesn't explain everything as without the HMM code nothing
calls the new vm_ops. And there's still no callers for the p2p_test
functions you added. And I still don't understand why we need the new
vm_ops or who calls them and when. Why can't drivers use the existing
'fault' vm_op and call a new helper function to map p2p when appropriate
or a different helper function to map a large range in its mmap
operation? Just like regular mmap code...
Logan
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:24 ` Logan Gunthorpe
@ 2019-01-29 19:44 ` Jerome Glisse
2019-01-29 20:43 ` Logan Gunthorpe
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:44 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 12:24:04PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> >>
> >>> + /*
> >>> + * Optional for device drivers that want to allow peer to peer (p2p)
> >>> + * mapping of their vma (which can be backed by some device memory) to
> >>> + * another device.
> >>> + *
> >>> + * Note that the exporting device driver might not have mapped anything
> >>> + * inside the vma for the CPU but might still want to allow a peer
> >>> + * device to access the range of memory corresponding to a range in
> >>> + * that vma.
> >>> + *
> >>> + * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE FOR A
> >>> + * DEVICE, THEN FURTHER MAPPING OF THE SAME, IF THE VMA IS STILL VALID,
> >>> + * SHOULD ALSO BE SUCCESSFUL. Following this rule allows the importing
> >>> + * device to map once during setup and report any failure at that time
> >>> + * to userspace. Further mapping of the same range might happen
> >>> + * after mmu notifier invalidation over the range. The exporting device
> >>> + * can use this to move things around (defrag BAR space for instance)
> >>> + * or do other similar tasks.
> >>> + *
> >>> + * IMPORTERS MUST OBEY mmu_notifier NOTIFICATIONS AND CALL p2p_unmap()
> >>> + * WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN HAPPEN AT ANY
> >>> + * POINT IN TIME WITH NO LOCK HELD.
> >>> + *
> >>> + * In the functions below, the device argument is the importing device;
> >>> + * the exporting device is the device to which the vma belongs.
> >>> + */
> >>> + long (*p2p_map)(struct vm_area_struct *vma,
> >>> + struct device *device,
> >>> + unsigned long start,
> >>> + unsigned long end,
> >>> + dma_addr_t *pa,
> >>> + bool write);
> >>> + long (*p2p_unmap)(struct vm_area_struct *vma,
> >>> + struct device *device,
> >>> + unsigned long start,
> >>> + unsigned long end,
> >>> + dma_addr_t *pa);
> >>
> >> I don't understand why we need new p2p_[un]map function pointers for
> >> this. In subsequent patches, they never appear to be set anywhere and
> >> are only called by the HMM code. I'd have expected it to be called by
> >> some core VMA code and set by HMM as that's what vm_operations_struct is
> >> for.
> >>
> >> But the code is all very confusing, hard to follow, and seems to be
> >> missing significant chunks. So I'm not really sure what is going on.
> >
> > It is set by the device driver when userspace does mmap(fd), where fd
> > comes from open("/dev/somedevicefile"). So it is set by the device
> > driver. HMM has nothing to do with this. It must be set by the device
> > driver's mmap callback (the mmap callback of struct file_operations).
> > For this patch you can completely ignore all the HMM patches. Maybe
> > posting this as 2 separate patchsets would make it clearer.
> >
> > For instance, see [1] for how a non-HMM driver can export its memory
> > just by setting those callbacks. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what not to allow ... All this is driver specific in any
> > case.
>
> I'd suggest [1] should be a part of the patchset so we can actually see
> a user of the stuff you're adding.
I did not want to clutter the patchset with device driver specific usage
of this, as the API can be reasoned about in an abstract way.
>
> But it still doesn't explain everything as without the HMM code nothing
> calls the new vm_ops. And there's still no callers for the p2p_test
> functions you added. And I still don't understand why we need the new
> vm_ops or who calls them and when. Why can't drivers use the existing
> 'fault' vm_op and call a new helper function to map p2p when appropriate
> or a different helper function to map a large range in its mmap
> operation? Just like regular mmap code...
HMM code is just one user; if you have a driver that uses HMM mirror
then your driver gets support for this for free. If you do not want to
use HMM then you can directly call this in your driver.
The flow is: the device driver wants to set up some mapping for a range
of virtual addresses [va_start, va_end] (see the sketch below):
1 - Look up the vma for the range
2 - If the vma is a regular vma (not an mmap of a file) then use GUP.
If the vma is an mmap of a file and there is a p2p_map callback
then call p2p_map()
3 - Use either the result of GUP or of p2p_map() to program the
device
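A condensed sketch of that flow (locking, error handling, and the
variable setup are omitted; all names are illustrative):

	/* importer side: map [va_start, va_end] for importer_dev */
	vma = find_vma(mm, va_start);
	if (vma && vma->vm_file && vma->vm_ops && vma->vm_ops->p2p_map) {
		/* vma belongs to an exporting device file: ask it to
		 * map the range for the importing device
		 */
		ret = vma->vm_ops->p2p_map(vma, importer_dev, va_start,
					   va_end, dma_addrs, write);
	} else {
		/* regular vma: pin the pages with GUP */
		ret = get_user_pages(va_start,
				     (va_end - va_start) >> PAGE_SHIFT,
				     write ? FOLL_WRITE : 0, pages, NULL);
	}
	/* then program the device with dma_addrs or the pinned pages */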
The p2p test function is used by the device driver implementing the
callback; for instance, see:
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a567696eafb1d4faf7054ab0d7c3a16a5ef06
The vm_fault callback is not suited because here we are mapping to
another device, so this needs special handling: someone must end up
holding both device struct pointers, and someone must be allowed to
make the decision on what to allow and what not to allow.
Moreover, an exporting driver like a GPU driver might have a complex
policy in place under which it will only allow export of some memory
to a peer device but not other memory.
In the end it means that it is easier and simpler to add new callbacks
specifically for that, so the intention is clear to both the caller
and the callee. The exporting device can then do the proper check
using the core helper (i.e. checking that the devices can actually
peer to each other) and, if that works, it can then decide whether or
not it wants to allow this other device to access its memory or if it
prefers to use main memory for this.
To add this to the fault callback we would need to define a bunch of
new flags, set up fake page tables so that we can populate ptes, and
then have the importing device re-interpret everything specially
because it comes from another device. It would look ugly and it would
need to modify a bunch of core mm code.
Note that this callback solution also allows an exporting device to
permit peer access while the CPU cannot access the memory, i.e. the
pte entries for the range are pte_none.
Cheers,
Jérôme
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:44 ` Jerome Glisse
@ 2019-01-29 20:43 ` Logan Gunthorpe
2019-01-30 7:52 ` Christoph Hellwig
0 siblings, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 20:43 UTC (permalink / raw)
To: Jerome Glisse
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, Jason Gunthorpe,
linux-pci, dri-devel, Christoph Hellwig, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 12:44 p.m., Jerome Glisse wrote:
>> I'd suggest [1] should be a part of the patchset so we can actually see
>> a user of the stuff you're adding.
>
> I did not want to clutter the patchset with device driver specific usage
> of this, as the API can be reasoned about in an abstract way.
It's hard to reason about an interface when you can't see what all the
layers want to do with it. Most maintainers (I'd hope) would certainly
never merge code that has no callers, and for much the same reason, I'd
rather not review patches that don't have real use case examples.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:43 ` Logan Gunthorpe
@ 2019-01-30 7:52 ` Christoph Hellwig
0 siblings, 0 replies; 95+ messages in thread
From: Christoph Hellwig @ 2019-01-30 7:52 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jerome Glisse, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, Jason Gunthorpe, linux-pci, dri-devel,
Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
iommu
On Tue, Jan 29, 2019 at 01:43:02PM -0700, Logan Gunthorpe wrote:
> It's hard to reason about an interface when you can't see what all the
> layers want to do with it. Most maintainers (I'd hope) would certainly
> never merge code that has no callers, and for much the same reason, I'd
> rather not review patches that don't have real use case examples.
Yes, we should never review, never mind merge, code without users.
We had one example recently where this was not followed, which was HMM,
and that turned out to be a disaster.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:11 ` Jerome Glisse
2019-01-29 19:24 ` Logan Gunthorpe
@ 2019-01-29 19:32 ` Jason Gunthorpe
2019-01-29 19:50 ` Jerome Glisse
2019-01-29 20:39 ` Logan Gunthorpe
1 sibling, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-29 19:32 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> >
> > > + /*
> > > + * Optional for device driver that want to allow peer to peer (p2p)
> > > + * mapping of their vma (which can be back by some device memory) to
> > > + * another device.
> > > + *
> > > + * Note that the exporting device driver might not have map anything
> > > + * inside the vma for the CPU but might still want to allow a peer
> > > + * device to access the range of memory corresponding to a range in
> > > + * that vma.
> > > + *
> > > + * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > > + * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > > + * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > > + * device to map once during setup and report any failure at that time
> > > + * to the userspace. Further mapping of the same range might happen
> > > + * after mmu notifier invalidation over the range. The exporting device
> > > + * can use this to move things around (defrag BAR space for instance)
> > > + * or do other similar task.
> > > + *
> > > + * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > > + * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > > + * POINT IN TIME WITH NO LOCK HELD.
> > > + *
> > > + * In below function, the device argument is the importing device,
> > > + * the exporting device is the device to which the vma belongs.
> > > + */
> > > + long (*p2p_map)(struct vm_area_struct *vma,
> > > + struct device *device,
> > > + unsigned long start,
> > > + unsigned long end,
> > > + dma_addr_t *pa,
> > > + bool write);
> > > + long (*p2p_unmap)(struct vm_area_struct *vma,
> > > + struct device *device,
> > > + unsigned long start,
> > > + unsigned long end,
> > > + dma_addr_t *pa);
> >
> > I don't understand why we need new p2p_[un]map function pointers for
> > this. In subsequent patches, they never appear to be set anywhere and
> > are only called by the HMM code. I'd have expected it to be called by
> > some core VMA code and set by HMM as that's what vm_operations_struct is
> > for.
> >
> > But the code is all very confusing, hard to follow and seems to be
> > missing significant chunks. So I'm not really sure what is going on.
>
> It is set by the device driver when userspace does mmap(fd) where fd comes
> from open("/dev/somedevicefile"). So it is set by the device driver. HMM
> has nothing to do with this. It must be set by the device driver mmap
> callback (the mmap callback of struct file_operations). For this patch
> you can completely ignore all the HMM patches. Maybe posting this as
> 2 separate patchsets would make it clearer.
>
> For instance see [1] for how a non HMM driver can export its memory
> by just setting those callbacks. Note that a proper implementation of
> this should also include some kind of driver policy on what to allow
> to map and what not to allow ... All this is driver specific
> anyway.
I'm imagining that the RDMA drivers would use this interface on their
per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
this memory. Also the entire VFIO PCI BAR mmap would be good to cover
with this too.
Jerome, I think it would be nice to have a helper scheme - I think the
simple case would be simple remapping of PCI BAR memory, so if we
could have, say something like:
static const struct vm_operations_struct my_ops = {
	.p2p_map = p2p_ioremap_map_op,
	.p2p_unmap = p2p_ioremap_unmap_op,
};

struct ioremap_data {
	[..]
};

fops_mmap() {
	vma->vm_private_data = &driver_priv->ioremap_data;
	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}
Which closely matches at least what the RDMA drivers do, where
p2p_ioremap_device_memory populates the p2p_map and p2p_unmap pointers
with sensible functions, etc.

It looks like vfio would be able to use this as well (though I am
unsure why vfio uses remap_pfn_range instead of io_remap_pfn_range for
BAR memory..)
Do any drivers need more control than this?
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:32 ` Jason Gunthorpe
@ 2019-01-29 19:50 ` Jerome Glisse
2019-01-29 20:24 ` Jason Gunthorpe
2019-01-29 20:39 ` Logan Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 07:32:57PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> > >
> > >
> > > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > >
> > > > + /*
> > > > + * Optional for device driver that want to allow peer to peer (p2p)
> > > > + * mapping of their vma (which can be back by some device memory) to
> > > > + * another device.
> > > > + *
> > > > + * Note that the exporting device driver might not have map anything
> > > > + * inside the vma for the CPU but might still want to allow a peer
> > > > + * device to access the range of memory corresponding to a range in
> > > > + * that vma.
> > > > + *
> > > > + * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > > > + * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > > > + * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > > > + * device to map once during setup and report any failure at that time
> > > > + * to the userspace. Further mapping of the same range might happen
> > > > + * after mmu notifier invalidation over the range. The exporting device
> > > > + * can use this to move things around (defrag BAR space for instance)
> > > > + * or do other similar task.
> > > > + *
> > > > + * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > > > + * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > > > + * POINT IN TIME WITH NO LOCK HELD.
> > > > + *
> > > > + * In below function, the device argument is the importing device,
> > > > + * the exporting device is the device to which the vma belongs.
> > > > + */
> > > > + long (*p2p_map)(struct vm_area_struct *vma,
> > > > + struct device *device,
> > > > + unsigned long start,
> > > > + unsigned long end,
> > > > + dma_addr_t *pa,
> > > > + bool write);
> > > > + long (*p2p_unmap)(struct vm_area_struct *vma,
> > > > + struct device *device,
> > > > + unsigned long start,
> > > > + unsigned long end,
> > > > + dma_addr_t *pa);
> > >
> > > I don't understand why we need new p2p_[un]map function pointers for
> > > this. In subsequent patches, they never appear to be set anywhere and
> > > are only called by the HMM code. I'd have expected it to be called by
> > > some core VMA code and set by HMM as that's what vm_operations_struct is
> > > for.
> > >
> > > But the code is all very confusing, hard to follow and seems to be
> > > missing significant chunks. So I'm not really sure what is going on.
> >
> > It is set by the device driver when userspace does mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by the device driver. HMM
> > has nothing to do with this. It must be set by the device driver mmap
> > callback (the mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches. Maybe posting this as
> > 2 separate patchsets would make it clearer.
> >
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callbacks. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what not to allow ... All this is driver specific
> > anyway.
>
> I'm imagining that the RDMA drivers would use this interface on their
> per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
> this memory. Also the entire VFIO PCI BAR mmap would be good to cover
> with this too.
Correct, you would set those callbacks on the mmap of your doorbell.
>
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
>
> static const struct vm_operations_struct my_ops = {
> 	.p2p_map = p2p_ioremap_map_op,
> 	.p2p_unmap = p2p_ioremap_unmap_op,
> };
>
> struct ioremap_data {
> 	[..]
> };
>
> fops_mmap() {
> 	vma->vm_private_data = &driver_priv->ioremap_data;
> 	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }
>
> Which closely matches at least what the RDMA drivers do, where
> p2p_ioremap_device_memory populates the p2p_map and p2p_unmap pointers
> with sensible functions, etc.
>
> It looks like vfio would be able to use this as well (though I am
> unsure why vfio uses remap_pfn_range instead of io_remap_pfn_range for
> BAR memory..)
Yes, a simple helper that implements a sane default is definitely a
good idea. As I was working with GPUs it was not something that
immediately popped to mind (see below). But I can certainly do a sane
set of default helpers that simple device drivers can use right away
without too much thinking on their part. I will add this for the next
posting.
> Do any drivers need more control than this?
GPU drivers do want more control :) GPU drivers are moving things
around all the time and they have more memory than BAR space (on newer
platforms AMD GPUs can resize the BAR but it is not the rule for all
GPUs). So GPU drivers do actually manage their BAR address space and
they map and unmap things there. They can not allow someone to just
pin stuff there randomly or this would disrupt their regular work
flow. Hence they need control, and they might implement a threshold:
for instance if they have more than N pages of BAR space mapped for
peer to peer then they can decide to fall back to main memory for any
new peer mapping.
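A sketch of what such a threshold could look like inside the
exporter's p2p_map() callback (all gpu_* names and the limit are
hypothetical):

static long gpu_p2p_map(struct vm_area_struct *vma,
                        struct device *importer,
                        unsigned long start, unsigned long end,
                        dma_addr_t *pa, bool write)
{
    struct gpu_device *gpu = vma->vm_private_data;
    unsigned long npages = (end - start) >> PAGE_SHIFT;

    if (gpu->p2p_bar_pages + npages > GPU_P2P_BAR_LIMIT)
        /* Too much BAR space is already mapped for peer to peer:
         * migrate the object to main memory and map that instead. */
        return gpu_p2p_map_through_main_memory(vma, importer, start,
                                               end, pa, write);

    gpu->p2p_bar_pages += npages;
    return gpu_p2p_map_bar_pages(vma, importer, start, end, pa, write);
}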
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:50 ` Jerome Glisse
@ 2019-01-29 20:24 ` Jason Gunthorpe
2019-01-29 20:44 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-29 20:24 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
> GPU drivers do want more control :) GPU drivers are moving things
> around all the time and they have more memory than BAR space (on newer
> platforms AMD GPUs can resize the BAR but it is not the rule for all
> GPUs). So GPU drivers do actually manage their BAR address space and
> they map and unmap things there. They can not allow someone to just
> pin stuff there randomly or this would disrupt their regular work
> flow. Hence they need control, and they might implement a threshold:
> for instance if they have more than N pages of BAR space mapped for
> peer to peer then they can decide to fall back to main memory for any
> new peer mapping.
But this API doesn't seem to offer any control - I thought that
control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
I would think that the importing driver can assume the BAR page is
kept alive until it calls unmap (presumably triggered by notifiers)?
ie the exporting driver sees the BAR page as pinned until unmap.
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:24 ` Jason Gunthorpe
@ 2019-01-29 20:44 ` Jerome Glisse
2019-01-29 23:02 ` Jason Gunthorpe
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 20:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, dri-devel, Christoph Hellwig,
linux-mm, iommu, linux-pci, Bjorn Helgaas, Robin Murphy,
Logan Gunthorpe, Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 08:24:36PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
>
> > GPU drivers do want more control :) GPU drivers are moving things
> > around all the time and they have more memory than BAR space (on newer
> > platforms AMD GPUs can resize the BAR but it is not the rule for all
> > GPUs). So GPU drivers do actually manage their BAR address space and
> > they map and unmap things there. They can not allow someone to just
> > pin stuff there randomly or this would disrupt their regular work
> > flow. Hence they need control, and they might implement a threshold:
> > for instance if they have more than N pages of BAR space mapped for
> > peer to peer then they can decide to fall back to main memory for any
> > new peer mapping.
>
> But this API doesn't seem to offer any control - I thought that
> control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks. So
the driver implementation can refuse to map by returning an error on
p2p_map, or it can decide to use main memory by migrating its object
to main memory and populating the dma address array with
dma_map_page() of the main memory pages. A driver like a GPU driver
can have policy on top of that: for instance it
will only allow a p2p map to succeed for objects that have been tagged by
userspace in some way, ie the userspace application is in control of what
can be mapped to a peer device. This is needed for GPU drivers as we do
want userspace involvement in which objects are allowed to have p2p
access, and also so that we can report to userspace when we are
running out of BAR addresses for this to work as intended (ie not
falling back to main memory) so that the application can take
appropriate actions (like deciding what to prioritize).

For moving things around after a successful p2p_map, yes, the exporting
device has to call for instance zap_vma_ptes() or something
similar. This will trigger mmu notifier calls and the importing device
will invalidate its mapping. Once it is invalidated the exporting
device can point a new call of p2p_map (for the same range) to new
memory (obviously the exporting device has to synchronize any
concurrent call to p2p_map with the invalidation).
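For instance (hypothetical gpu_* names, locking elided), that eviction
path on the exporter could be sketched as:

static void gpu_evict_range(struct gpu_object *obj,
                            struct vm_area_struct *vma,
                            unsigned long start, unsigned long end)
{
    /* Tear down the CPU ptes; this fires the mmu notifiers, which in
     * turn make the importers call p2p_unmap() on the range. */
    zap_vma_ptes(vma, start, end - start);

    /* The BAR pages can now be reused; the next p2p_map() for the
     * same range can be pointed at new memory.  The driver must
     * synchronize concurrent p2p_map() calls with this. */
    gpu_reclaim_bar_pages(obj, start, end);
}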
>
> I would think that the importing driver can assume the BAR page is
> kept alive until it calls unmap (presumably triggered by notifiers)?
>
> ie the exporting driver sees the BAR page as pinned until unmap.
The intention with this patchset is that it is not pinned, ie the
importing device _must_ abide by all mmu notifier invalidations and
they can happen at any time. The importing device can however
re-p2p_map the same range after an invalidation.

I would like to restrict this to importers that can invalidate for
now, because I believe all the first devices to use this can support
invalidation.

Also when using HMM private device memory we _can not_ pin virtual
addresses to device memory, as otherwise CPU access would have to
SIGBUS or SEGFAULT and we do not want that. So this was also a
motivation to keep things consistent for the importer in both cases.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:44 ` Jerome Glisse
@ 2019-01-29 23:02 ` Jason Gunthorpe
2019-01-30 0:08 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-29 23:02 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> > But this API doesn't seem to offer any control - I thought that
> > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
>
> The control is within the driver implementation of those callbacks.
Seems like what you mean by control is 'the exporter gets to choose
the physical address at the instant of map' - which seems reasonable
for GPU.
> will only allow a p2p map to succeed for objects that have been tagged by
> userspace in some way, ie the userspace application is in control of what
> can be mapped to a peer device.
I would have thought this means the VMA for the object is created
without the map/unmap ops? Or are GPU objects and VMAs unrelated?
> For moving things around after a successful p2p_map, yes, the exporting
> device has to call for instance zap_vma_ptes() or something
> similar.
Okay, great, RDMA needs this flow for hotplug - we zap the VMA's when
unplugging the PCI device and we can delay the PCI unplug completion
until all the p2p_unmaps are called...
But in this case a future p2p_map will have to fail as the BAR no
longer exists. How to handle this?
> > I would think that the importing driver can assume the BAR page is
> > kept alive until it calls unmap (presumably triggered by notifiers)?
> >
> > ie the exporting driver sees the BAR page as pinned until unmap.
>
> The intention with this patchset is that it is not pinned, ie the
> importing device _must_ abide by all mmu notifier invalidations and
> they can happen at any time. The importing device can however
> re-p2p_map the same range after an invalidation.
>
> I would like to restrict this to importers that can invalidate for
> now, because I believe all the first devices to use this can support
> invalidation.
This seems reasonable (and sort of says importers not getting this
from HMM need careful checking), was this in the comment above the
ops?
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 23:02 ` Jason Gunthorpe
@ 2019-01-30 0:08 ` Jerome Glisse
2019-01-30 4:30 ` Jason Gunthorpe
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 0:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, dri-devel, Christoph Hellwig,
linux-mm, iommu, linux-pci, Bjorn Helgaas, Robin Murphy,
Logan Gunthorpe, Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
>
> > > But this API doesn't seem to offer any control - I thought that
> > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> >
> > The control is within the driver implementation of those callbacks.
>
> Seems like what you mean by control is 'the exporter gets to choose
> the physical address at the instant of map' - which seems reasonable
> for GPU.
>
>
> > will only allow a p2p map to succeed for objects that have been tagged by
> > userspace in some way, ie the userspace application is in control of what
> > can be mapped to a peer device.
>
> I would have thought this means the VMA for the object is created
> without the map/unmap ops? Or are GPU objects and VMAs unrelated?
GPU objects and VMAs are unrelated in all the open source GPU drivers
I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
object and never map it (and thus never have it associated with a
vma), and in fact this is very common. For graphics you usually only
have a handful of the hundreds of GPU objects of your application
actually mapped.

The control for peer to peer can also be a mutable property of the
object, ie userspace does an ioctl on the GPU driver which creates an
object; some time after the object is created userspace does other
ioctls to allow exporting the object to another specific device, which
again results in an ioctl to the device driver; those ioctls set flags
and update the GPU object kernel structure with all the info.

In the meantime you have no control over when another driver might
call the vma p2p callbacks. So you must have registered the vma with
vm_operations that include p2p_map and p2p_unmap. Those driver
functions will check the object kernel structure each time they get
called and act accordingly.
> > For moving things around after a successful p2p_map, yes, the exporting
> > device has to call for instance zap_vma_ptes() or something
> > similar.
>
> Okay, great, RDMA needs this flow for hotplug - we zap the VMA's when
> unplugging the PCI device and we can delay the PCI unplug completion
> until all the p2p_unmaps are called...
>
> But in this case a future p2p_map will have to fail as the BAR no
> longer exists. How to handle this?
So the comment above the callback (I should write more thorough
guidelines and documentation) states that the exporter should/(must?)
be predictable: if an importer device calls p2p_map() once on a vma
and it does succeed, then if the same device calls p2p_map() again on
the same vma, and the vma is still valid (ie no unmap and it does not
correspond to a different object ...), then p2p_map() should/(must?)
succeed.

The idea is that the importer would do a first call to p2p_map() when
it sets up its own object, and report failure to userspace if that
fails. If it does succeed then we should never have an issue the next
time we call p2p_map() (after the mapping has been invalidated by an
mmu notifier for instance). So it will succeed just like the first
call (again assuming the vma is still valid).

The idea is that we can only ask the exporter to be predictable and
still allow it to fail if things are really going bad.
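In C the importer's side of that contract could be sketched as follows
(importer_setup() and importer_invalidate() are invented names; the
ops are the ones from this patchset):

static int importer_setup(struct vm_area_struct *vma, struct device *dev,
                          unsigned long start, unsigned long end,
                          dma_addr_t *pa)
{
    long ret;

    /* Setup: the only point where failure is reported to userspace. */
    ret = vma->vm_ops->p2p_map(vma, dev, start, end, pa, true);
    return ret < 0 ? ret : 0;
}

/* Called from the importer's mmu notifier when the range is
 * invalidated; a later re-map of the still valid vma is expected to
 * succeed per the predictability rule above. */
static void importer_invalidate(struct vm_area_struct *vma,
                                struct device *dev,
                                unsigned long start, unsigned long end,
                                dma_addr_t *pa)
{
    vma->vm_ops->p2p_unmap(vma, dev, start, end, pa);
}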
> > > I would think that the importing driver can assume the BAR page is
> > > kept alive until it calls unmap (presumably triggered by notifiers)?
> > >
> > > ie the exporting driver sees the BAR page as pinned until unmap.
> >
> > The intention with this patchset is that it is not pinned, ie the
> > importing device _must_ abide by all mmu notifier invalidations and
> > they can happen at any time. The importing device can however
> > re-p2p_map the same range after an invalidation.
> >
> > I would like to restrict this to importers that can invalidate for
> > now, because I believe all the first devices to use this can support
> > invalidation.
>
> This seems reasonable (and sort of says importers not getting this
> from HMM need careful checking), was this in the comment above the
> ops?
I think I put it in the comment above the ops, but in any case I
should write something in the documentation with examples and thorough
guidelines.

Note that there won't be any mmu notifier callbacks on an mmap of a
device file unless the device driver asks for them or there is a
syscall like munmap or mremap or mprotect, well, any syscall that
works on the vma.

So assuming neither the application nor the driver is doing something
stupid, the result of p2p_map can stay valid until the importer is
done and calls p2p_unmap of its own free will. This is what I expect
for this. But for GPUs I would still like to allow the GPU driver to
evict (and thus invalidate the importer's mapping) to main memory or
to defragment its BAR address space if the GPU driver feels a pressing
need to do so.

If we ever want to support full pin then we might have to add a
flag so that the GPU driver can refuse an importer that wants things
pinned forever.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 0:08 ` Jerome Glisse
@ 2019-01-30 4:30 ` Jason Gunthorpe
2019-01-30 15:43 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 4:30 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> > On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> >
> > > > But this API doesn't seem to offer any control - I thought that
> > > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > >
> > > The control is within the driver implementation of those callbacks.
> >
> > Seems like what you mean by control is 'the exporter gets to choose
> > the physical address at the instant of map' - which seems reasonable
> > for GPU.
> >
> >
> > > will only allow a p2p map to succeed for objects that have been tagged by
> > > userspace in some way, ie the userspace application is in control of what
> > > can be mapped to a peer device.
> >
> > I would have thought this means the VMA for the object is created
> > without the map/unmap ops? Or are GPU objects and VMAs unrelated?
>
> GPU objects and VMAs are unrelated in all the open source GPU drivers
> I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
> object and never map it (and thus never have it associated with a
> vma), and in fact this is very common. For graphics you usually only
> have a handful of the hundreds of GPU objects of your application
> actually mapped.
I mean the other way: does every VMA with a p2p_map/unmap point to
exactly one GPU object?

ie I'm surprised you say that p2p_map needs to have policy; I would
have thought the policy is applied when the VMA is created (ie objects
that are not for p2p do not have p2p_map set), and even for GPUs
p2p_map should really only have to do with window allocation and pure
'can I even do p2p' type functionality.
> The idea is that we can only ask the exporter to be predictable and
> still allow it to fail if things are really going bad.
I think hot unplug / PCI error recovery is one of the 'really going
bad' cases..
> I think I put it in the comment above the ops, but in any case I
> should write something in the documentation with examples and thorough
> guidelines.
> Note that there won't be any mmu notifier callbacks on an mmap of a
> device file unless the device driver asks for them or there is a
> syscall like munmap or mremap or mprotect, well, any syscall that
> works on the vma.
This is something we might need to explore: does calling
zap_vma_ptes() invoke enough notifiers that an MMU notifier or HMM
mirror consumer will release any p2p maps on that VMA?
> If we ever want to support full pin then we might have to add a
> flag so that the GPU driver can refuse an importer that wants things
> pinned forever.
This would become interesting for VFIO and RDMA at least - I don't
think VFIO has anything like SVA so it would want to import a p2p_map
and indicate that it will not respond to MMU notifiers.
GPU can refuse, but maybe RDMA would allow it...
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 4:30 ` Jason Gunthorpe
@ 2019-01-30 15:43 ` Jerome Glisse
0 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 15:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 04:30:27AM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> > > On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> > >
> > > > > But this API doesn't seem to offer any control - I thought that
> > > > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > > >
> > > > The control is within the driver implementation of those callbacks.
> > >
> > > Seems like what you mean by control is 'the exporter gets to choose
> > > the physical address at the instant of map' - which seems reasonable
> > > for GPU.
> > >
> > >
> > > > will only allow a p2p map to succeed for objects that have been tagged by
> > > > userspace in some way, ie the userspace application is in control of what
> > > > can be mapped to a peer device.
> > >
> > > I would have thought this means the VMA for the object is created
> > > without the map/unmap ops? Or are GPU objects and VMAs unrelated?
> >
> > GPU objects and VMAs are unrelated in all the open source GPU drivers
> > I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
> > object and never map it (and thus never have it associated with a
> > vma), and in fact this is very common. For graphics you usually only
> > have a handful of the hundreds of GPU objects of your application
> > actually mapped.
>
> I mean the other way: does every VMA with a p2p_map/unmap point to
> exactly one GPU object?
>
> ie I'm surprised you say that p2p_map needs to have policy; I would
> have thought the policy is applied when the VMA is created (ie objects
> that are not for p2p do not have p2p_map set), and even for GPUs
> p2p_map should really only have to do with window allocation and pure
> 'can I even do p2p' type functionality.
All the userspace APIs to enable p2p happen after object creation, and
in some cases they are mutable, ie you can decide to no longer share
the object (a userspace application decision). The BAR address space
is a resource from the GPU driver's point of view and thus from the
userspace point of view. As such, decisions that affect how it is used
and what objects can use it can change over the application's
lifetime.

This is why I would like to allow the kernel driver to apply any such
access policy, decided by the application on its objects (on top of
which the kernel GPU driver can apply its own policy for GPU resource
sharing by forcing some objects to main memory).
>
> > The idea is that we can only ask the exporter to be predictable and
> > still allow it to fail if things are really going bad.
>
> I think hot unplug / PCI error recovery is one of the 'really going
> bad' cases..
A GPU can hang and all data becomes _undefined_; it can also be
suspended to save power (think laptop with a discrete GPU for
instance). GPU threads can be killed ... So there are a few cases I
can think of where you either want to kill the p2p mapping and make
sure the importer is aware and might have a chance to report back
through its own userspace API, or at the very least fall back to dummy
pages. In some of the above cases, for instance suspend, you just want
to move things around to allow shutting down device memory.
> > I think I put it in the comment above the ops, but in any case I
> > should write something in the documentation with examples and thorough
> > guidelines.
> > Note that there won't be any mmu notifier callbacks on an mmap of a
> > device file unless the device driver asks for them or there is a
> > syscall like munmap or mremap or mprotect, well, any syscall that
> > works on the vma.
>
> > This is something we might need to explore: does calling
> > zap_vma_ptes() invoke enough notifiers that an MMU notifier or HMM
> > mirror consumer will release any p2p maps on that VMA?
Yes it does.
>
> > If we ever want to support full pin then we might have to add a
> > flag so that the GPU driver can refuse an importer that wants things
> > pinned forever.
>
> This would become interesting for VFIO and RDMA at least - I don't
> think VFIO has anything like SVA so it would want to import a p2p_map
> and indicate that it will not respond to MMU notifiers.
>
> GPU can refuse, but maybe RDMA would allow it...
OK, I will add a flag field in the next post. GPUs could allow pinning
but they would most likely use main memory for any such object; hence
it is no longer really p2p, but at least both devices look at the same
data.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 19:32 ` Jason Gunthorpe
2019-01-29 19:50 ` Jerome Glisse
@ 2019-01-29 20:39 ` Logan Gunthorpe
2019-01-29 20:57 ` Jerome Glisse
2019-01-29 20:58 ` Jason Gunthorpe
1 sibling, 2 replies; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 20:39 UTC (permalink / raw)
To: Jason Gunthorpe, Jerome Glisse
Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, linux-pci,
dri-devel, Christoph Hellwig, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
>
> static const struct vm_operations_struct my_ops = {
> 	.p2p_map = p2p_ioremap_map_op,
> 	.p2p_unmap = p2p_ioremap_unmap_op,
> };
>
> struct ioremap_data {
> 	[..]
> };
>
> fops_mmap() {
> 	vma->vm_private_data = &driver_priv->ioremap_data;
> 	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }
This is roughly what I was expecting, except I don't see exactly what
the p2p_map and p2p_unmap callbacks are for. The importing driver should
see p2pdma/hmm struct pages and use the appropriate function to map
them. It shouldn't be the responsibility of the exporting driver to
implement the mapping. And I don't think we should have 'special' vma's
for this (though we may need something to ensure we don't get mapping
requests mixed with different types of pages...).
I also figured there'd be a fault version of p2p_ioremap_device_memory()
for when you are mapping P2P memory and you want to assign the pages
lazily. Though, this can come later when someone wants to implement that.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:39 ` Logan Gunthorpe
@ 2019-01-29 20:57 ` Jerome Glisse
2019-01-29 21:30 ` Logan Gunthorpe
2019-01-29 20:58 ` Jason Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 20:57 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> > Jerome, I think it would be nice to have a helper scheme - I think the
> > simple case would be simple remapping of PCI BAR memory, so if we
> > could have, say something like:
> >
> > static const struct vm_operations_struct my_ops = {
> > 	.p2p_map = p2p_ioremap_map_op,
> > 	.p2p_unmap = p2p_ioremap_unmap_op,
> > };
> >
> > struct ioremap_data {
> > 	[..]
> > };
> >
> > fops_mmap() {
> > 	vma->vm_private_data = &driver_priv->ioremap_data;
> > 	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> > }
>
> This is roughly what I was expecting, except I don't see exactly what
> the p2p_map and p2p_unmap callbacks are for. The importing driver should
> see p2pdma/hmm struct pages and use the appropriate function to map
> them. It shouldn't be the responsibility of the exporting driver to
> implement the mapping. And I don't think we should have 'special' vma's
> for this (though we may need something to ensure we don't get mapping
> requests mixed with different types of pages...).
The GPU driver must be in control and must be called. Here there are 2
cases in this patchset, and I should have posted 2 separate patchsets
instead, as it seems this is confusing things.

For the HMM page, the physical address of the page, ie the pfn, does
not correspond to anything, ie there is nothing behind it. So the
importing device has no idea how to get a valid physical address from
an HMM page; only the device driver exporting its memory with HMM
device memory knows that.

For the special vma, ie an mmap of a device file: GPU drivers do
manage their BAR, ie the GPU has a page table that maps BAR pages to
GPU memory, and the driver _constantly_ updates this page table; this
is reflected by invalidating the CPU mapping. In fact most of the time
the CPU mappings of GPU objects are invalid; they are valid only for a
small fraction of their lifetime. So you _must_ have some call to
inform the exporting device driver that another device would like to
map one of its vmas. The exporting device can then try to avoid as
much churn as possible for the importing device. But this has
consequences, and the exporting device driver must be allowed to apply
policy and decide whether or not it authorizes the other device to
peer map its memory. For GPUs the userspace application has to call a
specific API that translates into a specific ioctl which itself sets
flags on the object (in the kernel struct tracking the userspace
object). The only way to allow program predictability is if the
application can ask and know whether it can peer export an object (ie
is there enough BAR space left).

Moreover I would like to be able to use this API between GPUs that
are inter-connected with each other, and for those the CPU page tables
are just invalid and the physical addresses to use are only meaningful
to the exporting and importing devices. So again here the core kernel
has no idea what the physical address would be.

So in both cases, at the very least for GPUs, we do want total
control to be given to the exporter.
>
> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> for when you are mapping P2P memory and you want to assign the pages
> lazily. Though, this can come later when someone wants to implement that.
For GPUs the BAR address space is managed page by page, so you do not
want to map one contiguous range of BAR addresses; you want to allow
mapping multiple BAR pages that are neither adjacent to each other nor
ordered in any way. But providing helpers for simpler devices does
make sense.
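For such a simple device, the helper Jason suggested might be sketched
like this (a hypothetical sketch: one contiguous BAR range mapped up
front, with a canned vm_operations_struct providing p2p_map/p2p_unmap):

int p2p_ioremap_device_memory(struct vm_area_struct *vma,
                              struct device *exporting_device,
                              phys_addr_t bar_base)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* Canned ops that implement p2p_map/p2p_unmap for a flat range. */
    vma->vm_ops = &p2p_ioremap_vm_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    /* Map the whole BAR range up front; a GPU driver managing its BAR
     * page by page would not use this helper. */
    return io_remap_pfn_range(vma, vma->vm_start,
                              bar_base >> PAGE_SHIFT, size,
                              vma->vm_page_prot);
}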
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:57 ` Jerome Glisse
@ 2019-01-29 21:30 ` Logan Gunthorpe
2019-01-29 21:50 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 21:30 UTC (permalink / raw)
To: Jerome Glisse
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> The GPU driver must be in control and must be called. Here there are 2
> cases in this patchset, and I should have posted 2 separate patchsets
> instead, as it seems this is confusing things.
>
> For the HMM page, the physical address of the page, ie the pfn, does
> not correspond to anything, ie there is nothing behind it. So the
> importing device has no idea how to get a valid physical address from
> an HMM page; only the device driver exporting its memory with HMM
> device memory knows that.
>
> For the special vma, ie an mmap of a device file: GPU drivers do
> manage their BAR, ie the GPU has a page table that maps BAR pages to
> GPU memory, and the driver _constantly_ updates this page table; this
> is reflected by invalidating the CPU mapping. In fact most of the time
> the CPU mappings of GPU objects are invalid; they are valid only for a
> small fraction of their lifetime. So you _must_ have some call to
> inform the exporting device driver that another device would like to
> map one of its vmas. The exporting device can then try to avoid as
> much churn as possible for the importing device. But this has
> consequences, and the exporting device driver must be allowed to apply
> policy and decide whether or not it authorizes the other device to
> peer map its memory. For GPUs the userspace application has to call a
> specific API that translates into a specific ioctl which itself sets
> flags on the object (in the kernel struct tracking the userspace
> object). The only way to allow program predictability is if the
> application can ask and know whether it can peer export an object (ie
> is there enough BAR space left).
This all seems like it's an HMM problem and not related to mapping
BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
pages, it calls an HMM function to map them. If HMM needs to consult
with the driver on aspects of how that's mapped, then that's between HMM
and the driver and not something I really care about. But making the
entire mapping stuff tied to userspace VMAs does not make sense to me.
What if somebody wants to map some HMM pages in the same way but from
kernel space and they therefore don't have a VMA?
>> I also figured there'd be a fault version of p2p_ioremap_device_memory()
>> for when you are mapping P2P memory and you want to assign the pages
>> lazily. Though, this can come later when someone wants to implement that.
>
> For GPUs the BAR address space is managed page by page, so you do not
> want to map one contiguous range of BAR addresses; you want to allow
> mapping multiple BAR pages that are neither adjacent to each other nor
> ordered in any way. But providing helpers for simpler devices does
> make sense.
Well, this has little do with the backing device but how the memory is
mapped into userspace. With p2p_ioremap_device_memory() the entire range
is mapped into the userspace VMA immediately during the call to mmap().
With p2p_fault_device_memory(), mmap() would not actually map anything
and a page in the VMA would be mapped only when userspace accesses it
(using fault()). It seems to me like GPUs would prefer the latter but if
HMM takes care of the mapping from userspace potential pages to actual
GPU pages through the BAR then that may not be true.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 21:30 ` Logan Gunthorpe
@ 2019-01-29 21:50 ` Jerome Glisse
2019-01-29 22:58 ` Logan Gunthorpe
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 21:50 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, dri-devel, Christoph Hellwig,
linux-mm, iommu, Jason Gunthorpe, linux-pci, Bjorn Helgaas,
Robin Murphy, Christian Koenig, Marek Szyprowski
On Tue, Jan 29, 2019 at 02:30:49PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> > The GPU driver must be in control and must be called. Here there are 2
> > cases in this patchset, and I should have posted 2 separate patchsets
> > instead, as it seems this is confusing things.
> >
> > For the HMM page, the physical address of the page, ie the pfn, does
> > not correspond to anything, ie there is nothing behind it. So the
> > importing device has no idea how to get a valid physical address from
> > an HMM page; only the device driver exporting its memory with HMM
> > device memory knows that.
> >
> > For the special vma, ie an mmap of a device file: GPU drivers do
> > manage their BAR, ie the GPU has a page table that maps BAR pages to
> > GPU memory, and the driver _constantly_ updates this page table; this
> > is reflected by invalidating the CPU mapping. In fact most of the time
> > the CPU mappings of GPU objects are invalid; they are valid only for a
> > small fraction of their lifetime. So you _must_ have some call to
> > inform the exporting device driver that another device would like to
> > map one of its vmas. The exporting device can then try to avoid as
> > much churn as possible for the importing device. But this has
> > consequences, and the exporting device driver must be allowed to apply
> > policy and decide whether or not it authorizes the other device to
> > peer map its memory. For GPUs the userspace application has to call a
> > specific API that translates into a specific ioctl which itself sets
> > flags on the object (in the kernel struct tracking the userspace
> > object). The only way to allow program predictability is if the
> > application can ask and know whether it can peer export an object (ie
> > is there enough BAR space left).
>
> This all seems like it's an HMM problem and not related to mapping
> BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
> pages, it calls an HMM function to map them. If HMM needs to consult
> with the driver on aspects of how that's mapped, then that's between HMM
> and the driver and not something I really care about. But making the
> entire mapping stuff tied to userspace VMAs does not make sense to me.
> What if somebody wants to map some HMM pages in the same way but from
> kernel space and they therefore don't have a VMA?
No, this is the non HMM case I am talking about here. Fully ignore HMM
in this frame. A GPU driver that does not support or use HMM in any
way has all the properties and requirements I listed above. So all the
points I was making are without HMM in the picture whatsoever. I
should have posted this as separate patches to avoid this confusion.

Regarding your HMM question: you can not map HMM pages; any code path
that would try that would trigger a migration back to regular memory
and will use the regular memory for CPU access.
> >> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> >> for when you are mapping P2P memory and you want to assign the pages
> >> lazily. Though, this can come later when someone wants to implement that.
> >
> > For GPUs the BAR address space is managed page by page, so you do not
> > want to map one contiguous range of BAR addresses; you want to allow
> > mapping multiple BAR pages that are neither adjacent to each other nor
> > ordered in any way. But providing helpers for simpler devices does
> > make sense.
>
> Well, this has little do with the backing device but how the memory is
> mapped into userspace. With p2p_ioremap_device_memory() the entire range
> is mapped into the userspace VMA immediately during the call to mmap().
> With p2p_fault_device_memory(), mmap() would not actually map anything
> and a page in the VMA would be mapped only when userspace accesses it
> (using fault()). It seems to me like GPUs would prefer the latter but if
> HMM takes care of the mapping from userspace potential pages to actual
> GPU pages through the BAR then that may not be true.
Again HMM has nothing to do here; ignore HMM, it does not play any
role and it is not involved in any way here. GPUs want to control
which objects they allow other devices to access and which they do
not. GPU drivers _constantly_ invalidate the CPU page tables, and in
fact the CPU page tables do not have any valid pte for a vma that is
an mmap of a GPU device file for most of the vma's lifetime. Changing
that would highly disrupt and break GPU drivers. They need to control
that; they need to control what to do if another device tries to peer
map some of their memory. Hence why they need to implement the
callback and decide whether they allow the peer mapping or use device
memory for it (they can decide to fall back to main memory).

If the exporter can not control this then it is useless to GPU
drivers. I would rather not exclude GPU drivers from this.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 21:50 ` Jerome Glisse
@ 2019-01-29 22:58 ` Logan Gunthorpe
2019-01-29 23:47 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-29 22:58 UTC (permalink / raw)
To: Jerome Glisse
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> No, this is the non HMM case I am talking about here. Fully ignore HMM
> in this frame. A GPU driver that does not support or use HMM in any
> way has all the properties and requirements I listed above. So all the
> points I was making are without HMM in the picture whatsoever. I
> should have posted this as separate patches to avoid this confusion.
>
> Regarding your HMM question: you can not map HMM pages; any code path
> that would try that would trigger a migration back to regular memory
> and will use the regular memory for CPU access.
>
I thought this was the whole point of HMM... And eventually it would
support being able to map the pages through the BAR in cooperation with
the driver. If not, what's that whole layer for? Why not just have HMM
handle this situation?
And what struct pages are actually going to be backing these VMAs if
it's not using HMM?
> Again HMM has nothing to do here; ignore HMM, it does not play any
> role and it is not involved in any way here. GPUs want to control
> which objects they allow other devices to access and which they do
> not. GPU drivers _constantly_ invalidate the CPU page tables, and in
> fact the CPU page tables do not have any valid pte for a vma that is
> an mmap of a GPU device file for most of the vma's lifetime. Changing
> that would highly disrupt and break GPU drivers. They need to control
> that; they need to control what to do if another device tries to peer
> map some of their memory. Hence why they need to implement the
> callback and decide whether they allow the peer mapping or use device
> memory for it (they can decide to fall back to main memory).
But mapping is an operation on the memory/struct pages behind the VMA,
not on the VMA itself, and I think that's evident in the code: the
only way the VMA layer is involved is the fact that you're abusing
vm_ops by adding new ops there and calling them from other layers.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 22:58 ` Logan Gunthorpe
@ 2019-01-29 23:47 ` Jerome Glisse
2019-01-30 1:17 ` Logan Gunthorpe
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-29 23:47 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 03:58:45PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> > No, this is the non HMM case I am talking about here. Fully ignore HMM
> > in this frame. A GPU driver that does not support or use HMM in any
> > way has all the properties and requirements I listed above. So all the
> > points I was making are without HMM in the picture whatsoever. I
> > should have posted this as separate patches to avoid this confusion.
> >
> > Regarding your HMM question: you can not map HMM pages; any code path
> > that would try that would trigger a migration back to regular memory
> > and will use the regular memory for CPU access.
> >
>
> I thought this was the whole point of HMM... And eventually it would
> support being able to map the pages through the BAR in cooperation with
> the driver. If not, what's that whole layer for? Why not just have HMM
> handle this situation?
The whole point is to allow the use of device memory for a range of
virtual addresses of a process when it makes sense to use device
memory for that range. There are multiple cases where it does make
sense:

[1] - Only the device is accessing the range and there is no CPU
      access. For instance the program is executing/running a big
      function on the GPU and there is no concurrent CPU access; this
      is very common in all the existing GPGPU code. In fact AFAICT it
      is the most common pattern. So here you can use HMM private or
      public memory.
[2] - Both device and CPU access a common range of virtual addresses
      concurrently. In that case, if you are on a platform with a
      cache coherent inter-connect like OpenCAPI or CCIX, then you can
      use HMM public device memory and have both access the same
      memory. You can not use HMM private memory.

So far on x86 we only have PCIE, and thus so far on x86 we only have
private HMM device memory that is not accessible by the CPU in any
way.

It does not make that memory useless, far from it. Having only the
device work on the dataset while the CPU is either waiting or
accessing something else is very common.
Then HMM is a toolbox; here are some of the tools:

HMM mirror - helper to mirror a process address space onto a device,
ie this is SVM (Shared Virtual Memory)/SVA (Shared Virtual Address)
in software.

HMM private memory - allows registering device memory with the linux
kernel. The memory is not CPU accessible. The memory is fully managed
by the device driver. What and when to migrate is under the control
of the device driver.

HMM public memory - allows registering device memory with the linux
kernel. The memory must be CPU accessible and cache coherent and
abide by the platform memory model. The memory is fully managed by
the device driver, because otherwise it would disrupt the device
driver operation (for instance a GPU can also be used for graphics).

Migration - helper to perform migration to and from device memory.
It does not make any decision itself; it just performs all the
steps in the right order and calls back into the driver to get the
migration going.
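For instance, registering device memory as HMM private memory boils
down to something like the sketch below (this assumes the
hmm_devmem_add() interface as it exists at the time of this posting;
the mydev_* names are made up and the ops callbacks are driver
specific and elided):

static const struct hmm_devmem_ops mydev_devmem_ops = {
    /* .fault and .free callbacks, driver specific */
};

/* Register gpu_mem_size bytes of device memory; struct pages are
 * created for it but the memory stays CPU inaccessible. */
devmem = hmm_devmem_add(&mydev_devmem_ops, &pdev->dev, gpu_mem_size);
if (IS_ERR(devmem))
    return PTR_ERR(devmem);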
It is up to the device driver to implement heuristics and to provide a
userspace API to control memory migration to and from device memory.
For device private memory, on CPU page fault the kernel will force a
migration back to system memory so that the CPU can access the memory.
What matters here is that the memory model of the platform stays
intact, and thus you can safely use CPU atomic operations or rely on
your platform memory model for your program. Note that long term I
would like to define a common API to expose to userspace to manage
memory binding to specific device memory, so that we can mix and match
multiple device memories within a single process and define policy
too.
Also CPU atomic instructions to a PCIE BAR give _undefined_ results, and
in fact on some AMD/Intel platforms it leads to weirdness/crash/freeze.
So obviously we can not map a PCIE BAR to the CPU without breaking the
memory model. Moreover on PCIE you might not be able to resize the BAR
to expose all the device memory. GPUs can have several gigabytes of
memory and not all of them support PCIE BAR resize, and sometimes PCIE
BAR resize does not work either because of BIOS/firmware issues or
simply because you are running out of IO space.
So on x86 we are stuck with HMM private memory; I am hoping that some
day in the future we will have CCIX or something similar. But for now
we have to work with what we have.
> And what struct pages are actually going to be backing these VMAs if
> it's not using HMM?
When you have some range of virtual addresses migrated to HMM private
memory then the CPU ptes are special swap entries and they behave just
as if the memory was swapped to disk. So CPU access to those will
fault and trigger a migration back to main memory.
We still want to allow peer to peer when using HMM memory for a range
of virtual addresses (of a vma that is not an mmap of a device file)
because the peer device does not rely on atomics or on the platform
memory model. In those cases we assume that the importer is aware of
the limitations and is asking for access in good faith, and thus we
want to allow the exporting device to either allow the peer mapping
(because it has enough BAR addresses to map) or fall back to main memory.
> > Again HMM has nothing to do here, ignore HMM it does not play any role
> > and it is not involved in any way here. GPUs want to control what objects
> > they allow other devices to access and what objects they do not. GPU drivers
> > _constantly_ invalidate the CPU page table, and in fact the CPU page table
> > does not have any valid pte for a vma that is an mmap of a GPU device file
> > for most of the vma lifetime. Changing that would highly disrupt and
> > break GPU drivers. They need to control that, they need to control what
> > to do if another device tries to peer map some of their memory. Hence
> > why they need to implement the callback and decide on whether or not they
> > allow the peer mapping or use device memory for it (they can decide to
> > fall back to main memory).
>
> But mapping is an operation of the memory/struct pages behind the VMA,
> not of the VMA itself, and I think that's evident in the code by the fact
> that the only way the VMA layer is involved is that you're abusing
> vm_ops by adding new ops there and calling them from other layers.
For GPU drivers the vma ptes are populated on CPU page fault and they get
cleared quickly after. A very usual pattern is:
- The CPU writes something to the object through the object mapping, ie
through a vma. This triggers a page fault which calls the fault()
callback from the vm_operations struct. This populates the page table
for the vma.
- Userspace launches commands on the GPU; the first thing the kernel does
is clear all CPU page table entries for objects listed in the commands,
ie we do not expect any further CPU access nor do we want it.
GPU drivers have always been geared toward minimizing CPU access to GPU
memory. For objects that need to be accessed by both concurrently we use
main memory and not the device memory.
So in fact you will almost never have valid ptes for an mmap of a GPU
object (done through the GPU device file). However that does not mean
we want to block peer to peer from happening. Today the use cases we
know for peer to peer are GPUDirect (NVidia) or ROCmDMA (AMD), roughly
the same thing. The most common use cases I am aware of are:
- RDMA is streaming input directly into GPU memory, avoiding the
need for a bounce buffer in main memory (this saves both main
memory and PCIE bandwidth by avoiding RDMA->main then main->GPU).
- RDMA is streaming out results (same idea as streaming in but in
the other direction :))
- RDMA is used to monitor computation progress on the GPU and it
tries to do so with minimal disruption to the GPU. So RDMA would
like to be able to peek into GPU memory to fetch some values
and transmit them over the network.
I believe people would like to have more complex use cases, like for
instance having the GPU directly control some RDMA queue to request
data from some other host on the network, or control some block device
queue to read data from a block device directly. I believe those can
be implemented with the API set forward in those patches.
So for those above use cases it is fine to not have valid CPU ptes and
only have a peer to peer mapping. The CPU is not expected to be involved
and we should not make it a requirement. Hence we should not expect
to have valid ptes.
Also another common use case is that the GPU driver might leave ptes
that point to main memory while the GPU is using device memory for the
object corresponding to the vma those ptes are in. The expectation is
that CPU accesses are synchronized with the device accesses through the
API used by the application. Note that here we are talking about the
non HMM, non SVM case, ie special objects that are allocated through
API specific functions that result in driver ioctls and an mmap of the
device file.
Hope this helps in understanding the big picture from the GPU driver
point of view :)
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 23:47 ` Jerome Glisse
@ 2019-01-30 1:17 ` Logan Gunthorpe
2019-01-30 2:48 ` Jerome Glisse
2019-01-30 4:18 ` Jason Gunthorpe
0 siblings, 2 replies; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-30 1:17 UTC (permalink / raw)
To: Jerome Glisse
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> The whole point is to allow using device memory for a range of virtual
> addresses of a process when it makes sense to use device memory for
> that range. So there are multiple cases where it does make sense:
> [1] - Only the device is accessing the range and there is no CPU access
> For instance the program is executing/running a big function on
> the GPU and there is no concurrent CPU access, this is very
> common in all the existing GPGPU code. In fact AFAICT it is the
> most common pattern. So here you can use HMM private or public
> memory.
> [2] - Both device and CPU access a common range of virtual addresses
> concurrently. In that case if you are on a platform with a cache
> coherent inter-connect like OpenCAPI or CCIX then you can use
> HMM public device memory and have both access the same memory.
> You can not use HMM private memory.
>
> So far on x86 we only have PCIE, and thus on x86 we only have
> private HMM device memory that is not accessible by the CPU in any
> way.
I feel like you're just pulling the rug out from under us... Before you
said ignore HMM and I was asking about the use case that wasn't using
HMM and how it works without HMM. In response, you just give me *way*
too much information describing HMM. And still, as best as I can see,
managing DMA mappings (which is different from the userspace mappings)
for GPU P2P should be handled by HMM and the userspace mappings should
*just* link VMAs to HMM pages using the standard infrastructure we
already have.
>> And what struct pages are actually going to be backing these VMAs if
>> it's not using HMM?
>
> When you have some range of virtual addresses migrated to HMM private
> memory then the CPU ptes are special swap entries and they behave just
> as if the memory was swapped to disk. So CPU access to those will
> fault and trigger a migration back to main memory.
This isn't answering my question at all... I specifically asked what is
backing the VMA when we are *not* using HMM.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 1:17 ` Logan Gunthorpe
@ 2019-01-30 2:48 ` Jerome Glisse
2019-01-30 4:18 ` Jason Gunthorpe
1 sibling, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 2:48 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> > The whole point is to allow using device memory for a range of virtual
> > addresses of a process when it makes sense to use device memory for
> > that range. So there are multiple cases where it does make sense:
> > [1] - Only the device is accessing the range and there is no CPU access
> > For instance the program is executing/running a big function on
> > the GPU and there is no concurrent CPU access, this is very
> > common in all the existing GPGPU code. In fact AFAICT it is the
> > most common pattern. So here you can use HMM private or public
> > memory.
> > [2] - Both device and CPU access a common range of virtual addresses
> > concurrently. In that case if you are on a platform with a cache
> > coherent inter-connect like OpenCAPI or CCIX then you can use
> > HMM public device memory and have both access the same memory.
> > You can not use HMM private memory.
> >
> > So far on x86 we only have PCIE, and thus on x86 we only have
> > private HMM device memory that is not accessible by the CPU in any
> > way.
>
> I feel like you're just pulling the rug out from under us... Before you
> said ignore HMM and I was asking about the use case that wasn't using
> HMM and how it works without HMM. In response, you just give me *way*
> too much information describing HMM. And still, as best as I can see,
> managing DMA mappings (which is different from the userspace mappings)
> for GPU P2P should be handled by HMM and the userspace mappings should
> *just* link VMAs to HMM pages using the standard infrastructure we
> already have.
For an HMM P2P mapping we need to call into the driver to know if the
driver wants to fall back to main memory (running out of BAR addresses)
or if it can allow a peer device to directly access its memory. We also
need the call into the exporting device driver, as only the exporting
device driver can map the HMM page pfn to some physical BAR address
(which would be allocated by the driver for the GPU).
I wanted to make sure the HMM case was understood too; sorry if it
caused confusion with the non HMM case which I describe below.
> >> And what struct pages are actually going to be backing these VMAs if
> >> it's not using HMM?
> >
> > When you have some range of virtual addresses migrated to HMM private
> > memory then the CPU ptes are special swap entries and they behave just
> > as if the memory was swapped to disk. So CPU access to those will
> > fault and trigger a migration back to main memory.
>
> This isn't answering my question at all... I specifically asked what is
> backing the VMA when we are *not* using HMM.
So when you are not using HMM, ie an existing GPU object without HMM,
then like I said you do not have any valid ptes most of the time inside
the CPU page table: the GPU driver only populates the ptes with valid
entries when there is a CPU page fault, and it clears those as soon as
the corresponding object is used by the GPU. In fact some drivers also
unmap it aggressively from the BAR, making the memory totally
inaccessible to anything but the GPU.
GPU drivers do not like CPU mappings, they are quite aggressive about
clearing them. Then everything I said about having userspace decide
which objects can be shared, and with whom, does apply here. So for GPUs
you do want to give control to the GPU driver, and you do not want to
require valid CPU ptes for the vma, so that the exporting driver can
return valid addresses to the importing peer device only.
Also the exporting device driver might decide to fall back to main
memory (running out of BAR addresses for instance). So again here we
want to go through the exporting device driver so that it can take the
right action.
So the expected pattern (for GPU drivers) is:
- no valid ptes for the special vma (mmap of device file)
- the importing device calls p2p_map() for the vma; if it succeeds the
first time then we expect it will succeed for the same vma and
range the next time we call it.
- the exporting driver can either return physical addresses of pages
in its BAR space that point to the correct device memory, or
fall back to main memory
Then at any point in time:
- if the GPU driver wants to move the object around (for whatever
reason) it calls zap_vma_ptes(); the fact that there is no
valid CPU pte does not matter, it will call the mmu notifiers and thus
any importing device driver will invalidate its mapping
- an importing device driver that lost the mapping due to the mmu
notification can re-map by re-calling p2p_map() (it should
check that the vma is still valid ...) and the guideline is for
the exporting device driver to succeed and return a valid
address to the new memory used for the object
This allows a device driver like a GPU driver to keep control. The
expected pattern is still for the p2p mapping to stay undisrupted for
its whole lifetime. Invalidation should only be triggered if the GPU
driver does need to move things around.
All the above is for the non HMM case, ie the mmap of a device file,
so for any existing open source GPU device driver that does not
support HMM.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 1:17 ` Logan Gunthorpe
2019-01-30 2:48 ` Jerome Glisse
@ 2019-01-30 4:18 ` Jason Gunthorpe
2019-01-30 8:00 ` Christoph Hellwig
2019-01-30 17:17 ` Logan Gunthorpe
1 sibling, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 4:18 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jerome Glisse, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
> This isn't answering my question at all... I specifically asked what is
> backing the VMA when we are *not* using HMM.
At least for RDMA what backs the VMA today is non-struct-page BAR
memory filled in with io_remap_pfn_range().
And we want to expose this for P2P DMA. None of the HMM stuff applies
here and the p2p_map/unmap are a nice simple approach that covers all
the needs RDMA has, at least.
Every attempt to give BAR memory to struct page has run into major
trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to
hang metadata off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel
consumers, just that VMA is now mediating a general user space driven
discovery process instead of being hard wired into a driver.
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 4:18 ` Jason Gunthorpe
@ 2019-01-30 8:00 ` Christoph Hellwig
2019-01-30 15:49 ` Jerome Glisse
2019-01-30 19:06 ` Jason Gunthorpe
2019-01-30 17:17 ` Logan Gunthorpe
1 sibling, 2 replies; 95+ messages in thread
From: Christoph Hellwig @ 2019-01-30 8:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Logan Gunthorpe, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
iommu
On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> Every attempt to give BAR memory to struct page has run into major
> trouble, IMHO, so I like that this approach avoids that.
Way fewer problems than not having struct page for doing anything
non-trivial. If you map the BAR to userspace with remap_pfn_range
and friends the mapping is indeed very simple. But any operation
that expects a page structure, which is at least everything using
get_user_pages, won't work.
So you can't do direct I/O to your remapped BAR, you can't create MRs
on it, etc, etc.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 8:00 ` Christoph Hellwig
@ 2019-01-30 15:49 ` Jerome Glisse
2019-01-30 19:06 ` Jason Gunthorpe
1 sibling, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 15:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jason Gunthorpe, Logan Gunthorpe, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
>
> Way fewer problems than not having struct page for doing anything
> non-trivial. If you map the BAR to userspace with remap_pfn_range
> and friends the mapping is indeed very simple. But any operation
> that expects a page structure, which is at least everything using
> get_user_pages, won't work.
>
> So you can't do direct I/O to your remapped BAR, you can't create MRs
> on it, etc, etc.
We do not want direct I/O; in fact at least for GPUs we want to seldom
allow access to the object vma, so the fewer things that can access it
the happier we are :) All the GPU userspace driver APIs (OpenGL, OpenCL,
Vulkan, ...) that expose any such mapping to the application are very
clear on the limitation, which is often worded: the only valid thing is
direct CPU access (no syscall can be used with those pointers).
So application developers already have low expectations of what is
valid and allowed to do.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 8:00 ` Christoph Hellwig
2019-01-30 15:49 ` Jerome Glisse
@ 2019-01-30 19:06 ` Jason Gunthorpe
[not found] ` <20190130190651.GC17080-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 19:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Logan Gunthorpe, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
>
> Way fewer problems than not having struct page for doing anything
> non-trivial. If you map the BAR to userspace with remap_pfn_range
> and friends the mapping is indeed very simple. But any operation
> that expects a page structure, which is at least everything using
> get_user_pages, won't work.
GUP doesn't work anyhow today, and won't work with BAR struct pages in
the foreseeable future (Logan has sent attempts on this before).
So nothing seems lost..
> So you can't do direct I/O to your remapped BAR, you can't create MRs
> on it, etc, etc.
Jerome made the HMM mirror API use this flow, so after his patch to
switch the ODP MR to use HMM, and to switch GPU drivers, it will work
for those cases. Which is more than the zero cases that we have today
:)
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 4:18 ` Jason Gunthorpe
2019-01-30 8:00 ` Christoph Hellwig
@ 2019-01-30 17:17 ` Logan Gunthorpe
[not found] ` <bdf03cd5-f5b1-4b78-a40e-b24024ca8c9f-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-30 17:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jerome Glisse, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> Every attempt to give BAR memory to struct page has run into major
> trouble, IMHO, so I like that this approach avoids that.
>
> And if you don't have struct page then the only kernel object left to
> hang metadata off is the VMA itself.
>
> It seems very similar to the existing P2P work between in-kernel
> consumers, just that VMA is now mediating a general user space driven
> discovery process instead of being hard wired into a driver.
But the kernel now has P2P BARs backed by struct pages and it works
well. And that's what we are doing in-kernel. We even have a hacky
out-of-tree module which exposes these pages and it also works (but
would need Jerome's solution for denying those pages in GUP, etc). So
why do something completely different in userspace so they can't share
any of the DMA map infrastructure?
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:39 ` Logan Gunthorpe
2019-01-29 20:57 ` Jerome Glisse
@ 2019-01-29 20:58 ` Jason Gunthorpe
2019-01-30 8:02 ` Christoph Hellwig
1 sibling, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-29 20:58 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jerome Glisse, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Christoph Hellwig,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> implement the mapping. And I don't think we should have 'special' vma's
> for this (though we may need something to ensure we don't get mapping
> requests mixed with different types of pages...).
I think Jerome explained the point here is to have a 'special vma'
rather than a 'special struct page' as, really, we don't need a
struct page at all to make this work.
If I recall your earlier attempts at adding struct page for BAR
memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
This does seem to avoid that pitfall entirely as we can never
accidentally get into the SGL system with this kind of memory or VMA?
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-29 20:58 ` Jason Gunthorpe
@ 2019-01-30 8:02 ` Christoph Hellwig
2019-01-30 10:33 ` Koenig, Christian
2019-01-30 17:44 ` Jason Gunthorpe
0 siblings, 2 replies; 95+ messages in thread
From: Christoph Hellwig @ 2019-01-30 8:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Logan Gunthorpe, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
iommu
On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
>
> > implement the mapping. And I don't think we should have 'special' vma's
> > for this (though we may need something to ensure we don't get mapping
> > requests mixed with different types of pages...).
>
> I think Jerome explained the point here is to have a 'special vma'
> rather than a 'special struct page' as, really, we don't need a
> struct page at all to make this work.
>
> If I recall your earlier attempts at adding struct page for BAR
> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
Struct page is what makes O_DIRECT work, and what makes sgls, biovecs,
etc work on it. Without struct page none of the above can work at all.
That is why we use struct page for backing BARs in the existing P2P
code. Not that I'm a particular fan of creating struct page for this
device memory, but without major invasive surgery to large parts of the
kernel it is the only way to make it work.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 8:02 ` Christoph Hellwig
@ 2019-01-30 10:33 ` Koenig, Christian
2019-01-30 15:55 ` Jerome Glisse
2019-01-30 17:44 ` Jason Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Koenig, Christian @ 2019-01-30 10:33 UTC (permalink / raw)
To: Christoph Hellwig, Jason Gunthorpe
Cc: Logan Gunthorpe, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas, Kuehling,
Felix, linux-pci, dri-devel, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
Am 30.01.19 um 09:02 schrieb Christoph Hellwig:
> On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
>> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
>>
>>> implement the mapping. And I don't think we should have 'special' vma's
>>> for this (though we may need something to ensure we don't get mapping
>>> requests mixed with different types of pages...).
>> I think Jerome explained the point here is to have a 'special vma'
>> rather than a 'special struct page' as, really, we don't need a
>> struct page at all to make this work.
>>
>> If I recall your earlier attempts at adding struct page for BAR
>> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> Struct page is what makes O_DIRECT work, and what makes sgls, biovecs,
> etc work on it. Without struct page none of the above can work at all.
> That is why we use struct page for backing BARs in the existing P2P
> code. Not that I'm a particular fan of creating struct page for this
> device memory, but without major invasive surgery to large parts of the
> kernel it is the only way to make it work.
The problem seems to be that struct page does two things:
1. Memory management for system memory.
2. The object to work with in the I/O layer.
This was done because a good part of that stuff overlaps, like reference
counting how often a page is used. The problem now is that this doesn't
work very well for device memory in some cases.
For example on GPUs you usually have a large amount of memory which is
not even accessible by the CPU. In other words you can't easily create a
struct page for it because you can't reference it with a physical CPU
address.
Maybe struct page should be split up into smaller structures? I mean
it's really overloaded with data.
Christian.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 10:33 ` Koenig, Christian
@ 2019-01-30 15:55 ` Jerome Glisse
2019-01-30 17:26 ` Christoph Hellwig
0 siblings, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 15:55 UTC (permalink / raw)
To: Koenig, Christian
Cc: Christoph Hellwig, Jason Gunthorpe, Logan Gunthorpe, linux-mm,
linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Kuehling, Felix, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 10:33:39AM +0000, Koenig, Christian wrote:
> Am 30.01.19 um 09:02 schrieb Christoph Hellwig:
> > On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> >> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> >>
> >>> implement the mapping. And I don't think we should have 'special' vma's
> >>> for this (though we may need something to ensure we don't get mapping
> >>> requests mixed with different types of pages...).
> >> I think Jerome explained the point here is to have a 'special vma'
> >> rather than a 'special struct page' as, really, we don't need a
> >> struct page at all to make this work.
> >>
> >> If I recall your earlier attempts at adding struct page for BAR
> >> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> > Struct page is what makes O_DIRECT work, and what makes sgls, biovecs,
> > etc work on it. Without struct page none of the above can work at all.
> > That is why we use struct page for backing BARs in the existing P2P
> > code. Not that I'm a particular fan of creating struct page for this
> > device memory, but without major invasive surgery to large parts of the
> > kernel it is the only way to make it work.
>
> The problem seems to be that struct page does two things:
>
> 1. Memory management for system memory.
> 2. The object to work with in the I/O layer.
>
> This was done because a good part of that stuff overlaps, like reference
> counting how often a page is used. The problem now is that this doesn't
> work very well for device memory in some cases.
>
> For example on GPUs you usually have a large amount of memory which is
> not even accessible by the CPU. In other words you can't easily create a
> struct page for it because you can't reference it with a physical CPU
> address.
>
> Maybe struct page should be split up into smaller structures? I mean
> it's really overloaded with data.
I think the simpler answer is that we do not want to allow GUP or
anything similar to pin BAR or device memory. Doing so can only hurt us
long term by fragmenting the GPU memory and forbidding us from moving
things around. For transparent use of device memory within a process,
pinning is definitely forbidden.
I do not see any good reason why we would want to pin device memory for
the existing GPU GEM objects. Userspace has always had very low
expectations of what it can do with an mmap of those objects, and I
believe it is better to keep expectations low here and say nothing will
work with those pointers. I just do not see a valid and compelling use
case to change that :)
Even outside GPU drivers, device drivers like RDMA just want to share
their doorbell with other devices and they do not want to see those
doorbell pages used in direct I/O or anything similar AFAICT.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 15:55 ` Jerome Glisse
@ 2019-01-30 17:26 ` Christoph Hellwig
2019-01-30 17:32 ` Logan Gunthorpe
` (2 more replies)
0 siblings, 3 replies; 95+ messages in thread
From: Christoph Hellwig @ 2019-01-30 17:26 UTC (permalink / raw)
To: Jerome Glisse
Cc: Koenig, Christian, Christoph Hellwig, Jason Gunthorpe,
Logan Gunthorpe, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Kuehling, Felix, linux-pci,
dri-devel, Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> Even outside GPU drivers, device drivers like RDMA just want to share
> their doorbell with other devices and they do not want to see those
> doorbell pages used in direct I/O or anything similar AFAICT.
At least Mellanox HCAs support an inline data feature where you
can copy data directly into the BAR. For something like a userspace
NVMe target it might be very useful to do direct I/O straight into
the BAR for that.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 17:26 ` Christoph Hellwig
@ 2019-01-30 17:32 ` Logan Gunthorpe
2019-01-30 17:39 ` Jason Gunthorpe
2019-01-30 18:05 ` Jerome Glisse
2 siblings, 0 replies; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-30 17:32 UTC (permalink / raw)
To: Christoph Hellwig, Jerome Glisse
Cc: Koenig, Christian, Jason Gunthorpe, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas, Kuehling,
Felix, linux-pci, dri-devel, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
On 2019-01-30 10:26 a.m., Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
>> Even outside GPU drivers, device drivers like RDMA just want to share
>> their doorbell with other devices and they do not want to see those
>> doorbell pages used in direct I/O or anything similar AFAICT.
>
> At least Mellanox HCAs support an inline data feature where you
> can copy data directly into the BAR. For something like a userspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.
Yup, these are things we definitely want to be able to do, and have done
with hacky garbage code: Direct I/O from NVMe to a P2P BAR, then we could
Direct I/O to another drive or map it as an MR and send it over an RNIC.
We'd definitely like to move in that direction. And a world where such
userspace mappings are gimped by the fact that they are only some
special feature of userspace VMAs that can only be used in specialized
userspace interfaces is not useful to us.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 17:26 ` Christoph Hellwig
2019-01-30 17:32 ` Logan Gunthorpe
@ 2019-01-30 17:39 ` Jason Gunthorpe
2019-01-30 18:05 ` Jerome Glisse
2 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 17:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jerome Glisse, Koenig, Christian, Logan Gunthorpe, linux-mm,
linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Kuehling, Felix, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 06:26:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> > Even outside GPU drivers, device drivers like RDMA just want to share
> > their doorbell with other devices and they do not want to see those
> > doorbell pages used in direct I/O or anything similar AFAICT.
>
> At least Mellanox HCAs support an inline data feature where you
> can copy data directly into the BAR. For something like a userspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.
It doesn't really work like that.
The PCI-E TLP sequence to trigger this feature is very precise, and
the data requires the right headers/etc. Mixing that with O_DIRECT
seems very unlikely.
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 17:26 ` Christoph Hellwig
2019-01-30 17:32 ` Logan Gunthorpe
2019-01-30 17:39 ` Jason Gunthorpe
@ 2019-01-30 18:05 ` Jerome Glisse
2 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 18:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Koenig, Christian, Jason Gunthorpe, Logan Gunthorpe, linux-mm,
linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Kuehling, Felix, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 06:26:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> > Even outside GPU drivers, device drivers like RDMA just want to share
> > their doorbell with other devices and they do not want to see those
> > doorbell pages used in direct I/O or anything similar AFAICT.
>
> At least Mellanox HCAs support an inline data feature where you
> can copy data directly into the BAR. For something like a userspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.
And what I am proposing is not exclusive of that. If the exporting
device wants to have struct page for its BAR then it can do so. What I
do not want is to impose that burden on everyone, as many devices do not
want or do not care for that. Moreover having struct page and allowing
that struct page to trickle down into obscure corners of the kernel
means that exporters that want that will also have the burden to check
that what they are doing does not end up in something terribly bad.
While I would like one API that fits all, I do not think that we can
sanely do that for P2P. There are too many differences between how
different devices expose and manage their BARs to make any such attempt
reasonably sane.
Maybe things will evolve organically, but for now I do not see a way
outside the API I am proposing (again this is not exclusive of the
struct page API that is upstream; both can co-exist and a device can use
both or just one).
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 8:02 ` Christoph Hellwig
2019-01-30 10:33 ` Koenig, Christian
@ 2019-01-30 17:44 ` Jason Gunthorpe
2019-01-30 18:13 ` Logan Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 17:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Logan Gunthorpe, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 09:02:08AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> > On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> >
> > > implement the mapping. And I don't think we should have 'special' vma's
> > > for this (though we may need something to ensure we don't get mapping
> > > requests mixed with different types of pages...).
> >
> > I think Jerome explained the point here is to have a 'special vma'
> > rather than a 'special struct page' as, really, we don't need a
> > struct page at all to make this work.
> >
> > If I recall your earlier attempts at adding struct page for BAR
> > memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
>
> Struct page is what makes O_DIRECT work, and what makes sgls, biovecs,
> etc work on it. Without struct page none of the above can work at all.
> That is why we use struct page for backing BARs in the existing P2P
> code. Not that I'm a particular fan of creating struct page for this
> device memory, but without major invasive surgery to large parts of the
> kernel it is the only way to make it work.
I don't think anyone is interested in O_DIRECT/etc for RDMA doorbell
pages.
.. and again, I recall Logan already attempted to mix non-CPU memory
into sgls and it was a disaster. You pointed out that one cannot just
put iomem addresses into an SGL without auditing basically the entire
block stack to prove that nothing uses iomem without an iomem
accessor.
I recall that proposal veered into a direction where the block layer
would just fail very early if there was iomem in the sgl, so generally
no O_DIRECT support anyhow.
We already accepted the P2P stuff from Logan as essentially a giant
special case - it only works with RDMA and only because RDMA MR was
hacked up with a special p2p callback.
I don't see why a special case with a VMA is really that different.
If someone figures out the struct page path down the road it can
obviously be harmonized with this VMA approach pretty easily.
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 17:44 ` Jason Gunthorpe
@ 2019-01-30 18:13 ` Logan Gunthorpe
2019-01-30 18:50 ` Jerome Glisse
2019-01-30 19:19 ` Jason Gunthorpe
0 siblings, 2 replies; 95+ messages in thread
From: Logan Gunthorpe @ 2019-01-30 18:13 UTC (permalink / raw)
To: Jason Gunthorpe, Christoph Hellwig
Cc: Jerome Glisse, linux-mm, linux-kernel, Greg Kroah-Hartman,
Rafael J . Wysocki, Bjorn Helgaas, Christian Koenig,
Felix Kuehling, linux-pci, dri-devel, Marek Szyprowski,
Robin Murphy, Joerg Roedel, iommu
On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> I don't see why a special case with a VMA is really that different.
Well one *really* big difference is the VMA changes necessarily expose
specialized new functionality to userspace which has to be supported
forever and may be difficult to change. The p2pdma code is largely
in-kernel and we can rework and change the interfaces all we want as we
improve our struct page infrastructure.
I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
and can be used pretty generically to do other things. Though, the other
ideas we've talked about doing are pretty far off and may have other
challenges.
Logan
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 18:13 ` Logan Gunthorpe
@ 2019-01-30 18:50 ` Jerome Glisse
2019-01-31 8:02 ` Christoph Hellwig
2019-01-30 19:19 ` Jason Gunthorpe
1 sibling, 1 reply; 95+ messages in thread
From: Jerome Glisse @ 2019-01-30 18:50 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Jason Gunthorpe, Christoph Hellwig, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> > I don't see why a special case with a VMA is really that different.
>
> Well one *really* big difference is the VMA changes necessarily expose
> specialized new functionality to userspace which has to be supported
> forever and may be difficult to change. The p2pdma code is largely
> in-kernel and we can rework and change the interfaces all we want as we
> improve our struct page infrastructure.
I do not see how the VMA changes are any different from using struct
page with respect to userspace exposure. Those vma callbacks do not need
to be set by everyone; in fact the expectation is that only a handful of
drivers will set those.
How can we do p2p between RDMA and GPU for instance, without exposure
to userspace? At some point you need to tell userspace hey this kernel
does allow you to do that :)
RDMA works on vmas, and a GPU driver can easily set up a vma for an
object, hence why the vma sounds like a logical place. In fact a vma
(mmap of a device file) is a very common device driver pattern.
In the model I am proposing the exporting device is in control of
policy, ie whether or not to allow the peer to peer mapping. So each
device driver can define a proper device specific API to enable and
expose that feature to userspace.
If they do, the only thing we have to preserve is the end result for
the user. The userspace does not care one bit if we achieve this in
the kernel with a set of new callbacks within the vm_operations struct
or in some other way. Only the end result matters.
So the question is: do we want to allow RDMA to access GPU driver
objects? I believe we do; there are people using non upstream solutions
with open source drivers to do just that, so it is a testimony that
there are users for this. More use cases have been proposed too.
>
> I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
> and can be used pretty generically to do other things. Though, the other
> ideas we've talked about doing are pretty far off and may have other
> challenges.
I believe p2p is highly specialized on non cache-coherent inter-connect
platforms like x86 with PCIE. So I do not think that using struct page
for this is a good idea; it is not warranted/needed, and it can only be
problematic if some random kernel code gets hold of those struct pages
without understanding that it is not regular memory.
I believe the vma callbacks are the simplest solution with the minimum
burden for the device driver and for the kernel. If any better
solution emerges there is nothing that would block us from removing
this and replacing it with the other solution.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 18:50 ` Jerome Glisse
@ 2019-01-31 8:02 ` Christoph Hellwig
2019-01-31 15:03 ` Jerome Glisse
0 siblings, 1 reply; 95+ messages in thread
From: Christoph Hellwig @ 2019-01-31 8:02 UTC (permalink / raw)
To: Jerome Glisse
Cc: Logan Gunthorpe, Jason Gunthorpe, Christoph Hellwig, linux-mm,
linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
Bjorn Helgaas, Christian Koenig, Felix Kuehling, linux-pci,
dri-devel, Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> I do not see how the VMA changes are any different from using struct
> page with respect to userspace exposure. Those vma callbacks do not need
> to be set by everyone; in fact the expectation is that only a handful of
> drivers will set those.
>
> How can we do p2p between RDMA and GPU for instance, without exposure
> to userspace? At some point you need to tell userspace hey this kernel
> does allow you to do that :)
To do RDMA on a memory region you need struct page backing to start
with..
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-31 8:02 ` Christoph Hellwig
@ 2019-01-31 15:03 ` Jerome Glisse
0 siblings, 0 replies; 95+ messages in thread
From: Jerome Glisse @ 2019-01-31 15:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Logan Gunthorpe, Jason Gunthorpe, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Thu, Jan 31, 2019 at 09:02:03AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> > I do not see how the VMA changes are any different from using struct
> > page with respect to userspace exposure. Those vma callbacks do not need
> > to be set by everyone; in fact the expectation is that only a handful of
> > drivers will set those.
> >
> > How can we do p2p between RDMA and GPU for instance, without exposure
> > to userspace? At some point you need to tell userspace hey this kernel
> > does allow you to do that :)
>
> To do RDMA on a memory region you need struct page backing to start
> with..
No you do not with this patchset, and there is no reason to tie RDMA to
struct page; it does not provide a single feature we would need. So as
it can be done without, and there is no benefit to using one, I do not
see why we should use one.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
2019-01-30 18:13 ` Logan Gunthorpe
2019-01-30 18:50 ` Jerome Glisse
@ 2019-01-30 19:19 ` Jason Gunthorpe
[not found] ` <20190130191946.GD17080-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2019-01-30 19:19 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Christoph Hellwig, Jerome Glisse, linux-mm, linux-kernel,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, linux-pci, dri-devel,
Marek Szyprowski, Robin Murphy, Joerg Roedel, iommu
On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> > I don't see why a special case with a VMA is really that different.
>
> Well one *really* big difference is the VMA changes necessarily expose
> specialized new functionality to userspace which has to be supported
> forever and may be difficult to change.
The only user change here is that more things will succeed when
creating RDMA MRs (and vice versa to GPU). I don't think this
restricts the kernel implementation at all, unless we intend to
remove P2P entirely..
Jason
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH 4/5] mm/hmm: add support for peer to peer to HMM device memory
2019-01-29 17:47 [RFC PATCH 0/5] Device peer to peer (p2p) through vma jglisse
` (2 preceding siblings ...)
2019-01-29 17:47 ` [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma jglisse
@ 2019-01-29 17:47 ` jglisse
2019-01-29 17:47 ` [RFC PATCH 5/5] mm/hmm: add support for peer to peer to special device vma jglisse
4 siblings, 0 replies; 95+ messages in thread
From: jglisse @ 2019-01-29 17:47 UTC (permalink / raw)
To: linux-mm
Cc: Joerg Roedel, Rafael J . Wysocki, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, dri-devel, Christoph Hellwig,
iommu, Jérôme Glisse, Jason Gunthorpe, linux-pci,
Bjorn Helgaas, Robin Murphy, Logan Gunthorpe, Christian Koenig,
Marek Szyprowski
From: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
include/linux/hmm.h | 47 +++++++++++++++++++++++++++++++++
mm/hmm.c | 63 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 105 insertions(+), 5 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4a1454e3efba..7a3ac182cc48 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -710,6 +710,53 @@ struct hmm_devmem_ops {
const struct page *page,
unsigned int flags,
pmd_t *pmdp);
+
+ /*
+ * p2p_map() - map pages for peer to peer between devices
+ * @devmem: device memory structure (see struct hmm_devmem)
+ * @range: range of virtual addresses that is being mapped
+ * @device: device the range is being mapped to
+ * @addr: first virtual address in the range to consider
+ * @pas: device addresses (where the actual mapping is stored)
+ * Returns: number of pages successfully mapped, 0 otherwise
+ *
+ * Map pages belonging to devmem to another device for peer to peer
+ * access. The device can decide not to map, in which case the memory
+ * will be migrated to main memory.
+ *
+ * Also there is no guarantee that all the pages in the range belong
+ * to the devmem, so it is up to the function to check that every
+ * single page does belong to devmem.
+ *
+ * Note for now we do not care about the exact error, so on failure
+ * the function should just return 0.
+ */
+ long (*p2p_map)(struct hmm_devmem *devmem,
+ struct hmm_range *range,
+ struct device *device,
+ unsigned long addr,
+ dma_addr_t *pas);
+
+ /*
+ * p2p_unmap() - unmap pages from peer to peer between devices
+ * @devmem: device memory structure (see struct hmm_devmem)
+ * @range: range of virtual addresses that is being mapped
+ * @device: device the range is being mapped to
+ * @addr: first virtual address in the range to consider
+ * @pas: device addresses (where the actual mapping is stored)
+ * Returns: number of pages successfully unmapped, 0 otherwise
+ *
+ * Unmap pages belonging to devmem previously mapped with p2p_map().
+ *
+ * Note there is no guarantee that all the pages in the range belong
+ * to the devmem, so it is up to the function to check that every
+ * single page does belong to devmem.
+ */
+ unsigned long (*p2p_unmap)(struct hmm_devmem *devmem,
+ struct hmm_range *range,
+ struct device *device,
+ unsigned long addr,
+ dma_addr_t *pas);
};
/*
diff --git a/mm/hmm.c b/mm/hmm.c
index 1a444885404e..fd49b1e116d0 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1193,16 +1193,19 @@ long hmm_range_dma_map(struct hmm_range *range,
dma_addr_t *daddrs,
bool block)
{
- unsigned long i, npages, mapped, page_size;
+ unsigned long i, npages, mapped, page_size, addr;
long ret;
+again:
ret = hmm_range_fault(range, block);
if (ret <= 0)
return ret ? ret : -EBUSY;
+ mapped = 0;
+ addr = range->start;
page_size = hmm_range_page_size(range);
npages = (range->end - range->start) >> range->page_shift;
- for (i = 0, mapped = 0; i < npages; ++i) {
+ for (i = 0; i < npages; ++i, addr += page_size) {
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
@@ -1226,6 +1229,29 @@ long hmm_range_dma_map(struct hmm_range *range,
goto unmap;
}
+ if (is_device_private_page(page)) {
+ struct hmm_devmem *devmem = page->pgmap->data;
+
+ if (!devmem->ops->p2p_map || !devmem->ops->p2p_unmap) {
+ /* Fall-back to main memory. */
+ range->default_flags |=
+ range->flags[HMM_PFN_DEVICE_PRIVATE];
+ goto again;
+ }
+
+ ret = devmem->ops->p2p_map(devmem, range, device,
+ addr, daddrs);
+ if (ret <= 0) {
+ /* Fall-back to main memory. */
+ range->default_flags |=
+ range->flags[HMM_PFN_DEVICE_PRIVATE];
+ goto again;
+ }
+ mapped += ret;
+ i += ret;
+ continue;
+ }
+
/* If it is read and write than map bi-directional. */
if (range->pfns[i] & range->values[HMM_PFN_WRITE])
dir = DMA_BIDIRECTIONAL;
@@ -1242,7 +1268,9 @@ long hmm_range_dma_map(struct hmm_range *range,
return mapped;
unmap:
- for (npages = i, i = 0; (i < npages) && mapped; ++i) {
+ npages = i;
+ addr = range->start;
+ for (i = 0; (i < npages) && mapped; ++i, addr += page_size) {
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
@@ -1253,6 +1281,18 @@ long hmm_range_dma_map(struct hmm_range *range,
if (dma_mapping_error(device, daddrs[i]))
continue;
+ if (is_device_private_page(page)) {
+ struct hmm_devmem *devmem = page->pgmap->data;
+ unsigned long inc;
+
+ inc = devmem->ops->p2p_unmap(devmem, range, device,
+ addr, &daddrs[i]);
+ BUG_ON(inc > npages);
+ mapped += inc;
+ i += inc;
+ continue;
+ }
+
/* If it is read and write than map bi-directional. */
if (range->pfns[i] & range->values[HMM_PFN_WRITE])
dir = DMA_BIDIRECTIONAL;
@@ -1285,7 +1325,7 @@ long hmm_range_dma_unmap(struct hmm_range *range,
dma_addr_t *daddrs,
bool dirty)
{
- unsigned long i, npages, page_size;
+ unsigned long i, npages, page_size, addr;
long cpages = 0;
/* Sanity check. */
@@ -1298,7 +1338,7 @@ long hmm_range_dma_unmap(struct hmm_range *range,
page_size = hmm_range_page_size(range);
npages = (range->end - range->start) >> range->page_shift;
- for (i = 0; i < npages; ++i) {
+ for (i = 0, addr = range->start; i < npages; ++i, addr += page_size) {
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
@@ -1318,6 +1358,19 @@ long hmm_range_dma_unmap(struct hmm_range *range,
set_page_dirty(page);
}
+ if (is_device_private_page(page)) {
+ struct hmm_devmem *devmem = page->pgmap->data;
+ unsigned long ret;
+
+ BUG_ON(!devmem->ops->p2p_unmap);
+
+ ret = devmem->ops->p2p_unmap(devmem, range, device,
+ addr, &daddrs[i]);
+ BUG_ON(ret > npages);
+ i += ret;
+ continue;
+ }
+
/* Unmap and clear pfns/dma address */
dma_unmap_page(device, daddrs[i], page_size, dir);
range->pfns[i] = range->values[HMM_PFN_NONE];
--
2.17.2
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH 5/5] mm/hmm: add support for peer to peer to special device vma
2019-01-29 17:47 [RFC PATCH 0/5] Device peer to peer (p2p) through vma jglisse
` (3 preceding siblings ...)
2019-01-29 17:47 ` [RFC PATCH 4/5] mm/hmm: add support for peer to peer to HMM device memory jglisse
@ 2019-01-29 17:47 ` jglisse
4 siblings, 0 replies; 95+ messages in thread
From: jglisse @ 2019-01-29 17:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Jérôme Glisse, Logan Gunthorpe,
Greg Kroah-Hartman, Rafael J . Wysocki, Bjorn Helgaas,
Christian Koenig, Felix Kuehling, Jason Gunthorpe, linux-pci,
dri-devel, Christoph Hellwig, Marek Szyprowski, Robin Murphy,
Joerg Roedel, iommu
From: Jérôme Glisse <jglisse@redhat.com>
Special device vmas (mmap of a device file) can correspond to device
driver objects that a device driver might want to share with other
devices (giving them access). This adds support for HMM to map those
special device vmas if the owning device (exporter) allows it.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
include/linux/hmm.h | 6 ++
mm/hmm.c | 156 ++++++++++++++++++++++++++++++++++----------
2 files changed, 128 insertions(+), 34 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 7a3ac182cc48..98ebe9f52432 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -137,6 +137,7 @@ enum hmm_pfn_flag_e {
* result of vmf_insert_pfn() or vm_insert_page(). Therefore, it should not
* be mirrored by a device, because the entry will never have HMM_PFN_VALID
* set and the pfn value is undefined.
+ * HMM_PFN_P2P: this entry has been mapped as P2P, ie the dma address is valid
*
* Driver provide entry value for none entry, error entry and special entry,
* driver can alias (ie use same value for error and special for instance). It
@@ -151,6 +152,7 @@ enum hmm_pfn_value_e {
HMM_PFN_ERROR,
HMM_PFN_NONE,
HMM_PFN_SPECIAL,
+ HMM_PFN_P2P,
HMM_PFN_VALUE_MAX
};
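Since HMM_PFN_VALUE_MAX grows by one, a mirror driver's sentinel table needs one more distinguishable entry. A sketch, with made-up FOO_PFN_* constants standing in for whatever values the driver chooses:

	static const uint64_t foo_hmm_values[HMM_PFN_VALUE_MAX] = {
		[HMM_PFN_ERROR] = FOO_PFN_ERROR,	/* driver-chosen */
		[HMM_PFN_NONE] = 0,
		[HMM_PFN_SPECIAL] = FOO_PFN_SPECIAL,
		[HMM_PFN_P2P] = FOO_PFN_P2P,	/* must differ from the rest */
	};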
@@ -250,6 +252,8 @@ static inline bool hmm_range_valid(struct hmm_range *range)
static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
uint64_t pfn)
{
+ if (pfn == range->values[HMM_PFN_P2P])
+ return NULL;
if (pfn == range->values[HMM_PFN_NONE])
return NULL;
if (pfn == range->values[HMM_PFN_ERROR])
@@ -270,6 +274,8 @@ static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
uint64_t pfn)
{
+ if (pfn == range->values[HMM_PFN_P2P])
+ return -1UL;
if (pfn == range->values[HMM_PFN_NONE])
return -1UL;
if (pfn == range->values[HMM_PFN_ERROR])
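Both helpers treat a P2P entry as opaque, so a caller mixing system and peer memory has to test for the sentinel before asking for a struct page. A sketch of the consuming side, assuming daddrs[] was filled by hmm_range_dma_map() and foo_program_dma() is a hypothetical driver helper:

	for (i = 0; i < npages; ++i) {
		if (range->pfns[i] == range->values[HMM_PFN_P2P]) {
			/* No struct page behind this entry; daddrs[i]
			 * holds the peer bus address and is all there
			 * is to use. */
			foo_program_dma(dev, daddrs[i]);
			continue;
		}
		page = hmm_pfn_to_page(range, range->pfns[i]);
		if (page)
			foo_program_dma(dev, daddrs[i]);
	}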
diff --git a/mm/hmm.c b/mm/hmm.c
index fd49b1e116d0..621a4f831483 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1058,37 +1058,36 @@ long hmm_range_snapshot(struct hmm_range *range)
}
EXPORT_SYMBOL(hmm_range_snapshot);
-/*
- * hmm_range_fault() - try to fault some address in a virtual address range
- * @range: range being faulted
- * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
- * Returns: 0 on success ortherwise:
- * -EINVAL:
- * Invalid argument
- * -ENOMEM:
- * Out of memory.
- * -EPERM:
- * Invalid permission (for instance asking for write and range
- * is read only).
- * -EAGAIN:
- * If you need to retry and mmap_sem was drop. This can only
- * happens if block argument is false.
- * -EBUSY:
- * If the the range is being invalidated and you should wait for
- * invalidation to finish.
- * -EFAULT:
- * Invalid (ie either no valid vma or it is illegal to access that
- * range), number of valid pages in range->pfns[] (from range start
- * address).
- *
- * This is similar to a regular CPU page fault except that it will not trigger
- * any memory migration if the memory being faulted is not accessible by CPUs
- * and caller does not ask for migration.
- *
- * On error, for one virtual address in the range, the function will mark the
- * corresponding HMM pfn entry with an error flag.
- */
-long hmm_range_fault(struct hmm_range *range, bool block)
+static int hmm_vma_p2p_map(struct hmm_range *range, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct device *device, dma_addr_t *pas)
+{
+ struct hmm_vma_walk hmm_vma_walk;
+ unsigned long npages, i;
+ bool fault, write;
+ uint64_t *pfns;
+ int ret;
+
+ i = (start - range->start) >> PAGE_SHIFT;
+ npages = (end - start) >> PAGE_SHIFT;
+ pfns = &range->pfns[i];
+ pas = &pas[i];
+
+ hmm_vma_walk.range = range;
+ hmm_vma_walk.fault = true;
+ hmm_range_need_fault(&hmm_vma_walk, pfns, npages,
+ 0, &fault, &write);
+
+ ret = vma->vm_ops->p2p_map(vma, device, start, end, pas, write);
+ for (i = 0; i < npages; ++i) {
+ pfns[i] = ret ? range->values[HMM_PFN_ERROR] :
+ range->values[HMM_PFN_P2P];
+ }
+ return ret;
+}
+
+static long _hmm_range_fault(struct hmm_range *range, bool block,
+ struct device *device, dma_addr_t *pas)
{
const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
unsigned long start = range->start, end;
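A worked example of the indexing in hmm_vma_p2p_map() above, with illustrative numbers: range->start == 0x100000, a device vma chunk covering [0x104000, 0x108000), and 4KiB pages:

	i = (start - range->start) >> PAGE_SHIFT;	/* 0x4000 >> 12 == 4 */
	npages = (end - start) >> PAGE_SHIFT;		/* also 4 */
	/* So p2p_map() fills pfns[4..7] and pas[4..7]; on failure all
	 * four entries become HMM_PFN_ERROR, on success HMM_PFN_P2P. */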
@@ -1110,9 +1109,22 @@ long hmm_range_fault(struct hmm_range *range, bool block)
}
vma = find_vma(hmm->mm, start);
- if (vma == NULL || (vma->vm_flags & device_vma))
+ if (vma == NULL)
return -EFAULT;
+ end = min(range->end, vma->vm_end);
+ if (vma->vm_flags & device_vma) {
+ if (!device || !pas || !vma->vm_ops->p2p_map)
+ return -EFAULT;
+
+ ret = hmm_vma_p2p_map(range, vma, start,
+ end, device, pas);
+ if (ret)
+ return ret;
+ start = end;
+ continue;
+ }
+
if (is_vm_hugetlb_page(vma)) {
struct hstate *h = hstate_vma(vma);
@@ -1142,7 +1154,6 @@ long hmm_range_fault(struct hmm_range *range, bool block)
hmm_vma_walk.block = block;
hmm_vma_walk.range = range;
mm_walk.private = &hmm_vma_walk;
- end = min(range->end, vma->vm_end);
mm_walk.vma = vma;
mm_walk.mm = vma->vm_mm;
@@ -1175,6 +1186,41 @@ long hmm_range_fault(struct hmm_range *range, bool block)
return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
}
+
+/*
+ * hmm_range_fault() - try to fault some address in a virtual address range
+ * @range: range being faulted
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, otherwise:
+ * -EINVAL:
+ * Invalid argument
+ * -ENOMEM:
+ * Out of memory.
+ * -EPERM:
+ * Invalid permission (for instance asking for write and range
+ * is read only).
+ * -EAGAIN:
+ * If you need to retry and mmap_sem was dropped. This can only
+ * happen if the block argument is false.
+ * -EBUSY:
+ * If the range is being invalidated and you should wait for
+ * invalidation to finish.
+ * -EFAULT:
+ * Invalid (ie either no valid vma or it is illegal to access that
+ * range), number of valid pages in range->pfns[] (from range start
+ * address).
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs
+ * and caller does not ask for migration.
+ *
+ * On error, for one virtual address in the range, the function will mark the
+ * corresponding HMM pfn entry with an error flag.
+ */
+long hmm_range_fault(struct hmm_range *range, bool block)
+{
+ return _hmm_range_fault(range, block, NULL, NULL);
+}
EXPORT_SYMBOL(hmm_range_fault);
/*
@@ -1197,7 +1243,7 @@ long hmm_range_dma_map(struct hmm_range *range,
long ret;
again:
- ret = hmm_range_fault(range, block);
+ ret = _hmm_range_fault(range, block, device, daddrs);
if (ret <= 0)
return ret ? ret : -EBUSY;
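Caller-side, nothing changes for hmm_range_fault() users (a device vma still fails with -EFAULT because no device is passed), while hmm_range_dma_map() now opts in to peer to peer. A hedged sketch, assuming the signature this series gives it, with one daddrs[] slot per page in the range:

	long ret;

	ret = hmm_range_dma_map(&range, dev, daddrs, true /* block */);
	if (ret < 0)
		goto err;
	/* Entries covering the device vma now read back as HMM_PFN_P2P
	 * in range.pfns[] with the matching bus address in daddrs[]. */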
@@ -1209,6 +1255,11 @@ long hmm_range_dma_map(struct hmm_range *range,
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) {
+ mapped++;
+ continue;
+ }
+
/*
* FIXME need to update DMA API to provide invalid DMA address
* value instead of a function to test dma address value. This
@@ -1274,6 +1325,11 @@ long hmm_range_dma_map(struct hmm_range *range,
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) {
+ mapped--;
+ continue;
+ }
+
page = hmm_pfn_to_page(range, range->pfns[i]);
if (page == NULL)
continue;
@@ -1305,6 +1361,30 @@ long hmm_range_dma_map(struct hmm_range *range,
}
EXPORT_SYMBOL(hmm_range_dma_map);
+static unsigned long hmm_vma_p2p_unmap(struct hmm_range *range,
+ struct vm_area_struct *vma,
+ unsigned long start,
+ struct device *device,
+ dma_addr_t *pas)
+{
+ unsigned long end;
+
+ if (!vma) {
+ BUG();
+ return 1;
+ }
+
+ start &= PAGE_MASK;
+ if (start < vma->vm_start || start >= vma->vm_end) {
+ BUG();
+ return 1;
+ }
+
+ end = min(range->end, vma->vm_end);
+ vma->vm_ops->p2p_unmap(vma, device, start, end, pas);
+ return (end - start) >> PAGE_SHIFT;
+}
+
/*
* hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
* @range: range being unmapped
@@ -1342,6 +1422,14 @@ long hmm_range_dma_unmap(struct hmm_range *range,
enum dma_data_direction dir = DMA_FROM_DEVICE;
struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) {
+ unsigned long ret;
+
+ BUG_ON(!vma);
+ ret = hmm_vma_p2p_unmap(range, vma, addr,
+ device, &daddrs[i]);
+ cpages += ret;
+ /* Advance by the pages this chunk consumed, not by the
+ * cumulative count; the loop increment adds the last one. */
+ i += ret - 1;
+ addr += (ret - 1) * page_size;
+ continue;
+ }
+
page = hmm_pfn_to_page(range, range->pfns[i]);
if (page == NULL)
continue;
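And the matching teardown, again a sketch assuming the signature used by this series; the vma argument is what lets the P2P branch above reach the exporter's p2p_unmap callback:

	ret = hmm_range_dma_unmap(&range, vma, dev, daddrs, true /* dirty */);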
--
2.17.2