linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper
@ 2018-03-25 10:59 Christian König
  2018-03-25 10:59 ` [PATCH 2/8] PCI: Add pci_find_common_upstream_dev() Christian König
                   ` (7 more replies)
  0 siblings, 8 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Use this function to set an sg entry to point to device resources mapped
using dma_map_resource(). The page pointer is set to NULL and only the DMA
address, length and offset values are valid.
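
For illustration only (not part of the patch), the helper's semantics can be
sketched in user space with mock types standing in for the kernel's
struct scatterlist and dma_addr_t — only the DMA address, length and offset
survive, with no backing struct page:

```c
/* Mock types; the real ones live in <linux/scatterlist.h>. */
typedef unsigned long long dma_addr_t;

struct scatterlist {
	unsigned long page_link;	/* encoded page pointer; 0 == no page */
	unsigned int offset;
	unsigned int length;
	dma_addr_t dma_address;
};

/* Mirrors the proposed helper: sg_set_page(sg, NULL, ...) clears the
 * page pointer, then the DMA address is filled in directly. */
static void sg_set_dma_addr(struct scatterlist *sg, dma_addr_t address,
			    unsigned int len, unsigned int offset)
{
	sg->page_link = 0;
	sg->offset = offset;
	sg->length = len;
	sg->dma_address = address;
}
```

The address would typically come back from dma_map_resource() in the real
kernel usage.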

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 include/linux/scatterlist.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 22b2131bcdcd..f944ee4e482c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -149,6 +149,29 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
 	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
 }
 
+/**
+ * sg_set_dma_addr - Set sg entry to point at specified dma address
+ * @sg:		 SG entry
+ * @address:	 DMA address to set
+ * @len:	 Length of data
+ * @offset:	 Offset into page
+ *
+ * Description:
+ *   Use this function to set an sg entry to point to device resources mapped
+ *   using dma_map_resource(). The page pointer is set to NULL and only the DMA
+ *   address, length and offset values are valid.
+ *
+ */
+static inline void sg_set_dma_addr(struct scatterlist *sg, dma_addr_t address,
+				   unsigned int len, unsigned int offset)
+{
+	sg_set_page(sg, NULL, len, offset);
+	sg->dma_address = address;
+#ifdef CONFIG_NEED_SG_DMA_LENGTH
+	sg->dma_length = len;
+#endif
+}
+
 /*
  * Loop over each sg element, following the pointer to a new list if necessary
  */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-28 12:38   ` Christoph Hellwig
  2018-03-25 10:59 ` [PATCH 3/8] PCI: Add pci_peer_traffic_supported() Christian König
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

From: "wdavis@nvidia.com" <wdavis@nvidia.com>

Add an interface to find the first device which is upstream of both
devices.

Signed-off-by: Will Davis <wdavis@nvidia.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/pci/search.c | 24 ++++++++++++++++++++++++
 include/linux/pci.h  |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index bc1e023f1353..446648f4238b 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -393,3 +393,27 @@ int pci_dev_present(const struct pci_device_id *ids)
 	return 0;
 }
 EXPORT_SYMBOL(pci_dev_present);
+
+/**
+ * pci_find_common_upstream_dev - Returns the first common upstream device
+ * @dev: the PCI device from which the bus hierarchy walk will start
+ * @other: the PCI device whose bus number will be used for the search
+ *
+ * Walks up the bus hierarchy from the @dev device, looking for the first bus
+ * which the @other device is downstream of. Returns %NULL if the devices do
+ * not share any upstream devices.
+ */
+struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *dev,
+					     struct pci_dev *other)
+{
+	struct pci_dev *pdev = dev;
+
+	while (pdev != NULL) {
+		if ((other->bus->number >= pdev->bus->busn_res.start) &&
+		    (other->bus->number <= pdev->bus->busn_res.end))
+			return pdev;
+		pdev = pci_upstream_bridge(pdev);
+	}
+
+	return NULL;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..0d29f5cdcb04 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -956,6 +956,8 @@ static inline struct pci_dev *pci_get_bus_and_slot(unsigned int bus,
 }
 struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
 int pci_dev_present(const struct pci_device_id *ids);
+struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *from,
+					     struct pci_dev *to);
 
 int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
 			     int where, u8 *val);
-- 
2.14.1


* [PATCH 3/8] PCI: Add pci_peer_traffic_supported()
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
  2018-03-25 10:59 ` [PATCH 2/8] PCI: Add pci_find_common_upstream_dev() Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-25 10:59 ` [PATCH 4/8] dma-buf: add peer2peer flag Christian König
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

From: "wdavis@nvidia.com" <wdavis@nvidia.com>

Add checks for topology and ACS configuration to determine whether or not
peer traffic should be supported between two PCI devices.
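
The decision table below is a simplified user-space sketch of those checks
(the same-root-port/Upstream-Forwarding special case is elided for brevity);
the booleans stand in for the config-space reads the real code performs:

```c
#include <stdbool.h>

struct p2p_state {
	bool same_host_bridge;
	bool host_bridge_whitelisted;	/* e.g. an AMD Zen host bridge */
	bool acs_implemented;		/* root port exposes an ACS capability */
	bool acs_rr_cap;		/* ACS P2P Request Redirect implemented */
	bool acs_rr_enabled;		/* ... and currently enabled */
	bool same_root_port;
	bool have_common_upstream;
	bool iommu_present;		/* translation agent available */
};

static bool peer_traffic_supported(const struct p2p_state *s)
{
	/* Separate host bridges: only allow known-good hardware. */
	if (!s->same_host_bridge)
		return s->host_bridge_whitelisted;

	/* No ACS at all: optimistically allow with a common upstream. */
	if (!s->acs_implemented)
		return s->have_common_upstream;

	/* RR implemented but disabled: traffic stays below one root port. */
	if (s->acs_rr_cap && !s->acs_rr_enabled)
		return s->same_root_port;

	/* RR absent or enabled: need an IOMMU to handle the redirect. */
	return s->iommu_present;
}
```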

Signed-off-by: Will Davis <wdavis@nvidia.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/pci/pci.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h |   1 +
 2 files changed, 113 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..efdca3c9dad1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -28,6 +28,7 @@
 #include <linux/device.h>
 #include <linux/pm_runtime.h>
 #include <linux/pci_hotplug.h>
+#include <linux/iommu.h>
 #include <linux/vmalloc.h>
 #include <linux/pci-ats.h>
 #include <asm/setup.h>
@@ -5285,6 +5286,117 @@ void pci_ignore_hotplug(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
 
+static bool pci_peer_host_traffic_supported(struct pci_host_bridge *host)
+{
+	struct pci_dev *bridge = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
+	unsigned short vendor, device;
+
+	if (!bridge)
+		return false;
+
+	vendor = bridge->vendor;
+	device = bridge->device;
+	pci_dev_put(bridge);
+
+	/* AMD ZEN host bridges can do peer to peer */
+	if (vendor == PCI_VENDOR_ID_AMD && device == 0x1450)
+		return true;
+
+	/* TODO: Extend this into a proper whitelist */
+	return false;
+}
+
+bool pci_peer_traffic_supported(struct pci_dev *dev, struct pci_dev *peer)
+{
+	struct pci_dev *rp_dev, *rp_peer, *common_upstream;
+	struct pci_host_bridge *peer_host_bridge;
+	struct pci_host_bridge *dev_host_bridge;
+	int pos;
+	u16 cap;
+
+	/* This is tricky enough, focus on PCIe for now */
+	if (!pci_is_pcie(dev) || !pci_is_pcie(peer))
+		return false;
+
+	/*
+	 * Disallow the peer-to-peer traffic if the devices do not share a
+	 * host bridge. The PCI specification does not make any guarantees
+	 * about P2P capabilities between devices under separate domains.
+	 *
+	 * PCI Local Bus Specification Revision 3.0, section 3.10:
+	 *    "Peer-to-peer transactions crossing multiple host bridges
+	 *     PCI host bridges may, but are not required to, support PCI
+	 *     peer-to-peer transactions that traverse multiple PCI host
+	 *     bridges."
+	 */
+	dev_host_bridge = pci_find_host_bridge(dev->bus);
+	peer_host_bridge = pci_find_host_bridge(peer->bus);
+	if (dev_host_bridge != peer_host_bridge)
+		return pci_peer_host_traffic_supported(dev_host_bridge);
+
+	rp_dev = pcie_find_root_port(dev);
+	rp_peer = pcie_find_root_port(peer);
+	common_upstream = pci_find_common_upstream_dev(dev, peer);
+
+	/*
+	 * Access Control Services (ACS) Checks
+	 *
+	 * ACS has a capability bit for P2P Request Redirects (RR),
+	 * but unfortunately it doesn't tell us much about the real
+	 * capabilities of the hardware.
+	 *
+	 * PCI Express Base Specification Revision 3.0, section
+	 * 6.12.1.1:
+	 *    "ACS P2P Request Redirect: must be implemented by Root
+	 *     Ports that support peer-to-peer traffic with other
+	 *     Root Ports; [80]"
+	 * but
+	 *    "[80] Root Port indication of ACS P2P Request Redirect
+	 *     or ACS P2P Completion Redirect support does not imply
+	 *     any particular level of peer-to-peer support by the
+	 *     Root Complex, or that peer-to-peer traffic is
+	 *     supported at all"
+	 */
+
+	/*
+	 * If ACS is not implemented, we have no idea about P2P
+	 * support. Optimistically allow this if there is a common
+	 * upstream device.
+	 */
+	pos = pci_find_ext_capability(rp_dev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return common_upstream != NULL;
+
+	/*
+	 * If the devices are under the same root port and have a common
+	 * upstream device, allow if the root port is further upstream
+	 * from the common upstream device and the common upstream
+	 * device has Upstream Forwarding disabled, or if the root port
+	 * is the common upstream device and ACS is not implemented.
+	 */
+	pci_read_config_word(rp_dev, pos + PCI_ACS_CAP, &cap);
+	if ((rp_dev == rp_peer && common_upstream) &&
+	    (((common_upstream != rp_dev) &&
+	      !pci_acs_enabled(common_upstream, PCI_ACS_UF)) ||
+	     ((common_upstream == rp_dev) && ((cap & PCI_ACS_RR) == 0))))
+		return true;
+
+	/*
+	 * If ACS RR is implemented and disabled, allow only if the
+	 * devices are under the same root port.
+	 */
+	if (cap & PCI_ACS_RR && !pci_acs_enabled(rp_dev, PCI_ACS_RR))
+		return rp_dev == rp_peer;
+
+	/*
+	 * If ACS RR is not implemented, or is implemented and enabled,
+	 * only allow if there's a translation agent enabled to do the
+	 * redirect.
+	 */
+	return iommu_present(&pci_bus_type);
+}
+EXPORT_SYMBOL(pci_peer_traffic_supported);
+
 resource_size_t __weak pcibios_default_alignment(void)
 {
 	return 0;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0d29f5cdcb04..3c8eaa505991 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -921,6 +921,7 @@ void pci_remove_root_bus(struct pci_bus *bus);
 void pci_setup_cardbus(struct pci_bus *bus);
 void pcibios_setup_bridge(struct pci_bus *bus, unsigned long type);
 void pci_sort_breadthfirst(void);
+bool pci_peer_traffic_supported(struct pci_dev *dev, struct pci_dev *peer);
 #define dev_is_pci(d) ((d)->bus == &pci_bus_type)
 #define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
 
-- 
2.14.1


* [PATCH 4/8] dma-buf: add peer2peer flag
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
  2018-03-25 10:59 ` [PATCH 2/8] PCI: Add pci_find_common_upstream_dev() Christian König
  2018-03-25 10:59 ` [PATCH 3/8] PCI: Add pci_peer_traffic_supported() Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-29  6:57   ` Daniel Vetter
  2018-03-25 10:59 ` [PATCH 5/8] drm/amdgpu: print DMA-buf status in debugfs Christian König
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Add a peer2peer flag noting that the importer can deal with device
resources which are not backed by pages.
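
As a sketch of the negotiation this flag enables (mock types; the real
structs live in include/linux/dma-buf.h, and the exporter-side check only
lands in a later patch of this series):

```c
#include <stdbool.h>

struct mock_attach_info {
	bool peer2peer;		/* importer: can consume pageless sg entries */
};

struct mock_attachment {
	bool peer2peer;		/* exporter may veto the request at attach time */
};

/* Importer side: opt in only when the mapping code copes with sg
 * entries that carry a DMA address but no struct page. */
static void importer_fill_info(struct mock_attach_info *info)
{
	info->peer2peer = true;
}

/* Exporter side: honor the request only if the bus topology actually
 * allows peer-to-peer traffic between the two devices. */
static void exporter_attach(struct mock_attachment *attach,
			    const struct mock_attach_info *info,
			    bool bus_supports_p2p)
{
	attach->peer2peer = info->peer2peer && bus_supports_p2p;
}
```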

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c | 1 +
 include/linux/dma-buf.h   | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ffaa2f9a9c2c..f420225f93c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
 
 	attach->dev = info->dev;
 	attach->dmabuf = dmabuf;
+	attach->peer2peer = info->peer2peer;
 	attach->priv = info->priv;
 	attach->invalidate = info->invalidate;
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 15dd8598bff1..1ef50bd9bc5b 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -313,6 +313,7 @@ struct dma_buf {
  * @dmabuf: buffer for this attachment.
  * @dev: device attached to the buffer.
  * @node: list of dma_buf_attachment.
+ * @peer2peer: true if the importer can handle peer resources without pages.
  * @priv: exporter specific attachment data.
  *
  * This structure holds the attachment information between the dma_buf buffer
@@ -328,6 +329,7 @@ struct dma_buf_attachment {
 	struct dma_buf *dmabuf;
 	struct device *dev;
 	struct list_head node;
+	bool peer2peer;
 	void *priv;
 
 	/**
@@ -392,6 +394,7 @@ struct dma_buf_export_info {
  * @dmabuf:	the exported dma_buf
  * @dev:	the device which wants to import the attachment
  * @priv:	private data of importer to this attachment
+ * @peer2peer:	true if the importer can handle peer resources without pages
  * @invalidate:	callback to use for invalidating mappings
  *
  * This structure holds the information required to attach to a buffer. Used
@@ -401,6 +404,7 @@ struct dma_buf_attach_info {
 	struct dma_buf *dmabuf;
 	struct device *dev;
 	void *priv;
+	bool peer2peer;
 	void (*invalidate)(struct dma_buf_attachment *attach);
 };
 
-- 
2.14.1


* [PATCH 5/8] drm/amdgpu: print DMA-buf status in debugfs
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
                   ` (2 preceding siblings ...)
  2018-03-25 10:59 ` [PATCH 4/8] dma-buf: add peer2peer flag Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-25 10:59 ` [PATCH 6/8] drm/amdgpu: note that we can handle peer2peer DMA-buf Christian König
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Just note if a BO was imported/exported.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 46b9ea4e6103..7c1f761bb1a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -777,6 +777,8 @@ static int amdgpu_debugfs_gem_bo_info(int id, void *ptr, void *data)
 	struct amdgpu_bo *bo = gem_to_amdgpu_bo(gobj);
 	struct seq_file *m = data;
 
+	struct dma_buf_attachment *attachment;
+	struct dma_buf *dma_buf;
 	unsigned domain;
 	const char *placement;
 	unsigned pin_count;
@@ -805,6 +807,15 @@ static int amdgpu_debugfs_gem_bo_info(int id, void *ptr, void *data)
 	pin_count = READ_ONCE(bo->pin_count);
 	if (pin_count)
 		seq_printf(m, " pin count %d", pin_count);
+
+	dma_buf = READ_ONCE(bo->gem_base.dma_buf);
+	attachment = READ_ONCE(bo->gem_base.import_attach);
+
+	if (attachment)
+		seq_printf(m, " imported from %p", dma_buf);
+	else if (dma_buf)
+		seq_printf(m, " exported as %p", dma_buf);
+
 	seq_printf(m, "\n");
 
 	return 0;
-- 
2.14.1


* [PATCH 6/8] drm/amdgpu: note that we can handle peer2peer DMA-buf
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
                   ` (3 preceding siblings ...)
  2018-03-25 10:59 ` [PATCH 5/8] drm/amdgpu: print DMA-buf status in debugfs Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-25 10:59 ` [PATCH 7/8] drm/amdgpu: add amdgpu_gem_attach Christian König
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Importing should work out of the box.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index fb43faf88eb0..2566268806c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -353,6 +353,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
 	if (IS_ERR(obj))
 		return obj;
 
+	attach_info.peer2peer = true;
 	attach_info.priv = obj;
 	attach = dma_buf_attach(&attach_info);
 	if (IS_ERR(attach)) {
-- 
2.14.1


* [PATCH 7/8] drm/amdgpu: add amdgpu_gem_attach
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
                   ` (4 preceding siblings ...)
  2018-03-25 10:59 ` [PATCH 6/8] drm/amdgpu: note that we can handle peer2peer DMA-buf Christian König
@ 2018-03-25 10:59 ` Christian König
  2018-03-25 11:00 ` [PATCH 8/8] drm/amdgpu: add support for exporting VRAM using DMA-buf Christian König
  2018-03-28 12:37 ` [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christoph Hellwig
  7 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 10:59 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Check if we can do peer2peer on the PCIe bus.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 2566268806c3..133596df0775 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -134,6 +134,29 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 	return obj;
 }
 
+static int amdgpu_gem_attach(struct dma_buf *dma_buf, struct device *dev,
+			     struct dma_buf_attachment *attach)
+{
+	struct drm_gem_object *obj = dma_buf->priv;
+	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+
+	if (!attach->peer2peer)
+		return 0;
+
+	if (!dev_is_pci(dev))
+		goto no_peer2peer;
+
+	if (!pci_peer_traffic_supported(adev->pdev, to_pci_dev(dev)))
+		goto no_peer2peer;
+
+	return 0;
+
+no_peer2peer:
+	attach->peer2peer = false;
+	return 0;
+}
+
 static struct sg_table *
 amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
 		       enum dma_data_direction dir)
@@ -274,6 +297,7 @@ static int amdgpu_gem_begin_cpu_access(struct dma_buf *dma_buf,
 }
 
 static const struct dma_buf_ops amdgpu_dmabuf_ops = {
+	.attach = amdgpu_gem_attach,
 	.map_dma_buf = amdgpu_gem_map_dma_buf,
 	.unmap_dma_buf = amdgpu_gem_unmap_dma_buf,
 	.release = drm_gem_dmabuf_release,
-- 
2.14.1


* [PATCH 8/8] drm/amdgpu: add support for exporting VRAM using DMA-buf
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
                   ` (5 preceding siblings ...)
  2018-03-25 10:59 ` [PATCH 7/8] drm/amdgpu: add amdgpu_gem_attach Christian König
@ 2018-03-25 11:00 ` Christian König
  2018-03-28 12:37 ` [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christoph Hellwig
  7 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-25 11:00 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

We should be able to do this now after checking all the prerequisites.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c    | 44 +++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h      |  9 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 91 ++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 133596df0775..86d983a0678b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -197,23 +197,44 @@ amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
 	} else {
 		/* move buffer into GTT */
 		struct ttm_operation_ctx ctx = { false, false };
+		unsigned domains = AMDGPU_GEM_DOMAIN_GTT;
 
-		amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
+		if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM &&
+		    attach->peer2peer) {
+			bo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
+			domains |= AMDGPU_GEM_DOMAIN_VRAM;
+		}
+		amdgpu_ttm_placement_from_domain(bo, domains);
 		r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 		if (r)
 			goto error_unreserve;
 	}
 
-	sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages, bo->tbo.num_pages);
-	if (IS_ERR(sgt)) {
-		r = PTR_ERR(sgt);
+	switch (bo->tbo.mem.mem_type) {
+	case TTM_PL_TT:
+		sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages,
+					    bo->tbo.num_pages);
+		if (IS_ERR(sgt)) {
+			r = PTR_ERR(sgt);
+			goto error_unreserve;
+		}
+
+		if (!dma_map_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir,
+				      DMA_ATTR_SKIP_CPU_SYNC))
+			goto error_free;
+		break;
+
+	case TTM_PL_VRAM:
+		r = amdgpu_vram_mgr_alloc_sgt(adev, &bo->tbo.mem, attach->dev,
+					      dir, &sgt);
+		if (r)
+			goto error_unreserve;
+		break;
+	default:
+		r = -EINVAL;
 		goto error_unreserve;
 	}
 
-	if (!dma_map_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir,
-			      DMA_ATTR_SKIP_CPU_SYNC))
-		goto error_free;
-
 	if (attach->dev->driver != adev->dev->driver)
 		bo->prime_shared_count++;
 
@@ -254,10 +275,15 @@ static void amdgpu_gem_unmap_dma_buf(struct dma_buf_attachment *attach,
 	if (!attach->invalidate)
 		amdgpu_bo_unreserve(bo);
 
-	if (sgt) {
+	if (!sgt)
+		return;
+
+	if (sgt->sgl->page_link) {
 		dma_unmap_sg(attach->dev, sgt->sgl, sgt->nents, dir);
 		sg_free_table(sgt);
 		kfree(sgt);
+	} else {
+		amdgpu_vram_mgr_free_sgt(adev, attach->dev, dir, sgt);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 6ea7de863041..b483900abed2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -73,6 +73,15 @@ bool amdgpu_gtt_mgr_has_gart_addr(struct ttm_mem_reg *mem);
 uint64_t amdgpu_gtt_mgr_usage(struct ttm_mem_type_manager *man);
 int amdgpu_gtt_mgr_recover(struct ttm_mem_type_manager *man);
 
+int amdgpu_vram_mgr_alloc_sgt(struct amdgpu_device *adev,
+			      struct ttm_mem_reg *mem,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table **sgt);
+void amdgpu_vram_mgr_free_sgt(struct amdgpu_device *adev,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table *sgt);
 uint64_t amdgpu_vram_mgr_usage(struct ttm_mem_type_manager *man);
 uint64_t amdgpu_vram_mgr_vis_usage(struct ttm_mem_type_manager *man);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 9aca653bec07..eb8f75525a81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -233,6 +233,97 @@ static void amdgpu_vram_mgr_del(struct ttm_mem_type_manager *man,
 	mem->mm_node = NULL;
 }
 
+/**
+ * amdgpu_vram_mgr_alloc_sgt - allocate and fill a sg table
+ *
+ * @adev: amdgpu device pointer
+ * @mem: TTM memory object
+ * @dev: the other device
+ * @dir: dma direction
+ * @sgt: resulting sg table
+ *
+ * Allocate and fill a sg table from a VRAM allocation.
+ */
+int amdgpu_vram_mgr_alloc_sgt(struct amdgpu_device *adev,
+			      struct ttm_mem_reg *mem,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table **sgt)
+{
+	struct drm_mm_node *node = mem->mm_node;
+	struct scatterlist *sg;
+	int num_entries;
+	int i, r;
+
+	*sgt = kmalloc(sizeof(**sgt), GFP_KERNEL);
+	if (!*sgt)
+		return -ENOMEM;
+
+	num_entries = DIV_ROUND_UP(mem->num_pages, node->size);
+	r = sg_alloc_table(*sgt, num_entries, GFP_KERNEL);
+	if (r)
+		goto error_free;
+
+	for_each_sg((*sgt)->sgl, sg, num_entries, i)
+		sg->length = 0;
+
+	for_each_sg((*sgt)->sgl, sg, num_entries, i) {
+		phys_addr_t phys = (node->start << PAGE_SHIFT) +
+			adev->gmc.aper_base;
+		size_t size = node->size << PAGE_SHIFT;
+		dma_addr_t addr;
+
+		++node;
+		addr = dma_map_resource(dev, phys, size, dir,
+					DMA_ATTR_SKIP_CPU_SYNC);
+		r = dma_mapping_error(dev, addr);
+		if (r)
+			goto error_unmap;
+
+		sg_set_dma_addr(sg, addr, size, 0);
+	}
+	return 0;
+
+error_unmap:
+	for_each_sg((*sgt)->sgl, sg, num_entries, i) {
+		if (!sg->length)
+			continue;
+
+		dma_unmap_resource(dev, sg->dma_address,
+				   sg->length, dir,
+				   DMA_ATTR_SKIP_CPU_SYNC);
+	}
+	sg_free_table(*sgt);
+
+error_free:
+	kfree(*sgt);
+	return r;
+}
+
+/**
+ * amdgpu_vram_mgr_free_sgt - free a previously allocated sg table
+ *
+ * @adev: amdgpu device pointer
+ * @sgt: sg table to free
+ *
+ * Free a previously allocated sg table.
+ */
+void amdgpu_vram_mgr_free_sgt(struct amdgpu_device *adev,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table *sgt)
+{
+	struct scatterlist *sg;
+	int i;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, i)
+		dma_unmap_resource(dev, sg->dma_address,
+				   sg->length, dir,
+				   DMA_ATTR_SKIP_CPU_SYNC);
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
 /**
  * amdgpu_vram_mgr_usage - how many bytes are used in this domain
  *
-- 
2.14.1


* Re: [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper
  2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
                   ` (6 preceding siblings ...)
  2018-03-25 11:00 ` [PATCH 8/8] drm/amdgpu: add support for exporting VRAM using DMA-buf Christian König
@ 2018-03-28 12:37 ` Christoph Hellwig
  7 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-03-28 12:37 UTC (permalink / raw)
  To: Christian König
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

On Sun, Mar 25, 2018 at 12:59:53PM +0200, Christian König wrote:
> Use this function to set an sg entry to point to device resources mapped
> using dma_map_resource(). The page pointer is set to NULL and only the DMA
> address, length and offset values are valid.

NAK.  Please provide a higher level primitive.

Btw, your series seems to be missing a cover letter.


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-25 10:59 ` [PATCH 2/8] PCI: Add pci_find_common_upstream_dev() Christian König
@ 2018-03-28 12:38   ` Christoph Hellwig
  2018-03-28 15:07     ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-03-28 12:38 UTC (permalink / raw)
  To: Christian König
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel,
	Logan Gunthorpe

On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
> 
> Add an interface to find the first device which is upstream of both
> devices.

Please work with Logan and base this on top of the outstanding peer
to peer patchset.


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 12:38   ` Christoph Hellwig
@ 2018-03-28 15:07     ` Christian König
  2018-03-28 15:47       ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-28 15:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel,
	Logan Gunthorpe

Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>
>> Add an interface to find the first device which is upstream of both
>> devices.
> Please work with Logan and base this on top of the outstanding peer
> to peer patchset.

Can you point me to that? The last code I could find about that was from 
2015.

Thanks,
Christian.


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 15:07     ` Christian König
@ 2018-03-28 15:47       ` Logan Gunthorpe
  2018-03-28 16:02         ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-28 15:47 UTC (permalink / raw)
  To: christian.koenig, Christoph Hellwig
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel



On 28/03/18 09:07 AM, Christian König wrote:
> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>>
>>> Add an interface to find the first device which is upstream of both
>>> devices.
>> Please work with Logan and base this on top of the outstanding peer
>> to peer patchset.
> 
> Can you point me to that? The last code I could find about that was from 
> 2015.

The latest posted series is here:

https://lkml.org/lkml/2018/3/12/830

However, we've made some significant changes to the area that's similar
to what you are doing. You can find the latest un-posted version here:

https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2

Specifically this function would be of interest to you:

https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239

However, the difference between what we are doing is that we are
interested in the distance through the common upstream device and you
appear to be finding the actual common device.

Thanks,

Logan


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 15:47       ` Logan Gunthorpe
@ 2018-03-28 16:02         ` Christian König
  2018-03-28 16:25           ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-28 16:02 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Am 28.03.2018 um 17:47 schrieb Logan Gunthorpe:
>
> On 28/03/18 09:07 AM, Christian König wrote:
>> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>>>
>>>> Add an interface to find the first device which is upstream of both
>>>> devices.
>>> Please work with Logan and base this on top of the outstanding peer
>>> to peer patchset.
>> Can you point me to that? The last code I could find about that was from
>> 2015.
> The latest posted series is here:
>
> https://lkml.org/lkml/2018/3/12/830
>
> However, we've made some significant changes to the area that's similar
> to what you are doing. You can find the latest un-posted version here:
>
> https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2
>
> Specifically this function would be of interest to you:
>
> https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239
>
> However, the difference between what we are doing is that we are
> interested in the distance through the common upstream device and you
> appear to be finding the actual common device.

Yeah, that looks very similar to what I picked up from the older 
patches, going to read up on that after my vacation.

Just in general, why are you interested in the "distance" between the devices?

And BTW: at least for writes, Peer 2 Peer transactions between different 
root complexes actually working is more common than the other way around.

So I'm a bit torn between using a blacklist and a whitelist. A whitelist 
is certainly the more conservative approach, but it could get a bit long.

Thanks,
Christian.

>
> Thanks,
>
> Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 16:02         ` Christian König
@ 2018-03-28 16:25           ` Logan Gunthorpe
  2018-03-28 18:28             ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-28 16:25 UTC (permalink / raw)
  To: Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel



On 28/03/18 10:02 AM, Christian König wrote:
> Yeah, that looks very similar to what I picked up from the older 
> patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of
similarities. Though, I'm not sure what you're trying to accomplish as I
could not find a cover letter and it seems to only enable one driver. Is
it meant to enable DMA transactions only between two AMD GPUs?

I also don't see where you've taken into account the PCI bus address. On
some architectures this is not the same as the CPU physical address.

> Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory
(ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie.
the NVMe-of driver). The orchestrator driver needs to find the most
applicable provider device for a transaction in a situation that may
have multiple providers and multiple clients. So the most applicable
provider is the one that's closest ("distance"-wise) to all the clients
for the P2P transaction.

> And BTW: at least for writes, Peer 2 Peer transactions between 
> different root complexes working is actually more common than the other 
> way around.

Maybe on x86 with hardware made in the last few years. But on PowerPC,
ARM64, and likely many more architectures, the chance of support is
*much* lower. Also,
hardware that only supports P2P stores is hardly full support and is
insufficient for our needs.

> So I'm a bit torn between using a blacklist or a whitelist. A whitelist 
> is certainly the more conservative approach, but it could get a bit long.

I think a whitelist approach is correct. Given old hardware and other
architectures, a black list is going to be too long and too difficult to
comprehensively populate.

Logan


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 16:25           ` Logan Gunthorpe
@ 2018-03-28 18:28             ` Christian König
  2018-03-28 18:57               ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-28 18:28 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, linux-kernel, amd-gfx, dri-devel, linux-media

Am 28.03.2018 um 18:25 schrieb Logan Gunthorpe:
>
> On 28/03/18 10:02 AM, Christian König wrote:
>> Yeah, that looks very similar to what I picked up from the older
>> patches, going to read up on that after my vacation.
> Yeah, I was just reading through your patchset and there are a lot of
> similarities. Though, I'm not sure what you're trying to accomplish as I
> could not find a cover letter and it seems to only enable one driver.

Yeah, it was the last day before my Easter vacation and I wanted it out 
the door.

> Is it meant to enable DMA transactions only between two AMD GPUs?

Not really, DMA-buf is a general framework for sharing buffers between 
device drivers.

It is widely used in the GFX stack on laptops with both Intel+AMD, 
Intel+NVIDIA or AMD+AMD graphics devices.

In addition to that, ARM uses it quite heavily for their GFX stacks 
because they have the rendering and display devices separated.

I'm just using amdgpu as blueprint because I'm the co-maintainer of it 
and know it mostly inside out.

> I also don't see where you've taken into account the PCI bus address. On
> some architectures this is not the same as the CPU physical address.

The resource addresses are translated using dma_map_resource(). As far 
as I know that should be sufficient to offload all the architecture 
specific stuff to the DMA subsystem.

>
>> Just in general why are you interested in the "distance" of the devices?
> We've taken a general approach where some drivers may provide p2p memory
> (ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie.
> the NVMe-of driver). The orchestrator driver needs to find the most
> applicable provider device for a transaction in a situation that may
> have multiple providers and multiple clients. So the most applicable
> provider is the one that's closest ("distance"-wise) to all the clients
> for the P2P transaction.

That seems to make sense.

>
>> And BTW: At least for writes that Peer 2 Peer transactions between
>> different root complexes work is actually more common than the other way
>> around.
> Maybe on x86 with hardware made in the last few years. But on PowerPC,
> ARM64, and likely a lot more the chance of support is *much* less. Also,
> hardware that only supports P2P stores is hardly full support and is
> insufficient for our needs.

Yeah, but not for ours. See, if you want to do real peer 2 peer you need 
to take both the operation and the direction into account.

For example, when you can do writes between A and B, that doesn't mean 
that writes between B and A work. And reads are generally less likely to 
work than writes, etc.

Since the use case I'm targeting is GFX or GFX+V4L (or GFX+NIC in 
the future), I really need to handle all such use cases as well.

>
>> So I'm a bit torn between using a blacklist or a whitelist. A whitelist
>> is certainly more conservative approach, but that could get a bit long.
> I think a whitelist approach is correct. Given old hardware and other
> architectures, a black list is going to be too long and too difficult to
> comprehensively populate.

Yeah, it would certainly be better if we had something in the root 
complex capabilities. But you're right that a whitelist sounds like the 
less painful way.

Regards,
Christian.


>
> Logan
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 18:28             ` Christian König
@ 2018-03-28 18:57               ` Logan Gunthorpe
  2018-03-28 19:44                 ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-28 18:57 UTC (permalink / raw)
  To: Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, linux-kernel, amd-gfx, dri-devel, linux-media



On 28/03/18 12:28 PM, Christian König wrote:
> I'm just using amdgpu as blueprint because I'm the co-maintainer of it 
> and know it mostly inside out.

Ah, I see.

> The resource addresses are translated using dma_map_resource(). As far 
> as I know that should be sufficient to offload all the architecture 
> specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of
peer-to-peer mappings and is designed for system memory only. No
architecture I'm aware of will translate PCI CPU addresses into PCI Bus
addresses which is necessary for any transfer that doesn't go through
the root complex (though on arches like x86 the CPU and Bus address
happen to be the same). There's a lot of people that would like to see
this change but it's likely going to be a long road before it does.

Furthermore, one of the reasons our patch-set avoids going through the
root complex at all is that IOMMU drivers will need to be made aware
that it is operating on P2P memory and do arch-specific things
accordingly. There will also need to be flags that indicate whether a
given IOMMU driver supports this. None of this work is done or easy.

> Yeah, but not for ours. See, if you want to do real peer 2 peer you need 
> to take both the operation and the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real"
peer 2 peer...

> For example, when you can do writes between A and B, that doesn't mean 
> that writes between B and A work. And reads are generally less likely to 
> work than writes, etc.

If both devices are behind a switch then the PCI spec guarantees that A
can both read and write B and vice versa. Only once you involve root
complexes do you have this problem. Ie. you have unknown support which
may be no support, or partial support (stores but not loads); or
sometimes bad performance; or a combination of both... and you need some
way to figure out all this mess and that is hard. Whoever tries to
implement a white list will have to sort all this out.

Logan


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 18:57               ` Logan Gunthorpe
@ 2018-03-28 19:44                 ` Christian König
  2018-03-28 19:53                   ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-28 19:44 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, dri-devel, linux-kernel, amd-gfx, linux-media

Am 28.03.2018 um 20:57 schrieb Logan Gunthorpe:
>
> On 28/03/18 12:28 PM, Christian König wrote:
>> I'm just using amdgpu as blueprint because I'm the co-maintainer of it
>> and know it mostly inside out.
> Ah, I see.
>
>> The resource addresses are translated using dma_map_resource(). As far
>> as I know that should be sufficient to offload all the architecture
>> specific stuff to the DMA subsystem.
> It's not. The dma_map infrastructure currently has no concept of
> peer-to-peer mappings and is designed for system memory only. No
> architecture I'm aware of will translate PCI CPU addresses into PCI Bus
> addresses which is necessary for any transfer that doesn't go through
> the root complex (though on arches like x86 the CPU and Bus address
> happen to be the same). There's a lot of people that would like to see
> this change but it's likely going to be a long road before it does.

Well, isn't that exactly what dma_map_resource() is good for? As far as 
I can see it makes sure IOMMU is aware of the access route and 
translates a CPU address into a PCI Bus address.

> Furthermore, one of the reasons our patch-set avoids going through the
> root complex at all is that IOMMU drivers will need to be made aware
> that it is operating on P2P memory and do arch-specific things
> accordingly. There will also need to be flags that indicate whether a
> given IOMMU driver supports this. None of this work is done or easy.

I'm using that with the AMD IOMMU driver and at least there it works 
perfectly fine.

>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>> to keep both the operation as well as the direction into account.
> Not sure what you are saying here... I'm pretty sure we are doing "real"
> peer 2 peer...
>
>> For example when you can do writes between A and B that doesn't mean
>> that writes between B and A work. And reads are generally less likely to
>> work than writes. etc...
> If both devices are behind a switch then the PCI spec guarantees that A
> can both read and write B and vice versa.

Sorry to say that, but I know a whole bunch of PCI devices which 
horribly ignore that.

For example all AMD APUs fall under that category...

> Only once you involve root
> complexes do you have this problem. Ie. you have unknown support which
> may be no support, or partial support (stores but not loads); or
> sometimes bad performance; or a combination of both... and you need some
> way to figure out all this mess and that is hard. Whoever tries to
> implement a white list will have to sort all this out.

Yes, exactly and unfortunately it looks like I'm the poor guy who needs 
to do this :)

Regards,
Christian.

>
> Logan
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 19:44                 ` Christian König
@ 2018-03-28 19:53                   ` Logan Gunthorpe
  2018-03-29 11:44                     ` Christian König
       [not found]                     ` <CADnq5_P-z=Noos_jaME9_CERri3C-m2hPPvx2bArr36O=1FnrA@mail.gmail.com>
  0 siblings, 2 replies; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-28 19:53 UTC (permalink / raw)
  To: Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, dri-devel, linux-kernel, amd-gfx, linux-media



On 28/03/18 01:44 PM, Christian König wrote:
> Well, isn't that exactly what dma_map_resource() is good for? As far as 
> I can see it makes sure IOMMU is aware of the access route and 
> translates a CPU address into a PCI Bus address.

> I'm using that with the AMD IOMMU driver and at least there it works 
> perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just
lucky in the x86 case because that arch is simple and doesn't need to do
anything for P2P (partially due to the Bus and CPU addresses being the
same). But in the general case, you can't rely on it.

>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>> to keep both the operation as well as the direction into account.
>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>> peer 2 peer...
>>
>>> For example when you can do writes between A and B that doesn't mean
>>> that writes between B and A work. And reads are generally less likely to
>>> work than writes. etc...
>> If both devices are behind a switch then the PCI spec guarantees that A
>> can both read and write B and vice versa.
> 
> Sorry to say that, but I know a whole bunch of PCI devices which 
> horribly ignore that.

Can you elaborate? As far as the device is concerned it shouldn't know
whether a request comes from a peer or from the host. If it does do
crazy stuff like that it's well out of spec. It's up to the switch (or
root complex if good support exists) to route the request to the device
and it's the root complex that tends to be what drops the load requests
which causes the asymmetries.

Logan


* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-03-25 10:59 ` [PATCH 4/8] dma-buf: add peer2peer flag Christian König
@ 2018-03-29  6:57   ` Daniel Vetter
  2018-03-29 11:34     ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-03-29  6:57 UTC (permalink / raw)
  To: Christian König
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> Add a peer2peer flag noting that the importer can deal with device
> resources which are not backed by pages.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>

Um, strictly speaking they all should, but ttm never bothered to use the
real interfaces and just hacked around the provided sg list, grabbing the
underlying struct pages, then rebuilding & remapping the sg list again.

The entire point of using sg lists was exactly to allow this use case of
peer2peer dma (or well in general have special exporters which managed
memory/IO ranges not backed by struct page). So essentially you're having
a "I'm totally not broken flag" here.

I think a better approach would be if we add a requires_struct_page or so,
and annotate the current importers accordingly. Or we just fix them up (it
is all in shared ttm code after all, I think everyone else got this
right).
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c | 1 +
>  include/linux/dma-buf.h   | 4 ++++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index ffaa2f9a9c2c..f420225f93c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
>  
>  	attach->dev = info->dev;
>  	attach->dmabuf = dmabuf;
> +	attach->peer2peer = info->peer2peer;
>  	attach->priv = info->priv;
>  	attach->invalidate = info->invalidate;
>  
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 15dd8598bff1..1ef50bd9bc5b 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -313,6 +313,7 @@ struct dma_buf {
>   * @dmabuf: buffer for this attachment.
>   * @dev: device attached to the buffer.
>   * @node: list of dma_buf_attachment.
> + * @peer2peer: true if the importer can handle peer resources without pages.
>   * @priv: exporter specific attachment data.
>   *
>   * This structure holds the attachment information between the dma_buf buffer
> @@ -328,6 +329,7 @@ struct dma_buf_attachment {
>  	struct dma_buf *dmabuf;
>  	struct device *dev;
>  	struct list_head node;
> +	bool peer2peer;
>  	void *priv;
>  
>  	/**
> @@ -392,6 +394,7 @@ struct dma_buf_export_info {
>   * @dmabuf:	the exported dma_buf
>   * @dev:	the device which wants to import the attachment
>   * @priv:	private data of importer to this attachment
> + * @peer2peer:	true if the importer can handle peer resources without pages
>   * @invalidate:	callback to use for invalidating mappings
>   *
>   * This structure holds the information required to attach to a buffer. Used
> @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
>  	struct dma_buf *dmabuf;
>  	struct device *dev;
>  	void *priv;
> +	bool peer2peer;
>  	void (*invalidate)(struct dma_buf_attachment *attach);
>  };
>  
> -- 
> 2.14.1
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-03-29  6:57   ` Daniel Vetter
@ 2018-03-29 11:34     ` Christian König
  2018-04-03  9:09       ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-29 11:34 UTC (permalink / raw)
  To: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
>> Add a peer2peer flag noting that the importer can deal with device
>> resources which are not backed by pages.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
> Um strictly speaking they all should, but ttm never bothered to use the
> real interfaces but just hacked around the provided sg list, grabbing the
> underlying struct pages, then rebuilding&remapping the sg list again.

Actually that isn't correct. TTM converts them to a dma address array 
because drivers need it like this (at least nouveau, radeon and amdgpu).

I've fixed radeon and amdgpu to be able to deal without it and mailed 
with Ben about nouveau, but the outcome is they don't really know.

TTM itself doesn't have any need for the pages on imported BOs (you 
can't mmap them anyway); the real underlying problem is that sg tables 
don't provide what drivers need.

I think we could rather easily fix sg tables, but that is a totally 
separate task.

> The entire point of using sg lists was exactly to allow this use case of
> peer2peer dma (or well in general have special exporters which managed
> memory/IO ranges not backed by struct page). So essentially you're having
> a "I'm totally not broken flag" here.

No, independent of the needed struct page pointers, we need to note 
whether the exporter can handle peer2peer stuff from the hardware side 
in general.

So what I did is just set peer2peer allowed on the importer because of 
the driver's needs, and clear it in the exporter if the hardware can't 
handle that.

> I think a better approach would be if we add a requires_struct_page or so,
> and annotate the current importers accordingly. Or we just fix them up (it
> is all in shared ttm code after all, I think everyone else got this
> right).

I would rather not bet on that.

Christian.

> -Daniel
>
>> ---
>>   drivers/dma-buf/dma-buf.c | 1 +
>>   include/linux/dma-buf.h   | 4 ++++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index ffaa2f9a9c2c..f420225f93c6 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
>>   
>>   	attach->dev = info->dev;
>>   	attach->dmabuf = dmabuf;
>> +	attach->peer2peer = info->peer2peer;
>>   	attach->priv = info->priv;
>>   	attach->invalidate = info->invalidate;
>>   
>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>> index 15dd8598bff1..1ef50bd9bc5b 100644
>> --- a/include/linux/dma-buf.h
>> +++ b/include/linux/dma-buf.h
>> @@ -313,6 +313,7 @@ struct dma_buf {
>>    * @dmabuf: buffer for this attachment.
>>    * @dev: device attached to the buffer.
>>    * @node: list of dma_buf_attachment.
>> + * @peer2peer: true if the importer can handle peer resources without pages.
>>    * @priv: exporter specific attachment data.
>>    *
>>    * This structure holds the attachment information between the dma_buf buffer
>> @@ -328,6 +329,7 @@ struct dma_buf_attachment {
>>   	struct dma_buf *dmabuf;
>>   	struct device *dev;
>>   	struct list_head node;
>> +	bool peer2peer;
>>   	void *priv;
>>   
>>   	/**
>> @@ -392,6 +394,7 @@ struct dma_buf_export_info {
>>    * @dmabuf:	the exported dma_buf
>>    * @dev:	the device which wants to import the attachment
>>    * @priv:	private data of importer to this attachment
>> + * @peer2peer:	true if the importer can handle peer resources without pages
>>    * @invalidate:	callback to use for invalidating mappings
>>    *
>>    * This structure holds the information required to attach to a buffer. Used
>> @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
>>   	struct dma_buf *dmabuf;
>>   	struct device *dev;
>>   	void *priv;
>> +	bool peer2peer;
>>   	void (*invalidate)(struct dma_buf_attachment *attach);
>>   };
>>   
>> -- 
>> 2.14.1
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-28 19:53                   ` Logan Gunthorpe
@ 2018-03-29 11:44                     ` Christian König
  2018-03-29 15:45                       ` Logan Gunthorpe
       [not found]                     ` <CADnq5_P-z=Noos_jaME9_CERri3C-m2hPPvx2bArr36O=1FnrA@mail.gmail.com>
  1 sibling, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-29 11:44 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media

Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>
> On 28/03/18 01:44 PM, Christian König wrote:
>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>> I can see it makes sure IOMMU is aware of the access route and
>> translates a CPU address into a PCI Bus address.
>> I'm using that with the AMD IOMMU driver and at least there it works
>> perfectly fine.
> Yes, it would be nice, but no arch has implemented this yet. We are just
> lucky in the x86 case because that arch is simple and doesn't need to do
> anything for P2P (partially due to the Bus and CPU addresses being the
> same). But in the general case, you can't rely on it.

Well, the fact that an arch hasn't implemented it doesn't mean that we 
don't have the right interface to do it.

>>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>>> to keep both the operation as well as the direction into account.
>>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>>> peer 2 peer...
>>>
>>>> For example when you can do writes between A and B that doesn't mean
>>>> that writes between B and A work. And reads are generally less likely to
>>>> work than writes. etc...
>>> If both devices are behind a switch then the PCI spec guarantees that A
>>> can both read and write B and vice versa.
>> Sorry to say that, but I know a whole bunch of PCI devices which
>> horrible ignores that.
> Can you elaborate? As far as the device is concerned it shouldn't know
> whether a request comes from a peer or from the host. If it does do
> crazy stuff like that it's well out of spec. It's up to the switch (or
> root complex if good support exists) to route the request to the device
> and it's the root complex that tends to be what drops the load requests
> which causes the asymmetries.

Devices integrated in the CPU usually only "claim" to be PCIe devices. 
In reality their memory request path goes directly through the integrated 
north bridge. The reason for this is simple: better throughput/latency.

That is hidden from the software; for example, the BIOS just allocates 
address space for the BARs as if it were a normal PCIe device.

The only crux is that when you then do peer2peer, your requests simply 
go into nirvana and are not handled by anything, because the BARs are 
only visible from the CPU side of the north bridge.

Regards,
Christian.

>
> Logan
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
       [not found]                     ` <CADnq5_P-z=Noos_jaME9_CERri3C-m2hPPvx2bArr36O=1FnrA@mail.gmail.com>
@ 2018-03-29 14:37                       ` Alex Deucher
  0 siblings, 0 replies; 83+ messages in thread
From: Alex Deucher @ 2018-03-29 14:37 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linaro-mm-sig, amd-gfx list, LKML, Maling list - DRI developers,
	linux-media

Sorry, didn't mean to drop the lists here. re-adding.

On Wed, Mar 28, 2018 at 4:05 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Wed, Mar 28, 2018 at 3:53 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>>
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
>
> Could we do something for the arches where it works?  I feel like peer
> to peer has dragged out for years because everyone is trying to boil
> the ocean for all arches.  There are a huge number of use cases for
> peer to peer on these "simple" architectures which actually represent
> a good deal of the users that want this.
>
> Alex
>
>>
>>>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>>>> to keep both the operation as well as the direction into account.
>>>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>>>> peer 2 peer...
>>>>
>>>>> For example when you can do writes between A and B that doesn't mean
>>>>> that writes between B and A work. And reads are generally less likely to
>>>>> work than writes. etc...
>>>> If both devices are behind a switch then the PCI spec guarantees that A
>>>> can both read and write B and vice versa.
>>>
>>> Sorry to say that, but I know a whole bunch of PCI devices which
>>> horrible ignores that.
>>
>> Can you elaborate? As far as the device is concerned it shouldn't know
>> whether a request comes from a peer or from the host. If it does do
>> crazy stuff like that it's well out of spec. It's up to the switch (or
>> root complex if good support exists) to route the request to the device
>> and it's the root complex that tends to be what drops the load requests
>> which causes the asymmetries.
>>
>> Logan
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-29 11:44                     ` Christian König
@ 2018-03-29 15:45                       ` Logan Gunthorpe
  2018-03-29 16:10                         ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-29 15:45 UTC (permalink / raw)
  To: christian.koenig, Christoph Hellwig
  Cc: linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media



On 29/03/18 05:44 AM, Christian König wrote:
> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
> 
> Well, that an arch hasn't implemented it doesn't mean that we don't have 
> the right interface to do it.

Yes, but right now we don't have a performant way to check if we are
doing P2P or not in the dma_map_X() wrappers. And this is necessary to
check if the DMA ops in use support it or not. We can't have the
dma_map_X() functions do the wrong thing because they don't support it yet.

> Devices integrated in the CPU usually only "claim" to be PCIe devices. 
> In reality their memory request path goes directly through the integrated 
> north bridge. The reason for this is simple: better throughput/latency.

These are just more reasons why our patchset restricts to devices behind
a switch. And more mess for someone to deal with if they need to relax
that restriction.

Logan


* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-29 15:45                       ` Logan Gunthorpe
@ 2018-03-29 16:10                         ` Christian König
  2018-03-29 16:25                           ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-03-29 16:10 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig
  Cc: linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media

Am 29.03.2018 um 17:45 schrieb Logan Gunthorpe:
>
> On 29/03/18 05:44 AM, Christian König wrote:
>> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>> On 28/03/18 01:44 PM, Christian König wrote:
>>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>>> I can see it makes sure IOMMU is aware of the access route and
>>>> translates a CPU address into a PCI Bus address.
>>>> I'm using that with the AMD IOMMU driver and at least there it works
>>>> perfectly fine.
>>> Yes, it would be nice, but no arch has implemented this yet. We are just
>>> lucky in the x86 case because that arch is simple and doesn't need to do
>>> anything for P2P (partially due to the Bus and CPU addresses being the
>>> same). But in the general case, you can't rely on it.
>> Well, that an arch hasn't implemented it doesn't mean that we don't have
>> the right interface to do it.
> Yes, but right now we don't have a performant way to check if we are
> doing P2P or not in the dma_map_X() wrappers.

Why not? I mean the dma_map_resource() function is for P2P while other 
dma_map_* functions are only for system memory.

> And this is necessary to
> check if the DMA ops in use support it or not. We can't have the
> dma_map_X() functions do the wrong thing because they don't support it yet.

Well, that sounds like we should just return an error from 
dma_map_resource() when an architecture doesn't support P2P yet, as Alex 
suggested.
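That suggestion can be sketched as follows. This is a userspace model with invented names (dma_map_ops_model, model_dma_map_resource, and the two ops tables are all illustrative, not the kernel's actual dma_map_ops plumbing): an architecture that has not implemented a ->map_resource hook reports a mapping error the caller can detect, instead of silently handing back an unusable address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for kernel types; illustrative only. */
typedef uint64_t dma_addr_t;
typedef uint64_t phys_addr_t;
#define DMA_MAPPING_ERROR ((dma_addr_t)~0ULL)

struct dma_map_ops_model {
	/* NULL means the arch has not implemented P2P mapping yet. */
	dma_addr_t (*map_resource)(phys_addr_t phys, size_t size);
};

/* e.g. x86-like: the bus address equals the CPU physical address. */
static dma_addr_t identity_map_resource(phys_addr_t phys, size_t size)
{
	(void)size;
	return (dma_addr_t)phys;
}

static const struct dma_map_ops_model x86_like_ops = {
	.map_resource = identity_map_resource,
};

static const struct dma_map_ops_model unported_arch_ops = {
	.map_resource = NULL,
};

/* Model of the proposed behavior: error out instead of guessing. */
static dma_addr_t model_dma_map_resource(const struct dma_map_ops_model *ops,
					 phys_addr_t phys, size_t size)
{
	if (!ops->map_resource)
		return DMA_MAPPING_ERROR;	/* caller must take a fallback path */
	return ops->map_resource(phys, size);
}
```

A caller would then check for the error value and take a fallback path (e.g. bouncing through system memory), which is the kind of fallback GPU drivers already have.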

>> Devices integrated in the CPU usually only "claim" to be PCIe devices.
>> In reality their memory request path goes directly through the integrated
>> north bridge. The reason for this is simply better throughput/latency.
> These are just more reasons why our patchset restricts to devices behind
> a switch. And more mess for someone to deal with if they need to relax
> that restriction.

You don't seem to understand the implications: The devices do have a 
common upstream bridge! In other words your code would currently claim 
that P2P is supported, but in practice it doesn't work.

You need to include both drivers which participate in the P2P 
transaction to make sure that both support this and give them the 
opportunity to chicken out, and in the case of AMD APUs even redirect the 
request to another location (e.g. participate in the DMA translation).

DMA-buf fortunately seems to handle all this already, which is why we 
chose it as the base for our implementation.
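The negotiation described here can be sketched as a toy model (this is not the dma-buf API; exporter_model, exporter_negotiate, and the reply codes are all invented for illustration): the exporting driver is consulted before the transfer and may refuse P2P outright, or redirect the importer away from its PCIe BAR to the system memory actually backing it, as an APU would.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

enum p2p_reply {
	P2P_OK,			/* write the BAR directly */
	P2P_REDIRECT,		/* write the backing system memory instead */
	P2P_UNSUPPORTED,	/* exporter chickens out entirely */
};

/* Invented stand-in for the exporting driver's state. */
struct exporter_model {
	int supports_p2p;
	int backed_by_system_memory;	/* e.g. APU BAR backed by stolen RAM */
	dma_addr_t bar_addr;
	dma_addr_t sysmem_addr;
};

/* Both sides participate: the exporter gets to veto or redirect. */
static enum p2p_reply exporter_negotiate(const struct exporter_model *ex,
					 dma_addr_t *target)
{
	if (!ex->supports_p2p)
		return P2P_UNSUPPORTED;
	if (ex->backed_by_system_memory) {
		*target = ex->sysmem_addr;	/* "don't write my BAR" */
		return P2P_REDIRECT;
	}
	*target = ex->bar_addr;
	return P2P_OK;
}
```

The point of the model is only that the target address comes out of a per-exporter callback rather than out of topology inspection alone.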

Regards,
Christian.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-29 16:10                         ` Christian König
@ 2018-03-29 16:25                           ` Logan Gunthorpe
  2018-03-29 18:15                             ` Christian König
  2018-03-30  1:58                             ` Jerome Glisse
  0 siblings, 2 replies; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-29 16:25 UTC (permalink / raw)
  To: Christian König, Christoph Hellwig
  Cc: linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media,
	Bjorn Helgaas



On 29/03/18 10:10 AM, Christian König wrote:
> Why not? I mean the dma_map_resource() function is for P2P while other 
> dma_map_* functions are only for system memory.

Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
P2P. Though it's a bit odd, seeing as we've been working under the
assumption that PCI P2P is different because it has to translate the PCI
bus address, whereas P2P for devices on other buses is a big unknown.

>> And this is necessary to
>> check if the DMA ops in use support it or not. We can't have the
>> dma_map_X() functions do the wrong thing because they don't support it yet.
> 
> Well that sounds like we should just return an error from 
> dma_map_resources() when an architecture doesn't support P2P yet as Alex 
> suggested.

Yes, well except in our patch-set we can't easily use
dma_map_resource() as we either have SGLs to deal with or we need to
create whole new interfaces to a number of subsystems.

> You don't seem to understand the implications: The devices do have a 
> common upstream bridge! In other words your code would currently claim 
> that P2P is supported, but in practice it doesn't work.

Do they? They don't on any of the Intel machines I'm looking at. The
previous version of the patchset not only required a common upstream
bridge but two layers of upstream bridges on both devices which would
effectively limit transfers to PCIe switches only. But Bjorn did not
like this.

> You need to include both drivers which participate in the P2P 
> transaction to make sure that both supports this and give them 
> opportunity to chicken out and in the case of AMD APUs even redirect the 
> request to another location (e.g. participate in the DMA translation).

I don't think it's the driver's responsibility to reject P2P. The
topology is what governs whether it is supported or not. The discussions
we had with Bjorn settled on this: if the devices are all behind the same
bridge, they can communicate with each other. This is essentially
guaranteed by the PCI spec.
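The topology check being argued over can be sketched like this (an illustrative model, not the actual pci_find_common_upstream_dev() implementation; dev_model and common_upstream are invented names): walk each device's chain of upstream bridges and look for a shared node.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a PCI device with an upstream parent. */
struct dev_model {
	struct dev_model *parent;	/* upstream bridge; NULL at the root */
};

/* Return the first bridge found upstream of both devices, or NULL. */
static struct dev_model *common_upstream(struct dev_model *a,
					 struct dev_model *b)
{
	struct dev_model *up_a, *up_b;

	for (up_a = a->parent; up_a; up_a = up_a->parent)
		for (up_b = b->parent; up_b; up_b = up_b->parent)
			if (up_a == up_b)
				return up_a;
	return NULL;
}
```

Note that in this model two devices under different root ports still "share" the root complex, which is exactly the case being debated: a common upstream node alone does not guarantee working P2P.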

> DMA-buf fortunately seems to handle all this already, that's why we 
> choose it as base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working
with, as neither the block layer nor the RDMA subsystem has any
interfaces for it.

Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-29 16:25                           ` Logan Gunthorpe
@ 2018-03-29 18:15                             ` Christian König
  2018-03-30  1:58                             ` Jerome Glisse
  1 sibling, 0 replies; 83+ messages in thread
From: Christian König @ 2018-03-29 18:15 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig
  Cc: linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media,
	Bjorn Helgaas

Am 29.03.2018 um 18:25 schrieb Logan Gunthorpe:
>
> On 29/03/18 10:10 AM, Christian König wrote:
>> Why not? I mean the dma_map_resource() function is for P2P while other
>> dma_map_* functions are only for system memory.
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address. Where as P2P for devices on other buses is a big unknown.

Yeah, completely agree. On my TODO list (but rather far down) is 
actually supporting P2P with USB devices.

And no, I don't have the slightest idea how to do this at the moment.

>>> And this is necessary to
>>> check if the DMA ops in use support it or not. We can't have the
>>> dma_map_X() functions do the wrong thing because they don't support it yet.
>> Well that sounds like we should just return an error from
>> dma_map_resources() when an architecture doesn't support P2P yet as Alex
>> suggested.
> Yes, well except in our patch-set we can't easily use
> dma_map_resources() as we either have SGLs to deal with or we need to
> create whole new interfaces to a number of subsystems.

I agree as well. I was also clearly in favor of extending the SGLs to 
have a flag for this instead of the dma_map_resource() interface, but for 
some reason that didn't make it into the kernel.

>> You don't seem to understand the implications: The devices do have a
>> common upstream bridge! In other words your code would currently claim
>> that P2P is supported, but in practice it doesn't work.
> Do they? They don't on any of the Intel machines I'm looking at. The
> previous version of the patchset not only required a common upstream
> bridge but two layers of upstream bridges on both devices which would
> effectively limit transfers to PCIe switches only. But Bjorn did not
> like this.

At least to me that sounds like a good idea; it would at least disable 
the (incorrect) auto-detection of P2P for such devices.

>> You need to include both drivers which participate in the P2P
>> transaction to make sure that both supports this and give them
>> opportunity to chicken out and in the case of AMD APUs even redirect the
>> request to another location (e.g. participate in the DMA translation).
> I don't think it's the drivers responsibility to reject P2P . The
> topology is what governs support or not. The discussions we had with
> Bjorn settled on if the devices are all behind the same bridge they can
> communicate with each other. This is essentially guaranteed by the PCI spec.

Well, it is not only about rejecting P2P. You see, the devices I need to 
worry about are essentially part of the CPU. Their resources look like a 
PCI BAR to the BIOS and OS, but are actually backed by stolen system memory.

So, as crazy as it sounds, what you get is an operation which starts as 
P2P, but then the GPU driver sees it and says: hey, please don't write 
that to my PCIe BAR, but rather to system memory location X.

>> DMA-buf fortunately seems to handle all this already, that's why we
>> choose it as base for our implementation.
> Well, unfortunately DMA-buf doesn't help for the drivers we are working
> with as neither the block layer nor the RDMA subsystem have any
> interfaces for it.

A fact that gives me quite some sleepless nights as well. I think we 
sooner or later need to extend those interfaces to work with DMA-bufs as 
well.

I will try to give your patch set a review when I'm back from vacation 
and rebase my DMA-buf work on top of that.

Regards,
Christian.

>
> Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-29 16:25                           ` Logan Gunthorpe
  2018-03-29 18:15                             ` Christian König
@ 2018-03-30  1:58                             ` Jerome Glisse
  2018-03-30  6:33                               ` Christoph Hellwig
  2018-03-30 18:46                               ` Logan Gunthorpe
  1 sibling, 2 replies; 83+ messages in thread
From: Jerome Glisse @ 2018-03-30  1:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Christoph Hellwig, linaro-mm-sig, amd-gfx,
	linux-kernel, dri-devel, linux-media, Bjorn Helgaas

On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 29/03/18 10:10 AM, Christian König wrote:
> > Why not? I mean the dma_map_resource() function is for P2P while other 
> > dma_map_* functions are only for system memory.
> 
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address. Where as P2P for devices on other buses is a big unknown.

dma_map_resource() is the right API (though its current implementation
is filled with x86 assumptions). So I would argue that an arch can decide
to implement it or simply return a DMA error address, which triggers a
fallback path in the caller (at least for GPU drivers). An SG variant can
be added on top.

dma_map_resource() is the right API because it has all the necessary
information. It uses the CPU physical address as the common address
"language": from the CPU physical address of a PCIe BAR to map to
another device, you can find the corresponding bus address in the IOMMU
code (a NOP on x86). So dma_map_resource() knows both the source device
which exports its PCIe BAR and the destination device.


So I don't think Christian needs to base his patchset on yours. This
is orthogonal. Christian is using an existing upstream API; if it is
broken on some arch, it is up to that arch to fix it, or to return an
error if it is not fixable. In fact I would assume that you would need
to base your patchset on top of dma_map_resource() too ...

Moreover, I doubt Christian wants to have struct page for this. For
nouveau there will be HMM struct pages, and this would conflict.

AFAICT you need struct page because the APIs you are trying to expose
to user space rely on the "regular" filesystem/RDMA umem APIs, which all
assume struct page. GPU drivers do not have this requirement, and it
should not be imposed on them.


So from my point of view Christian's patchset is good as it is, modulo
better commit messages.


Bottom line, I think we need a common PCIe helper for P2P, and the
current dma_map_resource() is the right kind from my POV. What you are
doing with struct page is specific to your subsystem and should not be
imposed on others.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-30  1:58                             ` Jerome Glisse
@ 2018-03-30  6:33                               ` Christoph Hellwig
  2018-03-30 15:25                                 ` Jerome Glisse
  2018-03-30 18:46                               ` Logan Gunthorpe
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-03-30  6:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Logan Gunthorpe, Christian König, Christoph Hellwig,
	linaro-mm-sig, amd-gfx, linux-kernel, dri-devel, linux-media,
	Bjorn Helgaas

On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.

It isn't in general.  It doesn't integrate with scatterlists (see my
comment to page one), and it doesn't integrate with all the subsystems
that also need a kernel virtual address.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-30  6:33                               ` Christoph Hellwig
@ 2018-03-30 15:25                                 ` Jerome Glisse
  0 siblings, 0 replies; 83+ messages in thread
From: Jerome Glisse @ 2018-03-30 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Logan Gunthorpe, Christian König, Will Davis, linaro-mm-sig,
	amd-gfx, linux-kernel, dri-devel, linux-media, Bjorn Helgaas

On Thu, Mar 29, 2018 at 11:33:34PM -0700, Christoph Hellwig wrote:
> On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> > dma_map_resource() is the right API (thought its current implementation
> > is fill with x86 assumptions). So i would argue that arch can decide to
> > implement it or simply return dma error address which trigger fallback
> > path into the caller (at least for GPU drivers). SG variant can be added
> > on top.
> 
> It isn't in general.  It doesn't integrate with scatterlists (see my
> comment to page one), and it doesn't integrate with all the subsystems
> that also need a kernel virtual address.

IIRC an SG variant was proposed at the time dma_map_resource() was
added; I don't know why it did not make it (again from memory, I think it
was because it grows the scatterlist struct size). My point is more that
people who need the SG variant should work on adding it. People who do
not should not be stopped by the lack of it. There is something there
upstream; it does not make sense not to use it. The dma-buf
infrastructure is useful to many drivers.

If you do not want to share the underlying logic of dma_map_resource(),
fine (i.e. not sharing code inside drivers/iommu/*). I thought it would
be a good thing to share code ...

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-30  1:58                             ` Jerome Glisse
  2018-03-30  6:33                               ` Christoph Hellwig
@ 2018-03-30 18:46                               ` Logan Gunthorpe
  2018-03-30 19:45                                 ` Jerome Glisse
  1 sibling, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-03-30 18:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Christoph Hellwig, linaro-mm-sig, amd-gfx,
	linux-kernel, dri-devel, linux-media, Bjorn Helgaas



On 29/03/18 07:58 PM, Jerome Glisse wrote:
> On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 29/03/18 10:10 AM, Christian König wrote:
>>> Why not? I mean the dma_map_resource() function is for P2P while other 
>>> dma_map_* functions are only for system memory.
>>
>> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
>> P2P. Though it's a bit odd seeing we've been working under the
>> assumption that PCI P2P is different as it has to translate the PCI bus
>> address. Where as P2P for devices on other buses is a big unknown.
> 
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.
> 
> dma_map_resource() is the right API because it has all the necessary
> informations. It use the CPU physical address as the common address
> "language" with CPU physical address of PCIE bar to map to another
> device you can find the corresponding bus address from the IOMMU code
> (NOP on x86). So dma_map_resource() knows both the source device which
> export its PCIE bar and the destination devices.

Well, as it is today, it doesn't look very sane to me. The default is to
just return the physical address if the architecture doesn't support it.
So if someone tries this on an arch that hasn't opted in to returning an
error, they're very likely going to just end up DMAing to an invalid
address and losing the data or causing a machine check.

Furthermore, the API does not have all the information it needs to do
sane things. A phys_addr_t doesn't really tell you anything about the
memory behind it or what needs to be done with it. For example, on some
arm64 implementations, if the physical address points to a PCI BAR and
that BAR is behind a switch with the DMA device, then the address must be
converted to the PCI bus address. On the other hand, if it's the physical
address of a device in an SoC, it might need to be handled in a
completely different way. And right now all the implementations I can
find seem to just assume that phys_addr_t points to regular memory and
can be treated as such.
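The arm64 example can be made concrete with a sketch (the window values here are invented; a real kernel derives them from the host bridge's ranges): the same phys_addr_t translates to a different PCI bus address depending on which bridge window it falls in, so the raw physical address alone does not tell the mapper what to emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;
typedef uint64_t pci_bus_addr_t;
#define BUS_ADDR_INVALID ((pci_bus_addr_t)~0ULL)

/* One CPU-physical-to-bus-address translation window of a host bridge. */
struct bridge_window {
	phys_addr_t cpu_start;
	uint64_t size;
	pci_bus_addr_t bus_start;
};

static pci_bus_addr_t phys_to_bus(const struct bridge_window *win, size_t n,
				  phys_addr_t phys)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (phys >= win[i].cpu_start &&
		    phys - win[i].cpu_start < win[i].size)
			return win[i].bus_start + (phys - win[i].cpu_start);
	return BUS_ADDR_INVALID;	/* not a BAR behind this bridge */
}
```

On an x86-like identity-mapped system the window would be trivial; the point is that the translation is a property of the bridge, not of the address.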

This is one of the reasons that, based on feedback, our work went from
being general P2P with any device to being restricted to only P2P with
PCI devices. The dream that we can just grab the physical address of any
device and use it in a DMA request is simply not realistic.

Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-30 18:46                               ` Logan Gunthorpe
@ 2018-03-30 19:45                                 ` Jerome Glisse
  2018-04-02 17:02                                   ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Jerome Glisse @ 2018-03-30 19:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas

On Fri, Mar 30, 2018 at 12:46:42PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 29/03/18 07:58 PM, Jerome Glisse wrote:
> > On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 29/03/18 10:10 AM, Christian König wrote:
> >>> Why not? I mean the dma_map_resource() function is for P2P while other 
> >>> dma_map_* functions are only for system memory.
> >>
> >> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> >> P2P. Though it's a bit odd seeing we've been working under the
> >> assumption that PCI P2P is different as it has to translate the PCI bus
> >> address. Where as P2P for devices on other buses is a big unknown.
> > 
> > dma_map_resource() is the right API (thought its current implementation
> > is fill with x86 assumptions). So i would argue that arch can decide to
> > implement it or simply return dma error address which trigger fallback
> > path into the caller (at least for GPU drivers). SG variant can be added
> > on top.
> > 
> > dma_map_resource() is the right API because it has all the necessary
> > informations. It use the CPU physical address as the common address
> > "language" with CPU physical address of PCIE bar to map to another
> > device you can find the corresponding bus address from the IOMMU code
> > (NOP on x86). So dma_map_resource() knows both the source device which
> > export its PCIE bar and the destination devices.
> 
> Well, as it is today, it doesn't look very sane to me. The default is to
> just return the physical address if the architecture doesn't support it.
> So if someone tries this on an arch that hasn't added itself to return
> an error they're very likely going to just end up DMAing to an invalid
> address and loosing the data or causing a machine check.

Looking at the upstream code, it seems that the x86 bits never made it
upstream, and thus what is now upstream is only for ARM. See [1] for the
x86 code. I don't know what happened; I was convinced it got merged. So
yes, the current code is broken on x86. CCing Joerg, maybe he remembers
what happened there.

[1] https://lwn.net/Articles/646605/

> 
> Furthermore, the API does not have all the information it needs to do
> sane things. A phys_addr_t doesn't really tell you anything about the
> memory behind it or what needs to be done with it. For example, on some
> arm64 implementations if the physical address points to a PCI BAR and
> that BAR is behind a switch with the DMA device then the address must be
> converted to the PCI BUS address. On the other hand, if it's a physical
> address of a device in an SOC it might need to  be handled in a
> completely different way. And right now all the implementations I can
> find seem to just assume that phys_addr_t points to regular memory and
> can be treated as such.

Given it is currently only used by ARM folks, it appears to at least work
for them (tm) :) Note that Christian is doing this in a PCIe-only context,
and again dma_map_resource() can easily figure out whether the address is
PCIe or something else. Note that the exporter exports the CPU bus
address. So again dma_map_resource() has all the information it will ever
need; if the peer-to-peer is fundamentally un-doable it can return a DMA
error, and it is up to the caller to handle this, just like GPU code does.

Do you claim that dma_map_resource() is missing any information?


> 
> This is one of the reasons that, based on feedback, our work went from
> being general P2P with any device to being restricted to only P2P with
> PCI devices. The dream that we can just grab the physical address of any
> device and use it in a DMA request is simply not realistic.

I agree and understand that, but for platforms where such a feature makes
sense this will work. For me that is PowerPC and x86, and given PowerPC
has CAPI, which has far more advanced features when it comes to
peer-to-peer, I don't see something more basic not working. On x86, Intel
is a bit of a lone wolf; I don't know if they are going to support this
use case pro-actively. AMD definitely will.

If you feel like dma_map_resource() can be interpreted too broadly,
stricter phrasing/wording can be added to it so people better understand
its limitations and gotchas.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-03-30 19:45                                 ` Jerome Glisse
@ 2018-04-02 17:02                                   ` Logan Gunthorpe
  2018-04-02 17:20                                     ` Jerome Glisse
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-04-02 17:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas



On 30/03/18 01:45 PM, Jerome Glisse wrote:
> Looking at upstream code it seems that the x86 bits never made it upstream
> and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> what happen, i was convince it got merge. So yes current code is broken on
> x86. ccing Joerg maybe he remembers what happened there.
> 
> [1] https://lwn.net/Articles/646605/

That looks like a significant improvement over what's upstream. But it's
three years old and looks to have been abandoned. I think I agree with
Bjorn's comments in that if it's going to be a general P2P API it will
need the device the resource belongs to in addition to the device that's
initiating the DMA request.

> Given it is currently only used by ARM folks it appear to at least work
> for them (tm) :) Note that Christian is doing this in PCIE only context
> and again dma_map_resource can easily figure that out if the address is
> a PCIE or something else. Note that the exporter export the CPU bus
> address. So again dma_map_resource has all the informations it will ever
> need, if the peer to peer is fundamentaly un-doable it can return dma
> error and it is up to the caller to handle this, just like GPU code do.
> 
> Do you claim that dma_map_resource is missing any information ?

Yes, that's what I said. All the existing ARM code appears to use it for
platform devices only. I suspect platform P2P is relatively trivial to
support on ARM. I think it's a big difference from using it for PCI P2P
or general P2P on any bus.

> I agree and understand that but for platform where such feature make sense
> this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> which has far more advance feature when it comes to peer to peer, i don't
> see something more basic not working. On x86, Intel is a bit of lone wolf,
> dunno if they gonna support this usecase pro-actively. AMD definitly will.

Well PowerPC doesn't even support P2P between root ports. And I fail to
see how CAPI applies unless/until we get this memory mapped into
userspace and the mappings need to be dynamically managed. Seems like
that's a long way away.

Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-04-02 17:02                                   ` Logan Gunthorpe
@ 2018-04-02 17:20                                     ` Jerome Glisse
  2018-04-02 17:37                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Jerome Glisse @ 2018-04-02 17:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas

On Mon, Apr 02, 2018 at 11:02:10AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 30/03/18 01:45 PM, Jerome Glisse wrote:
> > Looking at upstream code it seems that the x86 bits never made it upstream
> > and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> > what happen, i was convince it got merge. So yes current code is broken on
> > x86. ccing Joerg maybe he remembers what happened there.
> > 
> > [1] https://lwn.net/Articles/646605/
> 
> That looks like a significant improvement over what's upstream. But it's
> three years old and looks to have been abandoned. I think I agree with
> Bjorn's comments in that if it's going to be a general P2P API it will
> need the device the resource belongs to in addition to the device that's
> initiating the DMA request.

The point I have been trying to get across is that you do have this
information with dma_map_resource(): you know the device to which you
are trying to map (the dev argument to dma_map_resource()), and you can
easily get the device to which the memory belongs, because you have the
CPU physical address of the memory and can hence look up the resource
and get the device from that.


> > Given it is currently only used by ARM folks it appear to at least work
> > for them (tm) :) Note that Christian is doing this in PCIE only context
> > and again dma_map_resource can easily figure that out if the address is
> > a PCIE or something else. Note that the exporter export the CPU bus
> > address. So again dma_map_resource has all the informations it will ever
> > need, if the peer to peer is fundamentaly un-doable it can return dma
> > error and it is up to the caller to handle this, just like GPU code do.
> > 
> > Do you claim that dma_map_resource is missing any information ?
> 
> Yes, that's what I said. All the existing ARM code appears to use it for
> platform devices only. I suspect platform P2P is relatively trivial to
> support on ARM. I think it's a big difference from using it for PCI P2P
> or general P2P on any bus.
> 

It does have all we need for PCIe when using the DMA API and not the SG
one. The SG issue IIRC is that it assumes struct page ... See above for
the device lookup.

> > I agree and understand that but for platform where such feature make sense
> > this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> > which has far more advance feature when it comes to peer to peer, i don't
> > see something more basic not working. On x86, Intel is a bit of lone wolf,
> > dunno if they gonna support this usecase pro-actively. AMD definitly will.
> 
> Well PowerPC doesn't even support P2P between root ports. And I fail to
> see how CAPI applies unless/until we get this memory mapped into
> userspace and the mappings need to be dynamically managed. Seems like
> that's a long way away.

IIRC CAPI makes P2P mandatory, but maybe that is NVLink. We can ask
the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

Mapping to userspace has nothing to do with this. I am talking at the
hardware level. How things are exposed to userspace is a completely
different problem that does not have one solution that fits all. For GPUs
you want this to be under the total control of the GPU drivers. For
storage like persistent memory, you might want to expose it to userspace
more directly ...

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-04-02 17:20                                     ` Jerome Glisse
@ 2018-04-02 17:37                                       ` Logan Gunthorpe
  2018-04-02 19:16                                         ` Jerome Glisse
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-04-02 17:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas



On 02/04/18 11:20 AM, Jerome Glisse wrote:
> The point i have been trying to get accross is that you do have this
> information with dma_map_resource() you know the device to which you
> are trying to map (dev argument to dma_map_resource()) and you can
> easily get the device to which the memory belongs because you have the
> CPU physical address of the memory hence you can lookup the resource
> and get the device from that.

How do you go from a physical address to a struct device generally and
in a performant manner?

> IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

PowerPC folks recently told us specifically that Power9 does not support
P2P between PCI root ports. I've said this many times. CAPI has nothing
to do with it.

> Mapping to userspace have nothing to do here. I am talking at hardware
> level. How thing are expose to userspace is a completely different
> problems that do not have one solution fit all. For GPU you want this
> to be under total control of GPU drivers. For storage like persistent
> memory, you might want to expose it userspace more directly ...

My understanding (and I worked on this a while ago) is that CAPI
hardware manages memory maps typically for userspace memory. When a
userspace program changes its mapping, the CAPI hardware is updated so
that hardware is coherent with the user address space and it is safe to
DMA to any address without having to pin memory. (This is very similar
to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
problems we've been discussing. Moreover, many developers want to keep
P2P in-kernel, for the time being, where the problem of pinning memory
does not exist.

Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-04-02 17:37                                       ` Logan Gunthorpe
@ 2018-04-02 19:16                                         ` Jerome Glisse
  2018-04-02 19:32                                           ` Logan Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Jerome Glisse @ 2018-04-02 19:16 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas

On Mon, Apr 02, 2018 at 11:37:07AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 11:20 AM, Jerome Glisse wrote:
> > The point I have been trying to get across is that you do have this
> > information with dma_map_resource(): you know the device to which you
> > are trying to map (the dev argument to dma_map_resource()), and you can
> > easily get the device to which the memory belongs because you have the
> > CPU physical address of the memory, hence you can look up the resource
> > and get the device from that.
> 
> How do you go from a physical address to a struct device generally and
> in a performant manner?

There isn't a good API at the moment AFAIK; the closest things would be
lookup_resource() or region_intersects(), but a more appropriate one can
easily be added, as the code to walk down the tree is readily available.
Moreover, this can be optimized the way VMA lookups are, even more so
because resources are seldom added, so the read side (finding a resource)
can be heavily favored over the write side (adding/registering a new
resource).
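
Jerome's suggested lookup could look roughly like the sketch below. This is
only a sketch: lookup_resource() and iomem_resource exist today, but the
dev back-pointer on struct resource is hypothetical and would have to be
added first, as discussed later in this thread.

```c
/*
 * Hypothetical helper, not an existing kernel API: map a CPU physical
 * address back to the device that owns it.  Assumes struct resource
 * gains a 'dev' back-pointer, which it does not have today.
 */
static struct device *dev_for_phys_addr(resource_size_t addr)
{
	struct resource *res;

	/* lookup_resource() finds the resource starting at 'addr'. */
	res = lookup_resource(&iomem_resource, addr);
	if (!res)
		return NULL;

	return res->dev;	/* assumed new field */
}
```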

> 
> > IIRC CAPI makes P2P mandatory, but maybe that is with NVLink. We can ask
> > the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.
> 
> PowerPC folks recently told us specifically that Power9 does not support
> P2P between PCI root ports. I've said this many times. CAPI has nothing
> to do with it.

I need to check CAPI; I must have confused it with NVLink, which is also
present on some PowerPC archs.

> 
> > Mapping to userspace has nothing to do with this. I am talking at the
> > hardware level. How things are exposed to userspace is a completely
> > different problem that does not have a one-size-fits-all solution. For
> > GPUs you want this to be under the total control of the GPU drivers. For
> > storage like persistent memory, you might want to expose it to userspace
> > more directly ...
> 
> My understanding (and I worked on this a while ago) is that CAPI
> hardware manages memory maps typically for userspace memory. When a
> userspace program changes its mapping, the CAPI hardware is updated so
> that hardware is coherent with the user address space and it is safe to
> DMA to any address without having to pin memory. (This is very similar
> to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
> problems we've been discussing. Moreover, many developers want to keep
> P2P in-kernel, for the time being, where the problem of pinning memory
> does not exist.

What you describe is the ATS (Address Translation Service)/PASID (Process
Address Space IDentifier) part of CAPI, which has also been available
for years on the AMD x86 platform (AMD IOMMU v2), though it is barely ever
used. The interesting aspect of CAPI is its cache-coherency protocol between
devices and CPUs. This works in both directions: the usual device access to
system memory can be cache coherent with CPU access and participate in the
cache-coherency protocol (a bit further than PCIe snooping). But in the
other direction, CPU access to device memory can also be cache coherent,
which is not the case with PCIe.

This cache coherency between CPU and device is what made me assume that
CAPI must have peer-to-peer support, since peers must be able to talk to
each other for cache-coherency purposes. But maybe all cache-coherency
arbitration goes through a central directory, alleviating the peer-to-peer
requirement.

Anyway, like you said, this does not matter for the discussion.
dma_map_resource() can just be stubbed out on platforms that do not support
this, and they would not allow it. If it gets used on other platforms and
shows enough advantages that users start asking for it, then maybe those
platforms will pay attention to the hardware requirement.

Note that with mmu_notifier there isn't any need to pin stuff (even
without any special hardware capabilities), as long as you can preempt
what is happening on your hardware to update its page table.
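
A minimal sketch of the mmu_notifier approach Jerome describes, assuming
4.16-era notifier signatures; my_dev, my_dev_stop_dma() and
my_dev_update_pagetable() are hypothetical driver hooks, not existing
kernel functions:

```c
/* Sketch: track CPU address-space changes instead of pinning pages. */
static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct my_dev *mdev = container_of(mn, struct my_dev, notifier);

	/* Preempt device work touching [start, end) and drop the now
	 * stale entries from the device page table. */
	my_dev_stop_dma(mdev, start, end);
	my_dev_update_pagetable(mdev, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start	= my_invalidate_range_start,
};

/* Registered once per process via mmu_notifier_register(&mdev->notifier, mm). */
```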

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-04-02 19:16                                         ` Jerome Glisse
@ 2018-04-02 19:32                                           ` Logan Gunthorpe
  2018-04-02 19:45                                             ` Jerome Glisse
  0 siblings, 1 reply; 83+ messages in thread
From: Logan Gunthorpe @ 2018-04-02 19:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas



On 02/04/18 01:16 PM, Jerome Glisse wrote:
> There isn't a good API at the moment AFAIK; the closest things would be
> lookup_resource() or region_intersects(), but a more appropriate one can
> easily be added, as the code to walk down the tree is readily available.
> Moreover, this can be optimized the way VMA lookups are, even more so
> because resources are seldom added, so the read side (finding a resource)
> can be heavily favored over the write side (adding/registering a new
> resource).

So someone needs to create a highly optimized tree that registers all
physical addresses on the system and maps them to devices? That seems a
long way from being realized. I'd hardly characterize that as "easily".
If we can pass both devices to the API I'd suspect it would be preferred
over the complicated tree. This, of course, depends on what users of the
API need.

> cache-coherency protocol (a bit further than PCIe snooping). But in the
> other direction, CPU access to device memory can also be cache coherent,
> which is not the case with PCIe.

I was not aware that CAPI allows PCI device memory to be cache coherent.
That sounds like it would be very tricky...

> Note that with mmu_notifier there isn't any need to pin stuff (even
> without any special hardware capabilities), as long as you can preempt
> what is happening on your hardware to update its page table.

I've been told there's a lot of dislike of the mmu_notifier interface.
And being able to preempt what's happening on hardware, generally, is
not trivial. But, yes, this is essentially how ODP works.

Logan

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
  2018-04-02 19:32                                           ` Logan Gunthorpe
@ 2018-04-02 19:45                                             ` Jerome Glisse
  0 siblings, 0 replies; 83+ messages in thread
From: Jerome Glisse @ 2018-04-02 19:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Christoph Hellwig, Will Davis,
	Joerg Roedel, linaro-mm-sig, amd-gfx, linux-kernel, dri-devel,
	linux-media, Bjorn Helgaas

On Mon, Apr 02, 2018 at 01:32:37PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 01:16 PM, Jerome Glisse wrote:
> > There isn't a good API at the moment AFAIK; the closest things would be
> > lookup_resource() or region_intersects(), but a more appropriate one can
> > easily be added, as the code to walk down the tree is readily available.
> > Moreover, this can be optimized the way VMA lookups are, even more so
> > because resources are seldom added, so the read side (finding a resource)
> > can be heavily favored over the write side (adding/registering a new
> > resource).
> 
> So someone needs to create a highly optimized tree that registers all
> physical addresses on the system and maps them to devices? That seems a
> long way from being realized. I'd hardly characterize that as "easily".
> If we can pass both devices to the API I'd suspect it would be preferred
> over the complicated tree. This, of course, depends on what users of the
> API need.

This tree already exists; it is all there upstream, see kernel/resource.c.
What is missing is something that takes a single address and returns the
device struct. There is a function that takes a range, region_intersects(),
and one that takes the start address, lookup_resource(). It isn't hard to
imagine that, using roughly the same code as region_intersects(), a helper
that returns the device for a resource can be added.

And yes, currently this does not have a pointer back to the device that
owns the resource, but that can be added. It wasn't needed until now.

It can later be optimized if device lookup shows up as a bottleneck in
perf profiles.


> 
> > cache-coherency protocol (a bit further than PCIe snooping). But in the
> > other direction, CPU access to device memory can also be cache coherent,
> > which is not the case with PCIe.
> 
> I was not aware that CAPI allows PCI device memory to be cache coherent.
> That sounds like it would be very tricky...

And yet CAPI, CCIX, Gen-Z, NVLink, ... are all interconnects that aim at
achieving this cache coherency between multiple devices and CPUs.

Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-03-29 11:34     ` Christian König
@ 2018-04-03  9:09       ` Daniel Vetter
  2018-04-03 17:06         ` Jerome Glisse
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-03  9:09 UTC (permalink / raw)
  To: christian.koenig
  Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel

On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > Add a peer2peer flag noting that the importer can deal with device
> > > resources which are not backed by pages.
> > > 
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > Um strictly speaking they all should, but ttm never bothered to use the
> > real interfaces but just hacked around the provided sg list, grabbing the
> > underlying struct pages, then rebuilding&remapping the sg list again.
> 
> Actually that isn't correct. TTM converts them to a dma address array
> because drivers need it like this (at least nouveau, radeon and amdgpu).
> 
> I've fixed radeon and amdgpu to be able to deal without it and mailed with
> Ben about nouveau, but the outcome is they don't really know.
> 
> TTM itself doesn't have any need for the pages on imported BOs (you can't
> mmap them anyway); the real underlying problem is that sg tables don't
> provide what drivers need.
> 
> I think we could rather easily fix sg tables, but that is a totally separate
> task.

Looking at patch 8, the sg table seems perfectly sufficient to convey the
right dma addresses to the importer. Of course the exporter has to set up
the right kind of iommu mappings to make this work.
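
For a single contiguous BAR range, such an exporter path could be sketched
as follows, combining dma_map_resource() with the sg_set_dma_addr() helper
from patch 1 of this series (names and error handling are illustrative
only):

```c
/* Sketch of an exporter mapping device memory for a peer importer. */
static struct sg_table *map_bar_range(struct device *importer,
				      phys_addr_t bar_phys, size_t size)
{
	struct sg_table *sgt;
	dma_addr_t addr;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt || sg_alloc_table(sgt, 1, GFP_KERNEL))
		return NULL;

	/* Set up the IOMMU mapping for the importing device. */
	addr = dma_map_resource(importer, bar_phys, size,
				DMA_BIDIRECTIONAL, 0);

	/* No struct page here: only DMA address and length are valid. */
	sg_set_dma_addr(sgt->sgl, addr, size, 0);
	return sgt;
}
```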

> > The entire point of using sg lists was exactly to allow this use case of
> > peer2peer dma (or well in general have special exporters which managed
> > memory/IO ranges not backed by struct page). So essentially you're having
> > an "I'm totally not broken" flag here.
> 
> No, independent of needed struct page pointers we need to note if the
> exporter can handle peer2peer stuff from the hardware side in general.
> 
> So what I did is just set peer2peer allowed on the importer because of
> the driver's needs, and clear it in the exporter if the hardware can't
> handle that.
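
With the dma_buf_attach_info from this series, the importer side of that
scheme amounts to something like the following sketch; the exporter then
clears (or rejects) peer2peer when its hardware cannot do P2P:

```c
/* Sketch: importer declaring it can handle sg entries without pages. */
struct dma_buf_attach_info info = {
	.dmabuf    = dmabuf,
	.dev       = dev,
	.peer2peer = true,	/* can deal with sg_page(sg) == NULL */
};
struct dma_buf_attachment *attach = dma_buf_attach(&info);
```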

The only thing the importer seems to do is call
pci_peer_traffic_supported(), which the exporter could call too. What am I
missing (since the struct_page stuff sounds like it's fixed already by
you)?
-Daniel

> > I think a better approach would be if we add a requires_struct_page or so,
> > and annotate the current importers accordingly. Or we just fix them up (it
> > is all in shared ttm code after all, I think everyone else got this
> > right).
> 
> I would rather not bet on that.
> 
> Christian.
> 
> > -Daniel
> > 
> > > ---
> > >   drivers/dma-buf/dma-buf.c | 1 +
> > >   include/linux/dma-buf.h   | 4 ++++
> > >   2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index ffaa2f9a9c2c..f420225f93c6 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
> > >   	attach->dev = info->dev;
> > >   	attach->dmabuf = dmabuf;
> > > +	attach->peer2peer = info->peer2peer;
> > >   	attach->priv = info->priv;
> > >   	attach->invalidate = info->invalidate;
> > > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > > index 15dd8598bff1..1ef50bd9bc5b 100644
> > > --- a/include/linux/dma-buf.h
> > > +++ b/include/linux/dma-buf.h
> > > @@ -313,6 +313,7 @@ struct dma_buf {
> > >    * @dmabuf: buffer for this attachment.
> > >    * @dev: device attached to the buffer.
> > >    * @node: list of dma_buf_attachment.
> > > + * @peer2peer: true if the importer can handle peer resources without pages.
> > >    * @priv: exporter specific attachment data.
> > >    *
> > >    * This structure holds the attachment information between the dma_buf buffer
> > > @@ -328,6 +329,7 @@ struct dma_buf_attachment {
> > >   	struct dma_buf *dmabuf;
> > >   	struct device *dev;
> > >   	struct list_head node;
> > > +	bool peer2peer;
> > >   	void *priv;
> > >   	/**
> > > @@ -392,6 +394,7 @@ struct dma_buf_export_info {
> > >    * @dmabuf:	the exported dma_buf
> > >    * @dev:	the device which wants to import the attachment
> > >    * @priv:	private data of importer to this attachment
> > > + * @peer2peer:	true if the importer can handle peer resources without pages
> > >    * @invalidate:	callback to use for invalidating mappings
> > >    *
> > >    * This structure holds the information required to attach to a buffer. Used
> > > @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
> > >   	struct dma_buf *dmabuf;
> > >   	struct device *dev;
> > >   	void *priv;
> > > +	bool peer2peer;
> > >   	void (*invalidate)(struct dma_buf_attachment *attach);
> > >   };
> > > -- 
> > > 2.14.1
> > > 
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-03  9:09       ` Daniel Vetter
@ 2018-04-03 17:06         ` Jerome Glisse
  2018-04-03 18:08           ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Jerome Glisse @ 2018-04-03 17:06 UTC (permalink / raw)
  To: christian.koenig, linaro-mm-sig, linux-media, dri-devel, amd-gfx,
	linux-kernel

On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > Add a peer2peer flag noting that the importer can deal with device
> > > > resources which are not backed by pages.
> > > > 
> > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > Um strictly speaking they all should, but ttm never bothered to use the
> > > real interfaces but just hacked around the provided sg list, grabbing the
> > > underlying struct pages, then rebuilding&remapping the sg list again.
> > 
> > Actually that isn't correct. TTM converts them to a dma address array
> > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > 
> > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > Ben about nouveau, but the outcome is they don't really know.
> > 
> > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > mmap them anyway); the real underlying problem is that sg tables don't
> > provide what drivers need.
> > 
> > I think we could rather easily fix sg tables, but that is a totally separate
> > task.
> 
> Looking at patch 8, the sg table seems perfectly sufficient to convey the
> right dma addresses to the importer. Of course the exporter has to set up
> the right kind of iommu mappings to make this work.
> 
> > > The entire point of using sg lists was exactly to allow this use case of
> > > peer2peer dma (or well in general have special exporters which managed
> > > memory/IO ranges not backed by struct page). So essentially you're having
> > > an "I'm totally not broken" flag here.
> > 
> > No, independent of needed struct page pointers we need to note if the
> > exporter can handle peer2peer stuff from the hardware side in general.
> > 
> > So what I did is just set peer2peer allowed on the importer because of
> > the driver's needs, and clear it in the exporter if the hardware can't
> > handle that.
> 
> The only thing the importer seems to do is call
> pci_peer_traffic_supported(), which the exporter could call too. What am I
> missing (since the struct_page stuff sounds like it's fixed already by
> you)?
> -Daniel

AFAIK Logan's patchset requires registering and initializing struct page
for the device memory you want to map (export, from the exporter's point
of view).

With GPUs this isn't something we want; struct page is >~= 2^6 bytes, so for a
4GB GPU = 2^6*2^32/2^12 = 2^26 = 64MB of RAM
8GB GPU = 2^6*2^33/2^12 = 2^27 = 128MB of RAM
16GB GPU = 2^6*2^34/2^12 = 2^28 = 256MB of RAM
32GB GPU = 2^6*2^35/2^12 = 2^29 = 512MB of RAM

All this is mostly wasted, as only a small subset (which cannot be
constrained to a specific range) will ever be exported at any point in
time. For GPU workloads this is hardly justifiable; even for HMM I
do not plan to register all those pages.

Hence why I argue that dma_map_resource(), as used by Christian, is
good enough for us. People that care about SG can fix that, but I would
rather not have to depend on that and waste system memory.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-03 17:06         ` Jerome Glisse
@ 2018-04-03 18:08           ` Daniel Vetter
  2018-04-16 12:39             ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-03 18:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: christian.koenig, linaro-mm-sig, linux-media, dri-devel, amd-gfx,
	linux-kernel

On Tue, Apr 03, 2018 at 01:06:45PM -0400, Jerome Glisse wrote:
> On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> > On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > > Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > > Add a peer2peer flag noting that the importer can deal with device
> > > > > resources which are not backed by pages.
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > Um strictly speaking they all should, but ttm never bothered to use the
> > > > real interfaces but just hacked around the provided sg list, grabbing the
> > > > underlying struct pages, then rebuilding&remapping the sg list again.
> > > 
> > > Actually that isn't correct. TTM converts them to a dma address array
> > > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > > 
> > > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > > Ben about nouveau, but the outcome is they don't really know.
> > > 
> > > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > > mmap them anyway); the real underlying problem is that sg tables don't
> > > provide what drivers need.
> > > 
> > > I think we could rather easily fix sg tables, but that is a totally separate
> > > task.
> > 
> > Looking at patch 8, the sg table seems perfectly sufficient to convey the
> > right dma addresses to the importer. Of course the exporter has to set up
> > the right kind of iommu mappings to make this work.
> > 
> > > > The entire point of using sg lists was exactly to allow this use case of
> > > > peer2peer dma (or well in general have special exporters which managed
> > > > memory/IO ranges not backed by struct page). So essentially you're having
> > > > an "I'm totally not broken" flag here.
> > > 
> > > No, independent of needed struct page pointers we need to note if the
> > > exporter can handle peer2peer stuff from the hardware side in general.
> > > 
> > > So what I did is just set peer2peer allowed on the importer because of
> > > the driver's needs, and clear it in the exporter if the hardware can't
> > > handle that.
> > 
> > The only thing the importer seems to do is call
> > pci_peer_traffic_supported(), which the exporter could call too. What am I
> > missing (since the struct_page stuff sounds like it's fixed already by
> > you)?
> > -Daniel
> 
> AFAIK Logan's patchset requires registering and initializing struct page
> for the device memory you want to map (export, from the exporter's point
> of view).
> 
> With GPUs this isn't something we want; struct page is >~= 2^6 bytes, so for a
> 4GB GPU = 2^6*2^32/2^12 = 2^26 = 64MB of RAM
> 8GB GPU = 2^6*2^33/2^12 = 2^27 = 128MB of RAM
> 16GB GPU = 2^6*2^34/2^12 = 2^28 = 256MB of RAM
> 32GB GPU = 2^6*2^35/2^12 = 2^29 = 512MB of RAM
> 
> All this is mostly wasted, as only a small subset (which cannot be
> constrained to a specific range) will ever be exported at any point in
> time. For GPU workloads this is hardly justifiable; even for HMM I
> do not plan to register all those pages.
> 
> Hence why I argue that dma_map_resource(), as used by Christian, is
> good enough for us. People that care about SG can fix that, but I would
> rather not have to depend on that and waste system memory.

I did not mean you should dma_map_sg/page. I just meant that using
dma_map_resource to fill only the dma address part of the sg table seems
perfectly sufficient. And that's exactly why the importer gets an already
mapped sg table, so that it doesn't have to call dma_map_sg on something
that dma_map_sg can't handle.

Assuming you get an sg table that's been mapped by calling dma_map_sg was
always a bit of a case of bending the abstraction to avoid typing code. The
only thing an importer ever should have done is look at the dma addresses
in that sg table, nothing else.
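
Concretely, a well-behaved importer only walks the DMA side of the table,
along these lines (program_device_pagetable() and mdev are hypothetical
driver names, not existing kernel API):

```c
/* Sketch: consume only sg_dma_address()/sg_dma_len(), never sg_page(). */
struct scatterlist *sg;
int i;

for_each_sg(sgt->sgl, sg, sgt->nents, i) {
	dma_addr_t addr = sg_dma_address(sg);
	unsigned int len = sg_dma_len(sg);

	program_device_pagetable(mdev, addr, len);	/* hypothetical */
}
```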

And p2p seems to perfectly fit into this (surprise, it was meant to).
That's why I suggested we annotate the broken importers who assume the sg
table is mapped using dma_map_sg or has a struct page backing the memory
(though there don't seem to be any left) instead of annotating the
ones that aren't broken with a flag that's confusing - you could also have
a dma-buf sgt that points at some other memory that doesn't have struct
pages backing it.

Aside: At least internally in i915 we've been using this forever for our
own private/stolen memory. Unfortunately no other device can access that
range of memory, which is why we don't allow it to be imported to anything
but i915 itself. But if that hw restriction doesn't exist, it would
work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-03 18:08           ` Daniel Vetter
@ 2018-04-16 12:39             ` Christoph Hellwig
  2018-04-16 13:38               ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-16 12:39 UTC (permalink / raw)
  To: Jerome Glisse, christian.koenig, linaro-mm-sig, linux-media,
	dri-devel, amd-gfx, linux-kernel

On Tue, Apr 03, 2018 at 08:08:32PM +0200, Daniel Vetter wrote:
> I did not mean you should dma_map_sg/page. I just meant that using
> dma_map_resource to fill only the dma address part of the sg table seems
> perfectly sufficient.

But that is not how the interface works, especially facing sg_dma_len.

> Assuming you get an sg table that's been mapping by calling dma_map_sg was
> always a bit a case of bending the abstraction to avoid typing code. The
> only thing an importer ever should have done is look at the dma addresses
> in that sg table, nothing else.

The scatterlist is not a very good abstraction unfortunately, but
it is spread all over the kernel.  And we do expect that anyone who
gets passed a scatterlist can use sg_page() or sg_virt() (which calls
sg_page()) on it.  Your changes would break that, and will cause major
trouble because of that.

If you want to expose p2p memory returned from dma_map_resource in
dmabuf do not use scatterlists for this please, but with a new interface
that explicitly passes a virtual address, a dma address and a length
and make it very clear that virt_to_page will not work on the virtual
address.
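
The shape of the interface Christoph asks for here might look like this
hypothetical sketch (not an existing kernel API):

```c
/*
 * Hypothetical: an explicit (virt, dma, len) tuple for dma-buf, making
 * it clear that virt_to_page()/sg_page() must not be used on vaddr.
 */
struct dma_buf_map_entry {
	void		*vaddr;		/* may not be struct page backed */
	dma_addr_t	dma_addr;	/* device-visible address */
	size_t		len;		/* length of this range in bytes */
};
```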

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-16 12:39             ` Christoph Hellwig
@ 2018-04-16 13:38               ` Daniel Vetter
  2018-04-19  8:16                 ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-16 13:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List

On Mon, Apr 16, 2018 at 2:39 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Apr 03, 2018 at 08:08:32PM +0200, Daniel Vetter wrote:
>> I did not mean you should dma_map_sg/page. I just meant that using
>> dma_map_resource to fill only the dma address part of the sg table seems
>> perfectly sufficient.
>
> But that is not how the interface works, especially facing sg_dma_len.
>
>> Assuming you get an sg table that's been mapped by calling dma_map_sg was
>> always a bit of a case of bending the abstraction to avoid typing code. The
>> only thing an importer ever should have done is look at the dma addresses
>> in that sg table, nothing else.
>
> The scatterlist is not a very good abstraction unfortunately, but
> it is spread all over the kernel.  And we do expect that anyone who
> gets passed a scatterlist can use sg_page() or sg_virt() (which calls
> sg_page()) on it.  Your changes would break that, and will cause major
> trouble because of that.
>
> If you want to expose p2p memory returned from dma_map_resource in
> dmabuf do not use scatterlists for this please, but with a new interface
> that explicitly passes a virtual address, a dma address and a length
> and make it very clear that virt_to_page will not work on the virtual
> address.

We've broken that assumption in i915 years ago. Not struct page backed
gpu memory is very real.

Of course we'll never feed such a strange sg table to a driver which
doesn't understand it, but allowing sg_page == NULL works perfectly
fine. At least for gpu drivers.

If that's not acceptable then I guess we could go over the entire tree
and frob all the gpu related code to switch over to a new struct
sg_table_might_not_be_struct_page_backed, including all the other
functions we added over the past few years to iterate over sg tables.
But seems slightly silly, given that sg tables seem to do exactly what
we need.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-16 13:38               ` Daniel Vetter
@ 2018-04-19  8:16                 ` Christoph Hellwig
  2018-04-20  7:13                   ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-19  8:16 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Jerome Glisse, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
> We've broken that assumption in i915 years ago. Not struct page backed
> gpu memory is very real.
> 
> Of course we'll never feed such a strange sg table to a driver which
> doesn't understand it, but allowing sg_page == NULL works perfectly
> fine. At least for gpu drivers.

For GPU drivers on x86 with no dma coherency problems, sure.  But not
all the world is x86.  We already have problems due to dma-buf's use
of the awkward get_sgtable interface (see the comment on
arm_dma_get_sgtable, which I fully agree with), and doing this for memory
that doesn't have a struct page at all will make things even worse.

> If that's not acceptable then I guess we could go over the entire tree
> and frob all the gpu related code to switch over to a new struct
> sg_table_might_not_be_struct_page_backed, including all the other
> functions we added over the past few years to iterate over sg tables.
> But seems slightly silly, given that sg tables seem to do exactly what
> we need.

It isn't silly.  We will have to do some surgery like that anyway
because the current APIs don't work.  So relax, sit back and come up
with an API that solves the existing issues and serves us well in
the future.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-19  8:16                 ` Christoph Hellwig
@ 2018-04-20  7:13                   ` Daniel Vetter
  2018-04-20  8:58                     ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-20  7:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Jerome Glisse, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:
> On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
> > We've broken that assumption in i915 years ago. Not struct page backed
> > gpu memory is very real.
> > 
> > Of course we'll never feed such a strange sg table to a driver which
> > doesn't understand it, but allowing sg_page == NULL works perfectly
> > fine. At least for gpu drivers.
> 
> For GPU drivers on x86 with no dma coherency problems, sure.  But not
> all the world is x86.  We already have problems due to dma-buf's use
> of the awkward get_sgtable interface (see the comment on
> arm_dma_get_sgtable, which I fully agree with), and doing this for memory
> that doesn't have a struct page at all will make things even worse.

x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches
tends to be too expensive, so there's pci-e support and chipset support to
forgo it. Plus drivers flushing caches themselves.

The dma_get_sgtable thing is indeed fun, right solution would probably be
to push the dma-buf export down into the dma layer. The comment for
arm_dma_get_sgtable is also not a real concern, because dma-buf also
abstracts away the flushing (or well is supposed to), so there really
shouldn't be anyone calling the streaming apis on the returned sg table.
That's why dma-buf gives you an sg table that's mapped already.

> > If that's not acceptable then I guess we could go over the entire tree
> > and frob all the gpu related code to switch over to a new struct
> > sg_table_might_not_be_struct_page_backed, including all the other
> > functions we added over the past few years to iterate over sg tables.
> > But seems slightly silly, given that sg tables seem to do exactly what
> > we need.
> 
> It isn't silly.  We will have to do some surgery like that anyway
> because the current APIs don't work.  So relax, sit back and come up
> with an API that solves the existing issues and serves us well in
> the future.

So we should just implement a copy of sg table for dma-buf, since I still
think it does exactly what we need for gpus?

Yes there's a bit of a layering violation insofar as drivers really
shouldn't each have their own copy of "how do I convert a piece of dma
memory into a dma-buf", but that doesn't render the interface a bad idea.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20  7:13                   ` Daniel Vetter
@ 2018-04-20  8:58                     ` Christian König
  2018-04-20 10:17                       ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-04-20  8:58 UTC (permalink / raw)
  To: Christoph Hellwig, Jerome Glisse, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On 20.04.2018 at 09:13, Daniel Vetter wrote:
> On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:
>> On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
>>> We've broken that assumption in i915 years ago. Not struct page backed
>>> gpu memory is very real.
>>>
>>> Of course we'll never feed such a strange sg table to a driver which
>>> doesn't understand it, but allowing sg_page == NULL works perfectly
>>> fine. At least for gpu drivers.
>> For GPU drivers on x86 with no dma coherency problems, sure.  But not
>> all the world is x86.  We already have problems due to dmabuf's use
>> of the awkward get_sgtable interface (see the comment on
>> arm_dma_get_sgtable that I fully agree with), and doing this for memory
>> that doesn't have a struct page at all will make things even worse.
> x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches
> tends to be too expensive, so there's pci-e support and chipset support to
> forgo it. Plus drivers flushing caches themselves.
>
> The dma_get_sgtable thing is indeed fun, right solution would probably be
> to push the dma-buf export down into the dma layer. The comment for
> arm_dma_get_sgtable is also not a real concern, because dma-buf also
> abstracts away the flushing (or well is supposed to), so there really
> shouldn't be anyone calling the streaming apis on the returned sg table.
> That's why dma-buf gives you an sg table that's mapped already.
>
>>> If that's not acceptable then I guess we could go over the entire tree
>>> and frob all the gpu related code to switch over to a new struct
>>> sg_table_might_not_be_struct_page_backed, including all the other
>>> functions we added over the past few years to iterate over sg tables.
>>> But seems slightly silly, given that sg tables seem to do exactly what
>>> we need.
>> It isn't silly.  We will have to do some surgery like that anyway
>> because the current APIs don't work.  So relax, sit back and come up
>> with an API that solves the existing issues and serves us well in
>> the future.
> So we should just implement a copy of sg table for dma-buf, since I still
> think it does exactly what we need for gpus?
>
> Yes there's a bit of a layering violation insofar as drivers really
> shouldn't each have their own copy of "how do I convert a piece of dma
> memory into a dma-buf", but that doesn't render the interface a bad idea.

Completely agree on that.

What we need is an sg_alloc_table_from_resources(dev, resources, 
num_resources) which does the handling common to all drivers.

Christian.

> -Daniel

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20  8:58                     ` Christian König
@ 2018-04-20 10:17                       ` Christoph Hellwig
  2018-04-20 10:44                         ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-20 10:17 UTC (permalink / raw)
  To: christian.koenig
  Cc: Christoph Hellwig, Jerome Glisse,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On Fri, Apr 20, 2018 at 10:58:50AM +0200, Christian König wrote:
> > Yes there's a bit of a layering violation insofar as drivers really
> > shouldn't each have their own copy of "how do I convert a piece of dma
> > memory into a dma-buf", but that doesn't render the interface a bad idea.
> 
> Completely agree on that.
> 
> What we need is an sg_alloc_table_from_resources(dev, resources,
> num_resources) which does the handling common to all drivers.

A structure that contains

{page,offset,len} + {dma_addr+dma_len}

is not a good container for storing

{virt addr, dma_addr, len}

no matter what interface you build around it.  And that is discounting
all the problems around mapping coherent allocations for other devices,
or the iommu merging problem we are having another thread on.

So let's come up with a better high level interface first, and then
worry about how to implement it in the low-level dma-mapping interface
second.  Especially given that my consolidation of the dma_map_ops
implementation is in full swing and there shouldn't be all that many
to bother with.

So first question:  Do you actually care about having multiple
pairs of the above, or instead of all chunks just deal with a single
of the above?  In that case we really should not need that many
new interfaces as dma_map_resource will be all you need anyway.

> 
> Christian.
> 
> > -Daniel
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20 10:17                       ` Christoph Hellwig
@ 2018-04-20 10:44                         ` Christian König
  2018-04-20 12:46                           ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2018-04-20 10:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On 20.04.2018 at 12:17, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 10:58:50AM +0200, Christian König wrote:
>>> Yes there's a bit of a layering violation insofar as drivers really
>>> shouldn't each have their own copy of "how do I convert a piece of dma
>>> memory into a dma-buf", but that doesn't render the interface a bad idea.
>> Completely agree on that.
>>
>> What we need is an sg_alloc_table_from_resources(dev, resources,
>> num_resources) which does the handling common to all drivers.
> A structure that contains
>
> {page,offset,len} + {dma_addr+dma_len}
>
> is not a good container for storing
>
> {virt addr, dma_addr, len}
>
> no matter what interface you build around it.

Why not? I mean at least for my use case we actually don't need the 
virtual address.

What we need is {dma_addr+dma_len} in a consistent interface which can 
come from both {page,offset,len} as well as {resource, len}.

What I actually don't need is separate handling for system memory and 
resources, but that is exactly what we would get if we don't use sg_table.

Christian.

> And that is discounting
> all the problems around mapping coherent allocations for other devices,
> or the iommu merging problem we are having another thread on.
>
> So let's come up with a better high level interface first, and then
> worrty about how to implement it in the low-level dma-mapping interface
> second.  Especially given that my consolidation of the dma_map_ops
> implementation is in full stream and there shoudn't be all that many
> to bother with.
>
> So first question:  Do you actually care about having multiple
> pairs of the above, or instead of all chunks just deal with a single
> of the above?  In that case we really should not need that many
> new interfaces as dma_map_resource will be all you need anyway.
>
>> Christian.
>>
>>> -Daniel
> ---end quoted text---

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20 10:44                         ` Christian König
@ 2018-04-20 12:46                           ` Christoph Hellwig
  2018-04-20 15:21                             ` [Linaro-mm-sig] " Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-20 12:46 UTC (permalink / raw)
  To: Christian König
  Cc: Christoph Hellwig, Jerome Glisse,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	open list:DMA BUFFER SHARING FRAMEWORK, dri-devel, amd-gfx list,
	Linux Kernel Mailing List, Logan Gunthorpe, Dan Williams

On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote:
> > > What we need is an sg_alloc_table_from_resources(dev, resources,
> > > num_resources) which does the handling common to all drivers.
> > A structure that contains
> > 
> > {page,offset,len} + {dma_addr+dma_len}
> > 
> > is not a good container for storing
> > 
> > {virt addr, dma_addr, len}
> > 
> > no matter what interface you build around it.
> 
> Why not? I mean at least for my use case we actually don't need the virtual
> address.

If you don't need the virtual address you need scatterlist even less.

> What we need is {dma_addr+dma_len} in a consistent interface which can come
> from both {page,offset,len} as well as {resource, len}.

Ok.

> What I actually don't need is separate handling for system memory and
> resources, but that is exactly what we would get if we don't use sg_table.

At the very lowest level they will need to be handled differently for
many architectures, the questions is at what point we'll do the
branching out.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20 12:46                           ` Christoph Hellwig
@ 2018-04-20 15:21                             ` Daniel Vetter
  2018-04-24 18:48                               ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-20 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Apr 20, 2018 at 05:46:25AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote:
> > > > What we need is an sg_alloc_table_from_resources(dev, resources,
> > > > num_resources) which does the handling common to all drivers.
> > > A structure that contains
> > > 
> > > {page,offset,len} + {dma_addr+dma_len}
> > > 
> > > is not a good container for storing
> > > 
> > > {virt addr, dma_addr, len}
> > > 
> > > no matter what interface you build around it.
> > 
> > Why not? I mean at least for my use case we actually don't need the virtual
> > address.
> 
> If you don't need the virtual address you need scatterlist even less.
> 
> > What we need is {dma_addr+dma_len} in a consistent interface which can come
> > from both {page,offset,len} as well as {resource, len}.
> 
> Ok.
> 
> > What I actually don't need is separate handling for system memory and
> > resources, but that is exactly what we would get if we don't use sg_table.
> 
> At the very lowest level they will need to be handled differently for
> many architectures, the questions is at what point we'll do the
> branching out.

Having at least struct page also in that list with (dma_addr_t, length)
pairs has a bunch of benefits for drivers in unifying buffer handling
code. You just pass that one single list around, use the dma_addr_t side
for gpu access (generally bashing it into gpu ptes). And the struct page
(if present) for cpu access, using kmap or vm_insert_*. We generally
ignore virt, if we do need a full mapping then we construct a vmap for
that buffer of our own.

If (and that would be serious amounts of work all over the tree, with lots
of drivers) we come up with a new struct for gpu buffers, then I'd also
add "force page alignement for everything" to the requirements list.
That's another mismatch we have, since gpu buffer objects (and dma-buf)
are always full pages. That mismatch motivated the addition of the
page-oriented sg iterators.

So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
with struct page * being optional (if it's a resource, or something else
that the kernel core mm isn't aware of). But that only has benefits if we
really roll it out everywhere, in all the subsystems and drivers, since if
we don't we've made the struct pages ** <-> sgt conversion fun only worse
by adding a third representation of gpu buffer object backing storage.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-20 15:21                             ` [Linaro-mm-sig] " Daniel Vetter
@ 2018-04-24 18:48                               ` Christoph Hellwig
  2018-04-24 19:32                                 ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-24 18:48 UTC (permalink / raw)
  To: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote:
> > At the very lowest level they will need to be handled differently for
> > many architectures, the questions is at what point we'll do the
> > branching out.
> 
> Having at least struct page also in that list with (dma_addr_t, length)
> pairs has a bunch of benefits for drivers in unifying buffer handling
> code. You just pass that one single list around, use the dma_addr_t side
> for gpu access (generally bashing it into gpu ptes). And the struct page
> (if present) for cpu access, using kmap or vm_insert_*. We generally
> ignore virt, if we do need a full mapping then we construct a vmap for
> that buffer of our own.

Well, for mapping a resource (which gets back to the start of the
discussion) you will need an explicit virt pointer.  You also need
an explicit virt pointer and not just page_address/kmap for users of
dma_get_sgtable, because for many architectures you will need to flush
the virtual address used to access the data, which might be a
vmap/ioremap style mapping returned from dma_alloc_address, and not
the directly mapped kernel address.

Here is another idea at the low-level dma API level:

 - dma_get_sgtable goes away.  The replacement is a new
   dma_alloc_remap helper that takes the virtual address returned
   from dma_alloc_attrs/coherent and creates a dma_addr_t for the
   given new device.  If the original allocation was a coherent
   one no cache flushing is required either (because the arch
   made sure it is coherent), if the original allocation used
   DMA_ATTR_NON_CONSISTENT the new allocation will need
   dma_cache_sync calls as well.
 - you never even try to share a mapping returned from
   dma_map_resource - instead each device using it creates a new
   mapping, which makes sense as no virtual addresses are involved
   at all.

> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
> with struct page * being optional (if it's a resource, or something else
> that the kernel core mm isn't aware of). But that only has benefits if we
> really roll it out everywhere, in all the subsystems and drivers, since if
> we don't we've made the struct pages ** <-> sgt conversion fun only worse
> by adding a 3 representation of gpu buffer object backing storage.

I think the most important thing about such a buffer object is that
it can distinguish the underlying mapping types.  While
dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
back a dma_addr_t they are in no way interchangeable.  And trying to
stuff them all into a structure like struct scatterlist that has
no indication what kind of mapping you are dealing with is just
asking for trouble.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-24 18:48                               ` Christoph Hellwig
@ 2018-04-24 19:32                                 ` Daniel Vetter
  2018-04-25  5:48                                   ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-24 19:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Tue, Apr 24, 2018 at 8:48 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote:
>> > At the very lowest level they will need to be handled differently for
>> > many architectures, the questions is at what point we'll do the
>> > branching out.
>>
>> Having at least struct page also in that list with (dma_addr_t, length)
>> pairs has a bunch of benefits for drivers in unifying buffer handling
>> code. You just pass that one single list around, use the dma_addr_t side
>> for gpu access (generally bashing it into gpu ptes). And the struct page
>> (if present) for cpu access, using kmap or vm_insert_*. We generally
>> ignore virt, if we do need a full mapping then we construct a vmap for
>> that buffer of our own.
>
> Well, for mapping a resource (which gets back to the start of the
> discussion) you will need an explicit virt pointer.  You also need
> an explicit virt pointer and not just page_address/kmap for users of
> dma_get_sgtable, because for many architectures you will need to flush
> the virtual address used to access the data, which might be a
> vmap/ioremap style mapping returned from dma_alloc_address, and not
> the directly mapped kernel address.

Out of curiosity, how much virtual flushing stuff is there still out
there? At least in drm we've pretty much ignored this, and seem to be
getting away without a huge uproar (at least from driver developers
and users, core folks are less amused about that).

And at least for gpus that seems to have been the case since forever,
or at least since AGP was a thing 20 years ago: AGP isn't coherent, so
needs explicit cache flushing, and we have our own implementations of
that in drivers/char/agp. Luckily AGP died 10 years ago, so no one yet
proposed to port it all over to the iommu framework and hide it behind
the dma api (which really would be the "clean" way to do this, AGP is
simply an IOMMU + special socket dedicated for the add-on gpu).

> Here is another idea at the low-level dma API level:
>
>  - dma_get_sgtable goes away.  The replacement is a new
>    dma_alloc_remap helper that takes the virtual address returned
>    from dma_alloc_attrs/coherent and creates a dma_addr_t for the
>    given new device.  If the original allocation was a coherent
>    one no cache flushing is required either (because the arch
>    made sure it is coherent), if the original allocation used
>    DMA_ATTR_NON_CONSISTENT the new allocation will need
>    dma_cache_sync calls as well.

Yeah I think that should work. dma_get_sgtable is a pretty nasty
layering violation.

>  - you never even try to share a mapping returned from
>    dma_map_resource - instead each device using it creates a new
>    mapping, which makes sense as no virtual addresses are involved
>    at all.

Yeah the dma-buf exporter always knows what kind of backing storage it
is dealing with, and for which struct device it should set up a new
view. Hence can make sure that it calls the right functions to
establish a new mapping, whether that's dma_map_sg, dma_map_resource
or the new dma_alloc_remap (instead of the dma_get_sgtable layering
mixup). The importer doesn't know.

>> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
>> with struct page * being optional (if it's a resource, or something else
>> that the kernel core mm isn't aware of). But that only has benefits if we
>> really roll it out everywhere, in all the subsystems and drivers, since if
>> we don't we've made the struct pages ** <-> sgt conversion fun only worse
>> by adding a third representation of gpu buffer object backing storage.
>
> I think the most important thing about such a buffer object is that
> it can distinguish the underlying mapping types.  While
> dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
> dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
> back a dma_addr_t they are in no way interchangeable.  And trying to
> stuff them all into a structure like struct scatterlist that has
> no indication what kind of mapping you are dealing with is just
> asking for trouble.

Well the idea was to have 1 interface to allow all drivers to share
buffers with anything else, no matter how exactly they're allocated.
dma-buf has all the functions for flushing, so you can have coherent
mappings, non-coherent mappings and pretty much anything else. Or well
could, because in practice people hack up layering violations until it
works for the 2-3 drivers they care about. On top of that there's the
small issue that x86 insists that dma is coherent (and that's true for
most devices, including v4l drivers you might want to share stuff
with), and gpus really, really really do want to make almost
everything incoherent.

The end result is pretty epic :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-24 19:32                                 ` Daniel Vetter
@ 2018-04-25  5:48                                   ` Christoph Hellwig
  2018-04-25  6:10                                     ` Alex Deucher
  2018-04-25  6:13                                     ` Daniel Vetter
  0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  5:48 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
> Out of curiosity, how much virtual flushing stuff is there still out
> there? At least in drm we've pretty much ignored this, and seem to be
> getting away without a huge uproar (at least from driver developers
> and users, core folks are less amused about that).

As I've just been wading through the code, the following architectures
have non-coherent dma that flushes by virtual address for at least some
platforms:

 - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
   powerpc

These have non-coherent dma ops that flush by physical address:

 - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc

And these do not have non-coherent dma ops at all:

 - alpha, h8300, riscv, unicore32, x86

[1] arm seems to do both virtually and physically based ops, further
audit needed.

Note that using virtual addresses in the cache flushing interface
doesn't mean that the cache actually is virtually indexed, but it at
least allows for the possibility.

> > I think the most important thing about such a buffer object is that
> > it can distinguish the underlying mapping types.  While
> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
> > back a dma_addr_t they are in no way interchangeable.  And trying to
> > stuff them all into a structure like struct scatterlist that has
> > no indication what kind of mapping you are dealing with is just
> > asking for trouble.
> 
> Well the idea was to have 1 interface to allow all drivers to share
> buffers with anything else, no matter how exactly they're allocated.

Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
a scatterlist through the sg_table in dma_buf_map_attachment /
->map_dma_buf, but looking at a few of the callers it seems like they
really do not even want a scatterlist to start with, but check that
it contains a physically contiguous range first.  So kicking the
scatterlist out of there will probably improve the interface in general.

> dma-buf has all the functions for flushing, so you can have coherent
> mappings, non-coherent mappings and pretty much anything else. Or well
> could, because in practice people hack up layering violations until it
> works for the 2-3 drivers they care about. On top of that there's the
> small issue that x86 insists that dma is coherent (and that's true for
> most devices, including v4l drivers you might want to share stuff
> with), and gpus really, really really do want to make almost
> everything incoherent.

How do discrete GPUs manage to be incoherent when attached over PCIe?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  5:48                                   ` Christoph Hellwig
@ 2018-04-25  6:10                                     ` Alex Deucher
  2018-04-25  6:13                                     ` Daniel Vetter
  1 sibling, 0 replies; 83+ messages in thread
From: Alex Deucher @ 2018-04-25  6:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe, Christian König,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 1:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>> Out of curiosity, how much virtual flushing stuff is there still out
>> there? At least in drm we've pretty much ignored this, and seem to be
>> getting away without a huge uproar (at least from driver developers
>> and users, core folks are less amused about that).
>
> As I've just been wading through the code, the following architectures
> have non-coherent dma that flushes by virtual address for at least some
> platforms:
>
>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>    powerpc
>
> These have non-coherent dma ops that flush by physical address:
>
>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>
> And these do not have non-coherent dma ops at all:
>
>  - alpha, h8300, riscv, unicore32, x86
>
> [1] arm seems to do both virtually and physically based ops, further
> audit needed.
>
> Note that using virtual addresses in the cache flushing interface
> doesn't mean that the cache actually is virtually indexed, but it at
> least allows for the possibility.
>
>> > I think the most important thing about such a buffer object is that
>> > it can distinguish the underlying mapping types.  While
>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>> > stuff them all into a structure like struct scatterlist that has
>> > no indication what kind of mapping you are dealing with is just
>> > asking for trouble.
>>
>> Well the idea was to have 1 interface to allow all drivers to share
>> buffers with anything else, no matter how exactly they're allocated.
>
> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
> a scatterlist through the sg_table in dma_buf_map_attachment /
> ->map_dma_buf, but looking at a few of the callers it seems like they
> really do not even want a scatterlist to start with, but check that
> it contains a physically contiguous range first.  So kicking the
> scatterlist out of there will probably improve the interface in general.
>
>> dma-buf has all the functions for flushing, so you can have coherent
>> mappings, non-coherent mappings and pretty much anything else. Or well
>> could, because in practice people hack up layering violations until it
>> works for the 2-3 drivers they care about. On top of that there's the
>> small issue that x86 insists that dma is coherent (and that's true for
>> most devices, including v4l drivers you might want to share stuff
>> with), and gpus really, really really do want to make almost
>> everything incoherent.
>
> How do discrete GPUs manage to be incoherent when attached over PCIe?

They can do CPU cache snooped (coherent) or non-snooped (incoherent)
DMA.  Also for things like APUs, they show up as a PCIe device, but
the actual GPU core is part of the same die as the CPU and they have
their own special paths to memory, etc.  The fact that they show up as
PCIe devices is mostly for enumeration purposes.

Alex

> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  5:48                                   ` Christoph Hellwig
  2018-04-25  6:10                                     ` Alex Deucher
@ 2018-04-25  6:13                                     ` Daniel Vetter
  2018-04-25  6:23                                       ` Daniel Vetter
  2018-04-25  6:24                                       ` [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag Alex Deucher
  1 sibling, 2 replies; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25  6:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>> Out of curiosity, how much virtual flushing stuff is there still out
>> there? At least in drm we've pretty much ignored this, and seem to be
>> getting away without a huge uproar (at least from driver developers
>> and users, core folks are less amused about that).
>
> As I've just been wading through the code, the following architectures
> have non-coherent dma that flushes by virtual address for at least some
> platforms:
>
>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>    powerpc
>
> These have non-coherent dma ops that flush by physical address:
>
>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>
> And these do not have non-coherent dma ops at all:
>
>  - alpha, h8300, riscv, unicore32, x86
>
> [1] arm seems to do both virtually and physically based ops, further
> audit needed.
>
> Note that using virtual addresses in the cache flushing interface
> doesn't mean that the cache actually is virtually indexed, but it at
> least allows for the possibility.
>
>> > I think the most important thing about such a buffer object is that
>> > it can distinguish the underlying mapping types.  While
>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>> > stuff them all into a structure like struct scatterlist that has
>> > no indication what kind of mapping you are dealing with is just
>> > asking for trouble.
>>
>> Well the idea was to have 1 interface to allow all drivers to share
>> buffers with anything else, no matter how exactly they're allocated.
>
> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
> a scatterlist through the sg_table in dma_buf_map_attachment /
> ->map_dma_buf, but looking at a few of the callers it seems like they
> really do not even want a scatterlist to start with, but check that
> it contains a physically contiguous range first.  So kicking the
> scatterlist out there will probably improve the interface in general.

I think by number most drm drivers require contiguous memory (or an
iommu that makes it look contiguous). But there's plenty others who
have another set of pagetables on the gpu itself and can
scatter-gather. Usually it's the former for display/video blocks, and
the latter for rendering.

>> dma-buf has all the functions for flushing, so you can have coherent
>> mappings, non-coherent mappings and pretty much anything else. Or well
>> could, because in practice people hack up layering violations until it
>> works for the 2-3 drivers they care about. On top of that there's the
>> small issue that x86 insists that dma is coherent (and that's true for
>> most devices, including v4l drivers you might want to share stuff
>> with), and gpus really, really really do want to make almost
>> everything incoherent.
>
> How do discrete GPUs manage to be incoherent when attached over PCIe?

It has a non-coherent transaction mode (which the chipset can opt to
not implement and still flush), to make sure the AGP horror show
doesn't happen again and GPU folks are happy with PCIe. That's at
least my understanding from digging around in amd the last time we had
coherency issues between intel and amd gpus. GPUs have some bits
somewhere (in the pagetables, or in the buffer object description
table created by userspace) to control that stuff.

For anything on the SoC it's presented as a pci device, but that's
extremely fake, and we can definitely do non-snooped transactions on
drm/i915. Again, controlled by a mix of pagetables and
userspace-provided buffer object description tables.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:13                                     ` Daniel Vetter
@ 2018-04-25  6:23                                       ` Daniel Vetter
  2018-04-25  6:43                                         ` Christoph Hellwig
  2018-04-25  6:24                                       ` [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag Alex Deucher
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25  6:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 8:13 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>>> Out of curiosity, how much virtual flushing stuff is there still out
>>> there? At least in drm we've pretty much ignored this, and seem to be
>>> getting away without a huge uproar (at least from driver developers
>>> and users, core folks are less amused about that).
>>
>> As I've just been wading through the code, the following architectures
>> have non-coherent dma that flushes by virtual address for at least some
>> platforms:
>>
>>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>>    powerpc
>>
>> These have non-coherent dma ops that flush by physical address:
>>
>>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>>
>> And these do not have non-coherent dma ops at all:
>>
>>  - alpha, h8300, riscv, unicore32, x86
>>
>> [1] arm seems to do both virtually and physically based ops, further
>> audit needed.
>>
>> Note that using virtual addresses in the cache flushing interface
>> doesn't mean that the cache actually is virtually indexed, but it at
>> least allows for the possibility.
>>
>>> > I think the most important thing about such a buffer object is that
>>> > it can distinguish the underlying mapping types.  While
>>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>>> > stuff them all into a structure like struct scatterlist that has
>>> > no indication what kind of mapping you are dealing with is just
>>> > asking for trouble.
>>>
>>> Well the idea was to have 1 interface to allow all drivers to share
>>> buffers with anything else, no matter how exactly they're allocated.
>>
>> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
>> a scatterlist through the sg_table in dma_buf_map_attachment /
>> ->map_dma_buf, but looking at a few of the callers it seems like they
>> really do not even want a scatterlist to start with, but check that
>> it contains a physically contiguous range first.  So kicking the
>> scatterlist out there will probably improve the interface in general.
>
> I think by number most drm drivers require contiguous memory (or an
> iommu that makes it look contiguous). But there's plenty others who
> have another set of pagetables on the gpu itself and can
> scatter-gather. Usually it's the former for display/video blocks, and
> the latter for rendering.

For more fun:

https://www.spinics.net/lists/dri-devel/msg173630.html

Yeah, sometimes we want to disable the iommu because the on-gpu
pagetables are faster ...
-Daniel

>>> dma-buf has all the functions for flushing, so you can have coherent
>>> mappings, non-coherent mappings and pretty much anything else. Or well
>>> could, because in practice people hack up layering violations until it
>>> works for the 2-3 drivers they care about. On top of that there's the
>>> small issue that x86 insists that dma is coherent (and that's true for
>>> most devices, including v4l drivers you might want to share stuff
>>> with), and gpus really, really really do want to make almost
>>> everything incoherent.
>>
>> How do discrete GPUs manage to be incoherent when attached over PCIe?
>
> It has a non-coherent transaction mode (which the chipset can opt to
> not implement and still flush), to make sure the AGP horror show
> doesn't happen again and GPU folks are happy with PCIe. That's at
> least my understanding from digging around in amd the last time we had
> coherency issues between intel and amd gpus. GPUs have some bits
> somewhere (in the pagetables, or in the buffer object description
> table created by userspace) to control that stuff.
>
> For anything on the SoC it's presented as a pci device, but that's
> extremely fake, and we can definitely do non-snooped transactions on
> drm/i915. Again, controlled by a mix of pagetables and
> userspace-provided buffer object description tables.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:13                                     ` Daniel Vetter
  2018-04-25  6:23                                       ` Daniel Vetter
@ 2018-04-25  6:24                                       ` Alex Deucher
  2018-04-25  6:41                                         ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Alex Deucher @ 2018-04-25  6:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Linux Kernel Mailing List, dri-devel,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	amd-gfx list, Dan Williams, Logan Gunthorpe,
	Christian König, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 2:13 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>>> Out of curiosity, how much virtual flushing stuff is there still out
>>> there? At least in drm we've pretty much ignored this, and seem to be
>>> getting away without a huge uproar (at least from driver developers
>>> and users, core folks are less amused about that).
>>
>> As I've just been wading through the code, the following architectures
>> have non-coherent dma that flushes by virtual address for at least some
>> platforms:
>>
>>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>>    powerpc
>>
>> These have non-coherent dma ops that flush by physical address:
>>
>>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>>
>> And these do not have non-coherent dma ops at all:
>>
>>  - alpha, h8300, riscv, unicore32, x86
>>
>> [1] arm seems to do both virtually and physically based ops, further
>> audit needed.
>>
>> Note that using virtual addresses in the cache flushing interface
>> doesn't mean that the cache actually is virtually indexed, but it at
>> least allows for the possibility.
>>
>>> > I think the most important thing about such a buffer object is that
>>> > it can distinguish the underlying mapping types.  While
>>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>>> > stuff them all into a structure like struct scatterlist that has
>>> > no indication what kind of mapping you are dealing with is just
>>> > asking for trouble.
>>>
>>> Well the idea was to have 1 interface to allow all drivers to share
>>> buffers with anything else, no matter how exactly they're allocated.
>>
>> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
>> a scatterlist through the sg_table in dma_buf_map_attachment /
>> ->map_dma_buf, but looking at a few of the callers it seems like they
>> really do not even want a scatterlist to start with, but check that
>> it contains a physically contiguous range first.  So kicking the
>> scatterlist out there will probably improve the interface in general.
>
> I think by number most drm drivers require contiguous memory (or an
> iommu that makes it look contiguous). But there's plenty others who
> have another set of pagetables on the gpu itself and can
> scatter-gather. Usually it's the former for display/video blocks, and
> the latter for rendering.
>
>>> dma-buf has all the functions for flushing, so you can have coherent
>>> mappings, non-coherent mappings and pretty much anything else. Or well
>>> could, because in practice people hack up layering violations until it
>>> works for the 2-3 drivers they care about. On top of that there's the
>>> small issue that x86 insists that dma is coherent (and that's true for
>>> most devices, including v4l drivers you might want to share stuff
>>> with), and gpus really, really really do want to make almost
>>> everything incoherent.
>>
>> How do discrete GPUs manage to be incoherent when attached over PCIe?
>
> It has a non-coherent transaction mode (which the chipset can opt to
> not implement and still flush), to make sure the AGP horror show
> doesn't happen again and GPU folks are happy with PCIe. That's at
> least my understanding from digging around in amd the last time we had
> coherency issues between intel and amd gpus. GPUs have some bits
> somewhere (in the pagetables, or in the buffer object description
> table created by userspace) to control that stuff.

Right.  We have a bit in the GPU page table entries that determines
whether we snoop the CPU's cache or not.

Alex

>
> For anything on the SoC it's presented as a pci device, but that's
> extremely fake, and we can definitely do non-snooped transactions on
> drm/i915. Again, controlled by a mix of pagetables and
> userspace-provided buffer object description tables.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:24                                       ` [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag Alex Deucher
@ 2018-04-25  6:41                                         ` Christoph Hellwig
  2018-04-25 17:44                                           ` Alex Deucher
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  6:41 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Daniel Vetter, Christoph Hellwig, Linux Kernel Mailing List,
	dri-devel, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Jerome Glisse, amd-gfx list, Dan Williams, Logan Gunthorpe,
	Christian König, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
> > It has a non-coherent transaction mode (which the chipset can opt to
> > not implement and still flush), to make sure the AGP horror show
> > doesn't happen again and GPU folks are happy with PCIe. That's at
> > least my understanding from digging around in amd the last time we had
> > coherency issues between intel and amd gpus. GPUs have some bits
> > somewhere (in the pagetables, or in the buffer object description
> > table created by userspace) to control that stuff.
> 
> Right.  We have a bit in the GPU page table entries that determines
> whether we snoop the CPU's cache or not.

I can see how that works with the GPU on the same SOC or SOC set as the
CPU.  But how is that going to work for a GPU that is a plain old PCIe
card?  The cache snooping in that case is happening in the PCIe root
complex.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:23                                       ` Daniel Vetter
@ 2018-04-25  6:43                                         ` Christoph Hellwig
  2018-04-25  7:02                                           ` Daniel Vetter
  2018-04-25  7:41                                           ` Thierry Reding
  0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  6:43 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, Thierry Reding

On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
> For more fun:
> 
> https://www.spinics.net/lists/dri-devel/msg173630.html
> 
> Yeah, sometimes we want to disable the iommu because the on-gpu
> pagetables are faster ...

I am not on this list, but remote NAK from here.  This needs an
API from the iommu/dma-mapping code.  Drivers have no business poking
into these details.

Thierry, please resend this with at least the iommu list and
linux-arm-kernel in Cc to have a proper discussion on the right API.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:43                                         ` Christoph Hellwig
@ 2018-04-25  7:02                                           ` Daniel Vetter
  2018-04-25  7:09                                             ` Christoph Hellwig
  2018-04-25  7:41                                           ` Thierry Reding
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25  7:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, Thierry Reding

On Wed, Apr 25, 2018 at 8:43 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
>> For more fun:
>>
>> https://www.spinics.net/lists/dri-devel/msg173630.html
>>
>> Yeah, sometimes we want to disable the iommu because the on-gpu
>> pagetables are faster ...
>
> I am not on this list, but remote NAK from here.  This needs an
> API from the iommu/dma-mapping code.  Drivers have no business poking
> into these details.

Can we please not nack everything right away? Doesn't really motivate
me to show you all the various things we're doing in gpu to make the
dma layer work for us. That kind of noodling around in lower levels to
get them to do what we want is absolutely par for the course for gpu
drivers. If you just nack everything I point you at for illustrative
purposes, then I can't show you stuff anymore.

Just to make it clear: I do want to get this stuff sorted, and it's
awesome that someone from core finally takes a serious look at what
gpu folks have been doing for decades (instead of just telling us
we're incompetent and doing it all wrong and then steaming off), and
how to make this work without layering violations to no end. But
stopping the world until this is fixed isn't really a good option.

Thanks, Daniel

> Thierry, please resend this with at least the iommu list and
> linux-arm-kernel in Cc to have a proper discussion on the right API.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  7:02                                           ` Daniel Vetter
@ 2018-04-25  7:09                                             ` Christoph Hellwig
  2018-04-25  7:30                                               ` Daniel Vetter
  2018-04-25  7:43                                               ` Thierry Reding
  0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  7:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, Thierry Reding

On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> Can we please not nack everything right away? Doesn't really motivate
> me to show you all the various things we're doing in gpu to make the
> dma layer work for us. That kind of noodling around in lower levels to
> get them to do what we want is absolutely par for the course for gpu
> drivers. If you just nack everything I point you at for illustrative
> purposes, then I can't show you stuff anymore.

No, it's not.  No driver (and that includes the magic GPUs) has
any business messing with dma ops directly.

A GPU driver might have a very valid reason to disable the IOMMU,
but the code to do so needs to be at least in the arch code, maybe
in the dma-mapping/iommu code, not in the driver.

As a first step to get the discussion started we'll simply need
to move the code Thierry wrote into a helper in arch/arm and that
alone would be a massive improvement.  I'm not even talking about
minor details like actually using arm_get_dma_map_ops instead
of duplicating it.

And doing this basic trivial work really helps to get this whole
mess under control.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  7:09                                             ` Christoph Hellwig
@ 2018-04-25  7:30                                               ` Daniel Vetter
  2018-04-25  7:56                                                 ` Thierry Reding
  2018-04-25  7:43                                               ` Thierry Reding
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25  7:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, Thierry Reding

On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > Can we please not nack everything right away? Doesn't really motivate
> > me to show you all the various things we're doing in gpu to make the
> > dma layer work for us. That kind of noodling around in lower levels to
> > get them to do what we want is absolutely par for the course for gpu
> > drivers. If you just nack everything I point you at for illustrative
> > purposes, then I can't show you stuff anymore.
> 
> No, it's not.  No driver (and that includes the magic GPUs) has
> any business messing with dma ops directly.
> 
> A GPU driver might have a very valid reason to disable the IOMMU,
> but the code to do so needs to be at least in the arch code, maybe
> in the dma-mapping/iommu code, not in the driver.
> 
> As a first step to get the discussion started we'll simply need
> to move the code Thierry wrote into a helper in arch/arm and that
> alone would be a massive improvement.  I'm not even talking about
> minor details like actually using arm_get_dma_map_ops instead
> of duplicating it.
> 
> And doing this basic trivial work really helps to get this whole
> mess under control.

Ah ok. It did sound a bit like a much more categorical NAK than an "ack
in principle, but we need to shuffle the implementation into the right
place first". In the past we generally got a principled NAK on anything
funny we've been doing with the dma api, and the dma api maintainer
steaming off telling us we're incompetent idiots. I guess I've been
branded a bit on this topic :-/

Really great that this is changing now.

On the patch itself: It might not be the right thing in all cases, since
for certain compression formats the nv gpu wants larger pages (easy to
allocate from vram, not so easy from main memory), so might need the iommu
still. But currently that's not implemented:

https://www.spinics.net/lists/dri-devel/msg173932.html

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:43                                         ` Christoph Hellwig
  2018-04-25  7:02                                           ` Daniel Vetter
@ 2018-04-25  7:41                                           ` Thierry Reding
  2018-04-25  8:54                                             ` nouveau vs arm dma ops Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Thierry Reding @ 2018-04-25  7:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

[-- Attachment #1: Type: text/plain, Size: 2182 bytes --]

On Tue, Apr 24, 2018 at 11:43:35PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
> > For more fun:
> > 
> > https://www.spinics.net/lists/dri-devel/msg173630.html
> > 
> > Yeah, sometimes we want to disable the iommu because the on-gpu
> > pagetables are faster ...
> 
> I am not on this list, but remote NAK from here.  This needs an
> API from the iommu/dma-mapping code.  Drivers have no business poking
> into these details.

The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
which is rather misleading if they are not meant to be used by drivers
directly.

> Thierry, please resend this with at least the iommu list and
> linux-arm-kernel in Cc to have a proper discussion on the right API.

I'm certainly open to help with finding a correct solution, but the
patch above was purposefully terse because this is something that I
hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
such a backport between ARM and DRM trees does not sound like something
that would help getting this fixed in v4.16.

The fundamental issue here is that the DMA/IOMMU integration is
something that has caused a number of surprising regressions in the past
because it tends to sneak in unexpectedly. For example the current
regression shows up only if CONFIG_ARM_DMA_USE_IOMMU=y because the DMA
API will then transparently create a second mapping and mess things up.
Everything works fine if that option is disabled. This is ultimately why
we didn't notice, since we don't enable that option by default. I do
have a patch that I plan to apply to the Tegra tree that will always
enable CONFIG_ARM_DMA_USE_IOMMU=y on Tegra to avoid any such surprises
in the future, but I can obviously only apply that once the above patch
is applied to Nouveau, otherwise we'll break Nouveau unconditionally.

Granted, this issue could've been caught with a little more testing, but
in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
was just enabled unconditionally if it has side-effects that platforms
don't opt in to but have to explicitly opt out of.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  7:09                                             ` Christoph Hellwig
  2018-04-25  7:30                                               ` Daniel Vetter
@ 2018-04-25  7:43                                               ` Thierry Reding
  1 sibling, 0 replies; 83+ messages in thread
From: Thierry Reding @ 2018-04-25  7:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > Can we please not nack everything right away? Doesn't really motivate
> > me to show you all the various things we're doing in gpu to make the
> > dma layer work for us. That kind of noodling around in lower levels to
> > get them to do what we want is absolutely par for the course for gpu
> > drivers. If you just nack everything I point you at for illustrative
> > purposes, then I can't show you stuff anymore.
> 
> No, it's not.  No driver (and that includes the magic GPUs) has
> any business messing with dma ops directly.
> 
> A GPU driver might have a very valid reason to disable the IOMMU,
> but the code to do so needs to be at least in the arch code, maybe
> in the dma-mapping/iommu code, not in the driver.
> 
> As a first step to get the discussion started we'll simply need
> to move the code Thierry wrote into a helper in arch/arm and that
> alone would be a massive improvement.  I'm not even talking about
> minor details like actually using arm_get_dma_map_ops instead
> of duplicating it.
> 
> And doing this basic trivial work really helps to get this whole
> mess under control.

That's a good idea and I can prepare patches for this, but I'd like to
make those changes on top of the fix to keep the option of getting this
into v4.16 if at all possible.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  7:30                                               ` Daniel Vetter
@ 2018-04-25  7:56                                                 ` Thierry Reding
  2018-04-25  8:55                                                   ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Thierry Reding @ 2018-04-25  7:56 UTC (permalink / raw)
  To: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

[-- Attachment #1: Type: text/plain, Size: 3063 bytes --]

On Wed, Apr 25, 2018 at 09:30:39AM +0200, Daniel Vetter wrote:
> On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > > Can we please not nack everything right away? Doesn't really motivate
> > > me to show you all the various things we're doing in gpu to make the
> > > dma layer work for us. That kind of noodling around in lower levels to
> > > get them to do what we want is absolutely par for the course for gpu
> > > drivers. If you just nack everything I point you at for illustrative
> > > purposes, then I can't show you stuff anymore.
> > 
> > No, it's not.  No driver (and that includes the magic GPUs) has
> > any business messing with dma ops directly.
> > 
> > A GPU driver might have a very valid reason to disable the IOMMU,
> > but the code to do so needs to be at least in the arch code, maybe
> > in the dma-mapping/iommu code, not in the driver.
> > 
> > As a first step to get the discussion started we'll simply need
> > to move the code Thierry wrote into a helper in arch/arm and that
> > alone would be a massive improvement.  I'm not even talking about
> > minor details like actually using arm_get_dma_map_ops instead
> > of duplicating it.
> > 
> > And doing this basic trivial work really helps to get this whole
> > mess under control.
> 
> Ah ok. It did sound a bit like a much more categorical NAK than an "ack
> in principle, but we need to shuffle the implementation into the right
> place first". In the past we generally got a principled NAK on anything
> funny we've been doing with the dma api, and the dma api maintainer
> steaming off telling us we're incompetent idiots. I guess I've been
> branded a bit on this topic :-/
> 
> Really great that this is changing now.
> 
> On the patch itself: It might not be the right thing in all cases, since
> for certain compression formats the nv gpu wants larger pages (easy to
> allocate from vram, not so easy from main memory), so might need the iommu
> still. But currently that's not implemented:
> 
> https://www.spinics.net/lists/dri-devel/msg173932.html

To clarify: we do want to use the IOMMU, but we want to use it
explicitly via the IOMMU API rather than hiding it behind the DMA API.
We do the same thing in Tegra DRM where we don't want to use the DMA API
because it doesn't allow us to share the same mapping between multiple
display controllers in the same way the IOMMU API does. We've also been
thinking about using the IOMMU API directly in order to support process
isolation for devices that accept command streams from userspace.

Fortunately the issue I'm seeing with Nouveau doesn't happen with Tegra
DRM, which seems to be because we have an IOMMU group with multiple
devices and that prevents the DMA API from "hijacking" the IOMMU domain
for the group.

And to add to the confusion, none of this seems to be an issue on 64-bit
ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is
used.

Thierry


^ permalink raw reply	[flat|nested] 83+ messages in thread

* noveau vs arm dma ops
  2018-04-25  7:41                                           ` Thierry Reding
@ 2018-04-25  8:54                                             ` Christoph Hellwig
  2018-04-25  9:25                                               ` Russell King - ARM Linux
  2018-04-25 10:04                                               ` Daniel Vetter
  0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  8:54 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Christoph Hellwig, Daniel Vetter, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, linux-arm-kernel

[discussion about this patch, which should have been cced to the iommu
 and linux-arm-kernel lists, but wasn't:
 https://www.spinics.net/lists/dri-devel/msg173630.html]

On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > API from the iommu/dma-mapping code.  Drivers have no business poking
> > into these details.
> 
> The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> which is rather misleading if they are not meant to be used by drivers
> directly.

The only reason the DMA ops are exported is because get_arch_dma_ops
references them (or, in the case of the coherent ones, used to reference
them).  We don't let drivers assign random dma ops.

> 
> > Thierry, please resend this with at least the iommu list and
> > linux-arm-kernel in Cc to have a proper discussion on the right API.
> 
> I'm certainly open to help with finding a correct solution, but the
> patch above was purposefully terse because this is something that I
> hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> such a backport between ARM and DRM trees does not sound like something
> that would help getting this fixed in v4.16.

Coordinating the backport of a trivial helper in the arm tree is not
the end of the world.  Really, this cowboy attitude is a good reason
why graphics folks have such a bad rep.  You keep poking into random
kernel internals, don't talk to anyone and then complain if people
are upset.  This shouldn't be surprising.

> Granted, this issue could've been caught with a little more testing, but
> in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> was just enabled unconditionally if it has side-effects that platforms
> don't opt in to but have to explicitly opt out of.

Agreed on that count.  Please send a patch.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  7:56                                                 ` Thierry Reding
@ 2018-04-25  8:55                                                   ` Christoph Hellwig
  0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25  8:55 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Christoph Hellwig, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 09:56:43AM +0200, Thierry Reding wrote:
> And to add to the confusion, none of this seems to be an issue on 64-bit
> ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is
> used.

In the long term I want everyone to use that code.  Help welcome!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25  8:54                                             ` noveau vs arm dma ops Christoph Hellwig
@ 2018-04-25  9:25                                               ` Russell King - ARM Linux
  2018-04-25 10:04                                               ` Daniel Vetter
  1 sibling, 0 replies; 83+ messages in thread
From: Russell King - ARM Linux @ 2018-04-25  9:25 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Christoph Hellwig, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse, iommu,
	dri-devel, Daniel Vetter, Dan Williams, Logan Gunthorpe,
	Christian König, linux-arm-kernel,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.

EXPORT_SYMBOL* means nothing as far as whether a driver should
be able to use the symbol or not - it merely means that the symbol is
made available to a module.  Modules cover much more than just device
drivers - we have library modules, filesystem modules, helper modules
to name a few non-driver classes of modules.

We also have symbols that are exported as part of the architecture
implementation detail of a public interface.  For example, the
public interface "copy_from_user" is implemented as an inline
function (actually several layers of inline functions) eventually
calling into arm_copy_from_user().  arm_copy_from_user() is exported,
but no driver (in fact no module at all) is allowed to make direct
reference to arm_copy_from_user() - it'd fail when software PAN is enabled.

The whole idea that "if a symbol is exported, it's fine for a driver
to use it" is a complete work of fiction, always has been, and always
will be.

We've had this with the architecture implementation details of the
DMA API before, and with the architecture implementation details of
the CPU cache flushing.  There's only so much commentary, or __
prefixes you can add to a symbol before things get ridiculous, and
none of it stops people creating this abuse.  The only thing that
seems to prevent it is to make life hard for folk wanting to use
the symbols (eg, hiding the symbol prototype in a private header,
etc.)

Never, ever go under the covers of an interface.  If the interface
doesn't do what you want, _discuss_ it, don't just think "oh, that
architecture private facility looks like what I need, I'll use that
directly."

If you ever are on the side of trying to maintain those implementation
details that are abused in this way, you'll soon understand why this
behaviour by driver authors is so annoying, and the maintainability
problems it creates.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25  8:54                                             ` noveau vs arm dma ops Christoph Hellwig
  2018-04-25  9:25                                               ` Russell King - ARM Linux
@ 2018-04-25 10:04                                               ` Daniel Vetter
  2018-04-25 15:33                                                 ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25 10:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thierry Reding, Daniel Vetter, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, linux-arm-kernel

On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.
> 
> The only reason the DMA ops are exported is because get_arch_dma_ops
> references them (or, in the case of the coherent ones, used to reference
> them).  We don't let drivers assign random dma ops.
> 
> > 
> > > Thierry, please resend this with at least the iommu list and
> > > linux-arm-kernel in Cc to have a proper discussion on the right API.
> > 
> > I'm certainly open to help with finding a correct solution, but the
> > patch above was purposefully terse because this is something that I
> > hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> > such a backport between ARM and DRM trees does not sound like something
> > that would help getting this fixed in v4.16.
> 
> Coordinating the backport of a trivial helper in the arm tree is not
> the end of the world.  Really, this cowboy attitude is a good reason
> why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> are upset.  This shouldn't be surprising.

Not really agreeing on the cowboy thing. The fundamental problem is that
the dma api provides abstraction that seriously gets in the way of writing
a gpu driver. Some examples:

- We never want bounce buffers, ever. dma_map_sg gives us that, so there's
  hacks to fall back to a cache of pages allocated using
  dma_alloc_coherent if you build a kernel with bounce buffers.

- dma api hides the cache flushing requirements from us. GPUs love
  non-snooped access, and worse give userspace control over that. We want
  a strict separation between mapping stuff and flushing stuff. With the
  IOMMU api we mostly have the former, but for the latter, arch maintainers
  regularly tell us they won't allow that. So we have drm_clflush.c.

- dma api hides how/where memory is allocated. Kinda similar problem,
  except now for CMA or address limits. So either we roll our own
  allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
  we use dma_alloc_coherent and then grab the sgt to get at the CMA
  allocations because that's the only way. Which sucks, because we can't
  directly tell CMA how to back off if there's some way to make CMA memory
  available through other means (gpus love to hog all of memory, so we
  have shrinkers and everything).

For display drivers the dma api mostly works, until you start sharing
buffers with other devices.

So from our perspective it fairly often looks like core folks just don't
want to support gpu use-cases, so we play a bit more cowboy and get things
done some other way. Since this has been going on for years now we often
don't even bother to start a discussion first, since it tended to go
nowhere useful.
-Daniel

> > Granted, this issue could've been caught with a little more testing, but
> > in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> > was just enabled unconditionally if it has side-effects that platforms
> > don't opt in to but have to explicitly opt out of.
> 
> Agreed on that count.  Please send a patch.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 10:04                                               ` Daniel Vetter
@ 2018-04-25 15:33                                                 ` Christoph Hellwig
  2018-04-25 21:35                                                   ` Daniel Vetter
  2018-04-25 22:54                                                   ` Russell King - ARM Linux
  0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-25 15:33 UTC (permalink / raw)
  To: Christoph Hellwig, Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, linux-arm-kernel

On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > Coordinating the backport of a trivial helper in the arm tree is not
> > the end of the world.  Really, this cowboy attitude is a good reason
> > why graphics folks have such a bad rep.  You keep poking into random
> > kernel internals, don't talk to anyone and then complain if people
> > are upset.  This shouldn't be surprising.
> 
> Not really agreeing on the cowboy thing. The fundamental problem is that
> the dma api provides abstraction that seriously gets in the way of writing
> a gpu driver. Some examples:

So talk to other people.  Maybe people share your frustration.  Or maybe
other people have a way to help.

> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>   hacks to fall back to a cache of pages allocated using
>   dma_alloc_coherent if you build a kernel with bounce buffers.

get_required_mask() is supposed to tell you if you are safe.  However
we are missing lots of implementations of it for iommus so you might get
some false negatives, improvements welcome.  It's been on my list of
things to fix in the DMA API, but it is nowhere near the top.

> - dma api hides the cache flushing requirements from us. GPUs love
>   non-snooped access, and worse give userspace control over that. We want
>   a strict separation between mapping stuff and flushing stuff. With the
>   IOMMU api we mostly have the former, but for the latter, arch maintainers
>   regularly tell us they won't allow that. So we have drm_clflush.c.

The problem is that an entirely separate cache flushing API is hard. That
being said if you look at my generic dma-noncoherent API series it tries
to move that way.  So far it is in early stages and apparently rather
buggy unfortunately.

> - dma api hides how/where memory is allocated. Kinda similar problem,
>   except now for CMA or address limits. So either we roll our own
>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>   allocations because that's the only way. Which sucks, because we can't
>   directly tell CMA how to back off if there's some way to make CMA memory
>   available through other means (gpus love to hog all of memory, so we
>   have shrinkers and everything).

If you really care about doing explicitly cache flushing anyway (see
above) allocating your own memory and mapping it where needed is by
far the superior solution.  On cache coherent architectures
dma_alloc_coherent is nothing but allocate memory + dma_map_single.
On non-coherent architectures the memory might come from a special
pool or must be used through a special virtual address mapping that
is set up either statically or dynamically.  For that case splitting
allocation and mapping is a good idea in many ways, and I plan to move
towards that once the number of dma mapping implementations is down
to a reasonable number so that it can actually be done.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25  6:41                                         ` Christoph Hellwig
@ 2018-04-25 17:44                                           ` Alex Deucher
  2018-04-25 18:38                                             ` Dan Williams
  2018-05-04 12:45                                             ` Lucas Stach
  0 siblings, 2 replies; 83+ messages in thread
From: Alex Deucher @ 2018-04-25 17:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Linux Kernel Mailing List, dri-devel,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	amd-gfx list, Dan Williams, Logan Gunthorpe,
	Christian König, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>> > It has a non-coherent transaction mode (which the chipset can opt to
>> > not implement and still flush), to make sure the AGP horror show
>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>> > least my understanding from digging around in amd the last time we had
>> > coherency issues between intel and amd gpus. GPUs have some bits
>> > somewhere (in the pagetables, or in the buffer object description
>> > table created by userspace) to control that stuff.
>>
>> Right.  We have a bit in the GPU page table entries that determines
>> whether we snoop the CPU's cache or not.
>
> I can see how that works with the GPU on the same SOC or SOC set as the
> CPU.  But how is that going to work for a GPU that is a plain old PCIe
> card?  The cache snooping in that case is happening in the PCIe root
> complex.

I'm not a pci expert, but as far as I know, the device sends either a
snooped or non-snooped transaction on the bus.  I think the
transaction descriptor supports a no snoop attribute.  Our GPUs have
supported this feature for probably 20 years if not more, going back
to PCI.  Non-snooped transactions have lower latency and higher
throughput than snooped transactions.

Alex

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25 17:44                                           ` Alex Deucher
@ 2018-04-25 18:38                                             ` Dan Williams
  2018-05-04 12:45                                             ` Lucas Stach
  1 sibling, 0 replies; 83+ messages in thread
From: Dan Williams @ 2018-04-25 18:38 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christoph Hellwig, Daniel Vetter, Linux Kernel Mailing List,
	dri-devel, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Jerome Glisse, amd-gfx list, Logan Gunthorpe,
	Christian König, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 10:44 AM, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>>> > It has a non-coherent transaction mode (which the chipset can opt to
>>> > not implement and still flush), to make sure the AGP horror show
>>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>>> > least my understanding from digging around in amd the last time we had
>>> > coherency issues between intel and amd gpus. GPUs have some bits
>>> > somewhere (in the pagetables, or in the buffer object description
>>> > table created by userspace) to control that stuff.
>>>
>>> Right.  We have a bit in the GPU page table entries that determines
>>> whether we snoop the CPU's cache or not.
>>
>> I can see how that works with the GPU on the same SOC or SOC set as the
>> CPU.  But how is that going to work for a GPU that is a plain old PCIe
>> card?  The cache snooping in that case is happening in the PCIe root
>> complex.
>
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Non-snooped transactions have lower latency and higher
> throughput than snooped transactions.

Right, 'no snoop' and 'relaxed ordering' have been part of the PCI
spec since forever. With a plain old PCI-E card the root complex
indeed arranges for caches to be snooped. Although it's not always
faster depending on the platform. 'No snoop' traffic may be relegated
to fewer bus resources relative to the much more common snooped
traffic.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 15:33                                                 ` Christoph Hellwig
@ 2018-04-25 21:35                                                   ` Daniel Vetter
  2018-04-25 23:26                                                     ` Russell King - ARM Linux
  2018-04-26  9:09                                                     ` Christoph Hellwig
  2018-04-25 22:54                                                   ` Russell King - ARM Linux
  1 sibling, 2 replies; 83+ messages in thread
From: Daniel Vetter @ 2018-04-25 21:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, Linux ARM

On Wed, Apr 25, 2018 at 5:33 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > Coordinating the backport of a trivial helper in the arm tree is not
>> > the end of the world.  Really, this cowboy attitude is a good reason
>> > why graphics folks have such a bad rep.  You keep poking into random
>> > kernel internals, don't talk to anyone and then complain if people
>> > are upset.  This shouldn't be surprising.
>>
>> Not really agreeing on the cowboy thing. The fundamental problem is that
>> the dma api provides abstraction that seriously gets in the way of writing
>> a gpu driver. Some examples:
>
> So talk to other people.  Maybe people share your frustation.  Or maybe
> other people have a way to help.
>
>> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>>   hacks to fall back to a cache of pages allocated using
>>   dma_alloc_coherent if you build a kernel with bounce buffers.
>
> get_required_mask() is supposed to tell you if you are safe.  However
> we are missing lots of implementations of it for iommus so you might get
> some false negatives, improvements welcome.  It's been on my list of
> things to fix in the DMA API, but it is nowhere near the top.

It hasn't come up in a while with any fireworks, so I honestly don't
remember exactly what the issues were. But

commit d766ef53006c2c38a7fe2bef0904105a793383f2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Dec 19 12:43:45 2016 +0000

    drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping

and the various bits of code that a

$ git grep SWIOTLB -- drivers/gpu

turns up are what we're doing to hack around that stuff. And in general
(there's some exceptions) gpus should be able to address everything,
so I never fully understood where that's even coming from.

>> - dma api hides the cache flushing requirements from us. GPUs love
>>   non-snooped access, and worse give userspace control over that. We want
>>   a strict separation between mapping stuff and flushing stuff. With the
>>   IOMMU api we mostly have the former, but for the latter, arch maintainers
>>   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that an entirely separate cache flushing API is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

I'm assuming this stuff here?

https://lkml.org/lkml/2018/4/20/146

Anyway, I got a bit lost in all that work; it looks really nice.

>> - dma api hides how/where memory is allocated. Kinda similar problem,
>>   except now for CMA or address limits. So either we roll our own
>>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>>   allocations because that's the only way. Which sucks, because we can't
>>   directly tell CMA how to back off if there's some way to make CMA memory
>>   available through other means (gpus love to hog all of memory, so we
>>   have shrinkers and everything).
>
> If you really care about doing explicitly cache flushing anyway (see
> above) allocating your own memory and mapping it where needed is by
> far the superior solution.  On cache coherent architectures
> dma_alloc_coherent is nothing but allocate memory + dma_map_single.
> On non-coherent architectures the memory might come from a special
> pool or must be used through a special virtual address mapping that
> is set up either statically or dynamically.  For that case splitting
> allocation and mapping is a good idea in many ways, and I plan to move
> towards that once the number of dma mapping implementations is down
> to a reasonable number so that it can actually be done.

Yeah the above is pretty much what we do on x86. dma-api believes
everything is coherent, so dma_map_sg does the mapping we want and
nothing else (minus swiotlb fun). Cache flushing, allocations, all
done by the driver.

On arm that doesn't work. The iommu api seems like a good fit, except
the dma-api tends to get in the way a bit (drm/msm apparently has
similar problems to tegra), and if you need contiguous memory
dma_alloc_coherent is the only way to get at contiguous memory. There
was a huge discussion years ago about that, and direct cma access was
shot down because it would have exposed too much of the caching
attribute mangling required (most arm platforms need wc-pages to not
be in the kernel's linear map apparently).

Anything that separates these three things more (allocation pools, mapping
through IOMMUs and flushing cpu caches) sounds like the right
direction to me. Even if that throws some portability across platforms
away - drivers who want to control things in this much detail aren't
really portable (without some serious work) anyway.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 15:33                                                 ` Christoph Hellwig
  2018-04-25 21:35                                                   ` Daniel Vetter
@ 2018-04-25 22:54                                                   ` Russell King - ARM Linux
  2018-04-26  9:13                                                     ` Christoph Hellwig
  2018-04-26  9:20                                                     ` [Linaro-mm-sig] " Daniel Vetter
  1 sibling, 2 replies; 83+ messages in thread
From: Russell King - ARM Linux @ 2018-04-25 22:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, linux-arm-kernel

On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > - dma api hides the cache flushing requirements from us. GPUs love
> >   non-snooped access, and worse give userspace control over that. We want
> >   a strict separation between mapping stuff and flushing stuff. With the
> >   IOMMU api we mostly have the former, but for the latter, arch maintainers
> >   regularly tell us they won't allow that. So we have drm_clflush.c.
> 
> The problem is that an entirely separate cache flushing API is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

And if folk want a cacheable mapping with explicit cache flushing, the
cache flushing must not be defined in terms of "this is what the CPU seems
to need", but from the point of view of a CPU with infinite prefetching,
infinite caching and infinite capacity to perform writebacks of dirty
cache lines at unexpected moments when the memory is mapped in a
cacheable mapping.

(The reason for that is you're operating in a non-CPU specific space,
so you can't make any guarantees as to how much caching or prefetching
will occur by the CPU - different CPUs will do different amounts.)

So, for example, the sequence:

GPU writes to memory
			CPU reads from cacheable memory

if the memory was previously dirty (iow, CPU has written), you need to
flush the dirty cache lines _before_ the GPU writes happen, but you
don't know whether the CPU has speculatively prefetched, so you need
to flush any prefetched cache lines before reading from the cacheable
memory _after_ the GPU has finished writing.

Also note that "flush" there can be "clean the cache", "clean and
invalidate the cache" or "invalidate the cache" as appropriate - some
CPUs are able to perform those three operations, and the appropriate
one depends on not only where in the above sequence it's being used,
but also on what the operations are.

So, the above sequence could be:

			CPU invalidates cache for memory
				(due to possible dirty cache lines)
GPU writes to memory
			CPU invalidates cache for memory
				(to get rid of any speculatively prefetched
				 lines)
			CPU reads from cacheable memory

Yes, in the above case, _two_ cache operations are required to ensure
correct behaviour.  However, if you know for certain that the memory was
previously clean, then the first cache operation can be skipped.

What I'm pointing out is there's much more than just "I want to flush
the cache" here, which is currently what DRM seems to assume at the
moment with the code in drm_cache.c.

If we can agree a set of interfaces that allows _proper_ use of these
facilities, one which can be used appropriately, then there shouldn't
be a problem.  The DMA API does that via it's ideas about who owns a
particular buffer (because of the above problem) and that's something
which would need to be carried over to such a cache flushing API (it
should be pretty obvious that having a GPU read or write memory while
the cache for that memory is being cleaned will lead to unexpected
results.)

Also note that things get even more interesting in a SMP environment
if cache operations aren't broadcast across the SMP cluster (which
means cache operations have to be IPI'd to other CPUs.)

The next issue, which I've brought up before, is what comes of exposing
cache flushing to userspace on architectures where it isn't already
exposed.  As has been shown by Google Project Zero, this risks exposing
those architectures to Spectre and Meltdown exploits where they weren't
at such a risk before.  (I've pretty much shown here that you _do_
need to control which cache lines get flushed to make these exploits
work, and flushing the cache by reading lots of data in lieu of having
the ability to explicitly flush bits of cache makes it very difficult,
if not impossible, for them to work.)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 21:35                                                   ` Daniel Vetter
@ 2018-04-25 23:26                                                     ` Russell King - ARM Linux
  2018-04-26  9:17                                                       ` Daniel Vetter
  2018-04-26  9:09                                                     ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Russell King - ARM Linux @ 2018-04-25 23:26 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Linux Kernel Mailing List, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	iommu, dri-devel, Dan Williams, Thierry Reding, Logan Gunthorpe,
	Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> similar problems like tegra), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

I think you completely misunderstand ARM from what you've written above,
and this worries me greatly about giving DRM the level of control that
is being asked for.

Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
attributes are stored in the page tables.  These caches are inherently
non-aliasing when there are multiple mappings (which is a great step
forward compared to the previous aliasing caches.)

As the cache attributes are stored in the page tables, this in theory
allows different virtual mappings of the same physical memory to have
different cache attributes.  However, there's a problem, and that's
called speculative prefetching.

Let's say you have one mapping which is cacheable, and another that is
marked as write combining.  If a cache line is speculatively prefetched
through the cacheable mapping of this memory, and then you read the
same physical location through the write combining mapping, it is
possible that you could read cached data.

So, it is generally accepted that all mappings of any particular
physical bit of memory should have the same cache attributes to avoid
unpredictable behaviour.

This presents a problem with what is generally called "lowmem" where
the memory is mapped in kernel virtual space with cacheable
attributes.  It can also happen with highmem if the memory is
kmapped.

This is why, on ARM, you can't use something like get_free_pages() to
grab some pages from the system, pass it to the GPU, map it into
userspace as write-combining, etc.  It _might_ work for some CPUs,
but ARM CPUs vary in how much prefetching they do, and what may work
for one particular CPU is in no way guaranteed to work for another
ARM CPU.

The official line from architecture folk is to assume that the caches
infinitely speculate, are of infinite size, and can writeback *dirty*
data at any moment.

The way to stop things like speculative prefetches to particular
physical memory is to, quite "simply", not have any cacheable
mappings of that physical memory anywhere in the system.

Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
If you have, say, an 8MB buffer (for a 1080p frame) and you need to
do a cache operation on that buffer, you'll be iterating over it
32 or maybe 64 bytes at a time "just in case" there's a cache line
present.  Referring to my previous email, where I detailed the
potential need for _two_ flushes, one before the GPU operation and
one after: this becomes _really_ expensive.  At that point, you're
probably way better off using write-combine memory where you don't
need to spend CPU cycles performing cache flushing - potentially
across all CPUs in the system if cache operations aren't broadcasted.

This isn't a simple matter of "just provide some APIs for cache
operations" - there's much more that needs to be understood by
all parties here, especially when we have GPU drivers that can be
used with quite different CPUs.

It may well be that for some combinations of CPUs and workloads, it's
better to use write-combine memory without cache flushing, but for
other CPUs that tradeoff (for the same workload) could well be
different.

Older ARMs get more interesting, because they have aliasing caches.
That means the CPU cache aliases across different virtual space
mappings in some way, which complicates (a) the mapping of memory
and (b) handling the cache operations on it.

It's too late for me to go into that tonight, and I probably won't
be reading mail for the next week and a half, sorry.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 21:35                                                   ` Daniel Vetter
  2018-04-25 23:26                                                     ` Russell King - ARM Linux
@ 2018-04-26  9:09                                                     ` Christoph Hellwig
  2018-04-26  9:45                                                       ` Daniel Vetter
  2018-04-26 11:12                                                       ` Russell King - ARM Linux
  1 sibling, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-26  9:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, Linux ARM

On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> > get_required_mask() is supposed to tell you if you are safe.  However
> > we are missing lots of implementations of it for iommus so you might get
> > some false negatives, improvements welcome.  It's been on my list of
> > things to fix in the DMA API, but it is nowhere near the top.
> 
> It hasn't come up in a while in some fireworks, so I honestly don't
> remember exactly what the issues have been. But
> 
> commit d766ef53006c2c38a7fe2bef0904105a793383f2
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Dec 19 12:43:45 2016 +0000
> 
>     drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
> 
> and the various bits of code that a
> 
> $ git grep SWIOTLB -- drivers/gpu
> 
> turns up is what we're doing to hack around that stuff. And in general
> (there's some exceptions) gpus should be able to address everything,
> so I never fully understood where that's even coming from.

I'm pretty sure I've seen some oddly low dma masks in GPU drivers.  E.g.
duplicated in various AMD files:

	adev->need_dma32 = false;
        dma_bits = adev->need_dma32 ? 32 : 40;
        r = pci_set_dma_mask(adev->pdev, DMA_BIT_MASK(dma_bits));
        if (r) {
                adev->need_dma32 = true;
                dma_bits = 32;
                dev_warn(adev->dev, "amdgpu: No suitable DMA available.\n");
        }

synopsys:

drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:    pdevinfo.dma_mask       = DMA_BIT_MASK(32);
drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);
drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);

etnaviv gets it right:

drivers/gpu/drm/etnaviv/etnaviv_gpu.c:          u32 dma_mask = (u32)dma_get_required_mask(gpu->dev);


But yes, the swiotlb hackery really irks me.  I just have some more
important and bigger fires to fight first, but I plan to get back to the
root cause of that eventually.

> 
> >> - dma api hides the cache flushing requirements from us. GPUs love
> >>   non-snooped access, and worse give userspace control over that. We want
> >>   a strict separation between mapping stuff and flushing stuff. With the
> >>   IOMMU api we mostly have the former, but for the later arch maintainers
> >>   regularly tells they won't allow that. So we have drm_clflush.c.
> >
> > The problem is that a cache flushing API entirely separate is hard. That
> > being said if you look at my generic dma-noncoherent API series it tries
> > to move that way.  So far it is in early stages and apparently rather
> > buggy unfortunately.
> 
> I'm assuming this stuff here?
> 
> https://lkml.org/lkml/2018/4/20/146
> 
> Anyway got lost in all that work a bit, looks really nice.

That url doesn't seem to work currently.  But I am talking about the
thread titled '[RFC] common non-cache coherent direct dma mapping ops'

> Yeah the above is pretty much what we do on x86. dma-api believes
> everything is coherent, so dma_map_sg does the mapping we want and
> nothing else (minus swiotlb fun). Cache flushing, allocations, all
> done by the driver.

Which sounds like the right thing to do to me.

> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> similar problems like tegra), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

Simple cma_alloc() doesn't do anything about cache handling, it
just is a very dumb allocator for large contiguous regions inside
a big pool.

I'm not the CMA maintainer, but in general I'd love to see an
EXPORT_SYMBOL_GPL slapped onto cma_alloc/release, and have drivers use
that where needed.  Using that plus dma_map*/dma_unmap* sounds like
a much saner interface than dma_alloc_attrs + DMA_ATTR_NON_CONSISTENT
or DMA_ATTR_NO_KERNEL_MAPPING.

You don't happen to have a pointer to that previous discussion?

> Anything that separate these 3 things more (allocation pools, mapping
> through IOMMUs and flushing cpu caches) sounds like the right
> direction to me. Even if that throws some portability across platforms
> away - drivers who want to control things in this much detail aren't
> really portable (without some serious work) anyway.

As long as we stay away from the dma coherent allocations, separating
them is fine, and I'm working towards it.  dma coherent allocations, on
the other hand, are very hairy beasts, provided by very different
implementations depending on the architecture, or often even depending
on the specifics of individual implementations inside the architecture.

So for your GPU/media case it seems like you'd better stay away from
them as much as you can.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 22:54                                                   ` Russell King - ARM Linux
@ 2018-04-26  9:13                                                     ` Christoph Hellwig
  2018-04-26  9:20                                                     ` [Linaro-mm-sig] " Daniel Vetter
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-26  9:13 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Christoph Hellwig, Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, linux-arm-kernel

On Wed, Apr 25, 2018 at 11:54:43PM +0100, Russell King - ARM Linux wrote:
> 
> if the memory was previously dirty (iow, CPU has written), you need to
> flush the dirty cache lines _before_ the GPU writes happen, but you
> don't know whether the CPU has speculatively prefetched, so you need
> to flush any prefetched cache lines before reading from the cacheable
> memory _after_ the GPU has finished writing.
> 
> Also note that "flush" there can be "clean the cache", "clean and
> invalidate the cache" or "invalidate the cache" as appropriate - some
> CPUs are able to perform those three operations, and the appropriate
> one depends on not only where in the above sequence it's being used,
> but also on what the operations are.

Agreed on all these counts.

> If we can agree a set of interfaces that allows _proper_ use of these
> facilities, one which can be used appropriately, then there shouldn't
> be a problem.  The DMA API does that via its ideas about who owns a
> particular buffer (because of the above problem) and that's something
> which would need to be carried over to such a cache flushing API (it
> should be pretty obvious that having a GPU read or write memory while
> the cache for that memory is being cleaned will lead to unexpected
> results.)

I've been trying to come up with such an interface, for now only for
internal use in a generic set of noncoherent ops.  The API is basically
a variant of the existing dma_sync_single_for_device/cpu calls:

http://git.infradead.org/users/hch/misc.git/commitdiff/044dae5f94509288f4655de2f22cb0bea85778af

Review welcome!

> The next issue, which I've brought up before, is the risk that comes
> with exposing cache flushing to userspace on architectures where it
> isn't already exposed.  As has been shown by Google Project Zero, this
> risks exposing those architectures to Spectre and Meltdown exploits
> where they weren't at such a risk before.  (I've pretty much shown here
> that you _do_ need to control which cache lines get flushed to make
> these exploits work, and flushing the cache by reading lots of data in
> lieu of having the ability to explicitly flush bits of cache makes it
> very difficult to impossible for them to work.)

Extending dma coherence to userspace is going to be the next major
nightmare indeed.  I'm not sure how much of that actually still is
going on in the graphics world, but we'll need a coherent (pun intended)
plan how to deal with it.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-25 23:26                                                     ` Russell King - ARM Linux
@ 2018-04-26  9:17                                                       ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2018-04-26  9:17 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Christoph Hellwig, Linux Kernel Mailing List, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	iommu, dri-devel, Dan Williams, Thierry Reding, Logan Gunthorpe,
	Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Thu, Apr 26, 2018 at 1:26 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> I think you completely misunderstand ARM from what you've written above,
> and this worries me greatly about giving DRM the level of control that
> is being asked for.
>
> Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
> attributes are stored in the page tables.  These caches are inherently
> non-aliasing when there are multiple mappings (which is a great step
> forward compared to the previous aliasing caches.)
>
> As the cache attributes are stored in the page tables, this in theory
> allows different virtual mappings of the same physical memory to have
> different cache attributes.  However, there's a problem, and that's
> called speculative prefetching.
>
> Let's say you have one mapping which is cacheable, and another that is
> marked as write combining.  If a cache line is speculatively prefetched
> through the cacheable mapping of this memory, and then you read the
> same physical location through the write combining mapping, it is
> possible that you could read cached data.
>
> So, it is generally accepted that all mappings of any particular
> physical bit of memory should have the same cache attributes to avoid
> unpredictable behaviour.
>
> This presents a problem with what is generally called "lowmem" where
> the memory is mapped in kernel virtual space with cacheable
> attributes.  It can also happen with highmem if the memory is
> kmapped.
>
> This is why, on ARM, you can't use something like get_free_pages() to
> grab some pages from the system, pass it to the GPU, map it into
> userspace as write-combining, etc.  It _might_ work for some CPUs,
> but ARM CPUs vary in how much prefetching they do, and what may work
> for one particular CPU is in no way guaranteed to work for another
> ARM CPU.
>
> The official line from architecture folk is to assume that the caches
> infinitely speculate, are of infinite size, and can writeback *dirty*
> data at any moment.
>
> The way to stop things like speculative prefetches to particular
> physical memory is to, quite "simply", not have any cacheable
> mappings of that physical memory anywhere in the system.
>
> Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
> If you have, say, an 8MB buffer (for a 1080p frame) and you need to
> do a cache operation on that buffer, you'll be iterating over it
> 32 or maybe 64 bytes at a time "just in case" there's a cache line
> present.  Referring to my previous email, where I detailed the
> potential need for _two_ flushes, one before the GPU operation and
> one after: this becomes _really_ expensive.  At that point, you're
> probably way better off using write-combine memory where you don't
> need to spend CPU cycles performing cache flushing - potentially
> across all CPUs in the system if cache operations aren't broadcasted.
>
> This isn't a simple matter of "just provide some APIs for cache
> operations" - there's much more that needs to be understood by
> all parties here, especially when we have GPU drivers that can be
> used with quite different CPUs.
>
> It may well be that for some combinations of CPUs and workloads, it's
> better to use write-combine memory without cache flushing, but for
> other CPUs that tradeoff (for the same workload) could well be
> different.
>
> Older ARMs get more interesting, because they have aliasing caches.
> That means the CPU cache aliases across different virtual space
> mappings in some way, which complicates (a) the mapping of memory
> and (b) handling the cache operations on it.
>
> It's too late for me to go into that tonight, and I probably won't
> be reading mail for the next week and a half, sorry.

I didn't know all the details well enough (nor had the time to
write a few paragraphs like you did), but the above is what I had in
mind and meant. Sorry if my sloppy reply sounded like I'm mixing stuff
up.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] noveau vs arm dma ops
  2018-04-25 22:54                                                   ` Russell King - ARM Linux
  2018-04-26  9:13                                                     ` Christoph Hellwig
@ 2018-04-26  9:20                                                     ` Daniel Vetter
  2018-04-26  9:24                                                       ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2018-04-26  9:20 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Christoph Hellwig, Linux Kernel Mailing List, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	iommu, dri-devel, Dan Williams, Thierry Reding, Logan Gunthorpe,
	Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Thu, Apr 26, 2018 at 12:54 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
>> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > - dma api hides the cache flushing requirements from us. GPUs love
>> >   non-snooped access, and worse give userspace control over that. We want
>> >   a strict separation between mapping stuff and flushing stuff. With the
>> >   IOMMU api we mostly have the former, but for the later arch maintainers
>> >   regularly tells they won't allow that. So we have drm_clflush.c.
>>
>> The problem is that a cache flushing API entirely separate is hard. That
>> being said if you look at my generic dma-noncoherent API series it tries
>> to move that way.  So far it is in early stages and apparently rather
>> buggy unfortunately.
>
> And if folk want a cacheable mapping with explicit cache flushing, the
> cache flushing must not be defined in terms of "this is what CPU seems
> to need" but from the point of view of a CPU with infinite prefetching,
> infinite caching and infinite capacity to perform writebacks of dirty
> cache lines at unexpected moments when the memory is mapped in a
> cacheable mapping.
>
> (The reason for that is you're operating in a non-CPU specific space,
> so you can't make any guarantees as to how much caching or prefetching
> will occur by the CPU - different CPUs will do different amounts.)
>
> So, for example, the sequence:
>
> GPU writes to memory
>                         CPU reads from cacheable memory
>
> if the memory was previously dirty (iow, CPU has written), you need to
> flush the dirty cache lines _before_ the GPU writes happen, but you
> don't know whether the CPU has speculatively prefetched, so you need
> to flush any prefetched cache lines before reading from the cacheable
> memory _after_ the GPU has finished writing.
>
> Also note that "flush" there can be "clean the cache", "clean and
> invalidate the cache" or "invalidate the cache" as appropriate - some
> CPUs are able to perform those three operations, and the appropriate
> one depends on not only where in the above sequence it's being used,
> but also on what the operations are.
>
> So, the above sequence could be:
>
>                         CPU invalidates cache for memory
>                                 (due to possible dirty cache lines)
> GPU writes to memory
>                         CPU invalidates cache for memory
>                                 (to get rid of any speculatively prefetched
>                                  lines)
>                         CPU reads from cacheable memory
>
> Yes, in the above case, _two_ cache operations are required to ensure
> correct behaviour.  However, if you know for certain that the memory was
> previously clean, then the first cache operation can be skipped.
>
> What I'm pointing out is there's much more than just "I want to flush
> the cache" here, which is currently what DRM seems to assume at the
> moment with the code in drm_cache.c.
>
> If we can agree a set of interfaces that allows _proper_ use of these
> facilities, one which can be used appropriately, then there shouldn't
> be a problem.  The DMA API does that via its ideas about who owns a
> particular buffer (because of the above problem) and that's something
> which would need to be carried over to such a cache flushing API (it
> should be pretty obvious that having a GPU read or write memory while
> the cache for that memory is being cleaned will lead to unexpected
> results.)
>
> Also note that things get even more interesting in a SMP environment
> if cache operations aren't broadcasted across the SMP cluster (which
> means cache operations have to be IPI'd to other CPUs.)
>
> The next issue, which I've brought up before, is the risk that comes
> with exposing cache flushing to userspace on architectures where it
> isn't already exposed.  As has been shown by Google Project Zero, this
> risks exposing those architectures to Spectre and Meltdown exploits
> where they weren't at such a risk before.  (I've pretty much shown here
> that you _do_ need to control which cache lines get flushed to make
> these exploits work, and flushing the cache by reading lots of data in
> lieu of having the ability to explicitly flush bits of cache makes it
> very difficult to impossible for them to work.)

The above is already what we're implementing in i915, at least
conceptually (it all boils down to clflush instructions because those
both invalidate and flush).

One architectural guarantee we're exploiting is that prefetched (and
hence non-dirty) cachelines will never get written back, but dropped
instead. But we kinda need that, otherwise the cpu could randomly
corrupt the data the gpu is writing and non-coherent would just not
work on those platforms. But aside from that, yes we do an invalidate
before reading, and flushing after every writing (or anything else
that could leave dirty cachelines behind). Plus a bit of tracking in
the driver (kernel/userspace both do this, together, with some
hilariously bad evolved semantics at least for i915, but oh well can't
fix uapi mistakes) to avoid redundant cacheline flushes/invalidates.

So ack.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] noveau vs arm dma ops
  2018-04-26  9:20                                                     ` [Linaro-mm-sig] " Daniel Vetter
@ 2018-04-26  9:24                                                       ` Christoph Hellwig
  2018-04-26  9:39                                                         ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-04-26  9:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Russell King - ARM Linux, Christoph Hellwig,
	Linux Kernel Mailing List, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	iommu, dri-devel, Dan Williams, Thierry Reding, Logan Gunthorpe,
	Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Thu, Apr 26, 2018 at 11:20:44AM +0200, Daniel Vetter wrote:
> The above is already what we're implementing in i915, at least
> conceptually (it all boils down to clflush instructions because those
> both invalidate and flush).

The clwb instruction that just writes back dirty cache lines might
be very interesting for the x86 non-coherent dma case.  A lot of
architectures use their equivalent to prepare for device transfers.

> One architectural guarantee we're exploiting is that prefetched (and
> hence non-dirty) cachelines will never get written back, but dropped
> instead.

And to make this work you'll need exactly this guarantee.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] noveau vs arm dma ops
  2018-04-26  9:24                                                       ` Christoph Hellwig
@ 2018-04-26  9:39                                                         ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2018-04-26  9:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Russell King - ARM Linux, Linux Kernel Mailing List,
	amd-gfx list, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Jerome Glisse, iommu, dri-devel, Dan Williams, Thierry Reding,
	Logan Gunthorpe, Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Thu, Apr 26, 2018 at 11:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Apr 26, 2018 at 11:20:44AM +0200, Daniel Vetter wrote:
>> The above is already what we're implementing in i915, at least
>> conceptually (it all boils down to clflush instructions because those
>> both invalidate and flush).
>
> The clwb instruction that just writes back dirty cache lines might
> be very interesting for the x86 non-coherent dma case.  A lot of
> architectures use their equivalent to prepare for device transfers.

Iirc it didn't help for i915 use-cases much. Either data gets streamed
between cpu and gpu, and then keeping the clean cacheline around
doesn't buy you anything. In other cases we need to flush because the
gpu really wants to use non-snooped transactions (faster/lower
latency/less power required for display because you can shut down the
caches), and then there's also no benefit with keeping the cacheline
around (no one will ever need it again).

I think clwb is more for persistent memory and stuff like that, not so
much for gpus.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: noveau vs arm dma ops
  2018-04-26  9:09                                                     ` Christoph Hellwig
@ 2018-04-26  9:45                                                       ` Daniel Vetter
  2018-04-26 11:12                                                       ` Russell King - ARM Linux
  1 sibling, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2018-04-26  9:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thierry Reding, Christian König,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse,
	dri-devel, Dan Williams, Logan Gunthorpe,
	open list:DMA BUFFER SHARING FRAMEWORK, iommu, Linux ARM

On Thu, Apr 26, 2018 at 11:09 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> > get_required_mask() is supposed to tell you if you are safe.  However
>> > we are missing lots of implementations of it for iommus so you might get
>> > some false negatives, improvements welcome.  It's been on my list of
>> > things to fix in the DMA API, but it is nowhere near the top.
>>
>> It hasn't come up in a while in some fireworks, so I honestly don't
>> remember exactly what the issues have been. But
>>
>> commit d766ef53006c2c38a7fe2bef0904105a793383f2
>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>> Date:   Mon Dec 19 12:43:45 2016 +0000
>>
>>     drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
>>
>> and the various bits of code that a
>>
>> $ git grep SWIOTLB -- drivers/gpu
>>
>> turns up is what we're doing to hack around that stuff. And in general
>> (there's some exceptions) gpus should be able to address everything,
>> so I never fully understood where that's even coming from.
>
> I'm pretty sure I've seen some oddly low dma masks in GPU drivers.  E.g.
> duplicated in various AMD files:
>
>         adev->need_dma32 = false;
>         dma_bits = adev->need_dma32 ? 32 : 40;
>         r = pci_set_dma_mask(adev->pdev, DMA_BIT_MASK(dma_bits));
>         if (r) {
>                 adev->need_dma32 = true;
>                 dma_bits = 32;
>                 dev_warn(adev->dev, "amdgpu: No suitable DMA available.\n");
>         }
>
> synopsys:
>
> drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:    pdevinfo.dma_mask       = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);
>
> etnaviv gets it right:
>
> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:          u32 dma_mask = (u32)dma_get_required_mask(gpu->dev);
>
>
> But yes, the swiotlb hackery really irks me.  I just have some more
> important and bigger fires to fight first, but I plan to get back to the
> root cause of that eventually.
>
>>
>> >> - dma api hides the cache flushing requirements from us. GPUs love
>> >>   non-snooped access, and worse give userspace control over that. We want
>> >>   a strict separation between mapping stuff and flushing stuff. With the
>> >>   IOMMU api we mostly have the former, but for the later arch maintainers
>> >>   regularly tells they won't allow that. So we have drm_clflush.c.
>> >
>> > The problem is that a cache flushing API entirely separate is hard. That
>> > being said if you look at my generic dma-noncoherent API series it tries
>> > to move that way.  So far it is in early stages and apparently rather
>> > buggy unfortunately.
>>
>> I'm assuming this stuff here?
>>
>> https://lkml.org/lkml/2018/4/20/146
>>
>> Anyway got lost in all that work a bit, looks really nice.
>
> That url doesn't seem to work currently.  But I am talking about the
> thread titled '[RFC] common non-cache coherent direct dma mapping ops'
>
>> Yeah the above is pretty much what we do on x86. dma-api believes
>> everything is coherent, so dma_map_sg does the mapping we want and
>> nothing else (minus swiotlb fun). Cache flushing, allocations, all
>> done by the driver.
>
> Which sounds like the right thing to do to me.
>
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> Simple cma_alloc() doesn't do anything about cache handling; it is
> just a very dumb allocator for large contiguous regions inside
> a big pool.
>
> I'm not the CMA maintainer, but in general I'd love to see an
> EXPORT_SYMBOL_GPL slapped onto cma_alloc/release and have drivers use
> that where needed.  Using that plus dma_map*/dma_unmap* sounds like
> a much saner interface than dma_alloc_attrs + DMA_ATTR_NON_CONSISTENT
> or DMA_ATTR_NO_KERNEL_MAPPING.
>
> You don't happen to have a pointer to that previous discussion?

I'll try to dig them up, I tried to stay as far away from that
discussion as possible (since I have the luxury to not care for intel
gpus).

>> Anything that separate these 3 things more (allocation pools, mapping
>> through IOMMUs and flushing cpu caches) sounds like the right
>> direction to me. Even if that throws some portability across platforms
>> away - drivers who want to control things in this much detail aren't
>> really portable (without some serious work) anyway.
>
> As long as we stay away from the dma coherent allocations separating
> them is fine, and I'm working towards it.  dma coherent allocations on
> the other hand are very hairy beasts, and are provided by very different
> implementations depending on the architecture, or often even depending
> on the specifics of individual implementations inside the architecture.
>
> So for your GPU/media case it seems like you'd better stay away from
> them as much as you can.

Hm, at least on x86 we do set up write-combine mappings ourselves (by
calling relevant functions, which also changes the kernel mapping to
avoid aliasing fun). Of course that means the gpu drivers are less
portable, but then they're not really all that portable anyway - the
buffer handling code always needs to be tuned for the platform you're
running on. GPUs really love write-combining everything (we do the
same for our mmio mappings to kernel/userspace, see stuff like
io-mapping.h).

But in many other cases dma_alloc_coherent is only used because it's
the convenient/only way to allocate the memory we need.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: noveau vs arm dma ops
  2018-04-26  9:09                                                     ` Christoph Hellwig
  2018-04-26  9:45                                                       ` Daniel Vetter
@ 2018-04-26 11:12                                                       ` Russell King - ARM Linux
  1 sibling, 0 replies; 83+ messages in thread
From: Russell King - ARM Linux @ 2018-04-26 11:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Linux Kernel Mailing List, amd-gfx list, Jerome Glisse, iommu,
	dri-devel, Dan Williams, Thierry Reding, Logan Gunthorpe,
	Christian König, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

(While there's a rain shower...)

On Thu, Apr 26, 2018 at 02:09:42AM -0700, Christoph Hellwig wrote:
> synopsis:
> 
> drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:    pdevinfo.dma_mask       = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);

This is for the AHB audio driver, and is a correct default on two counts:

1. It is historically usual to initialise DMA masks to 32-bit, and leave
   it to the device drivers to negotiate via the DMA mask functions if
   they wish to use higher orders of bits.

2. The AHB audio hardware in the HDMI block only supports 32-bit addresses.

What I've missed from the AHB audio driver is calling the DMA mask
functions... oops.  Patch below.

> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:              pdevinfo.dma_mask = DMA_BIT_MASK(32);

This is for the I2S audio driver, and I suspect it is wrong - I doubt
that the I2S sub-device itself does any DMA whatsoever.

8<===
From: Russell King <rmk+kernel@armlinux.org.uk>
Subject: drm: bridge: dw-hdmi: Negotiate dma mask with DMA API

DMA drivers are supposed to negotiate the DMA mask with the DMA API,
but this was missing from the AHB audio driver.  Add the necessary
call.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c b/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
index cf3f0caf9c63..16c45b6cd6af 100644
--- a/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
+++ b/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
@@ -539,6 +539,10 @@ static int snd_dw_hdmi_probe(struct platform_device *pdev)
 	unsigned revision;
 	int ret;
 
+	ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
+	if (ret)
+		return ret;
+
 	writeb_relaxed(HDMI_IH_MUTE_AHBDMAAUD_STAT0_ALL,
 		       data->base + HDMI_IH_MUTE_AHBDMAAUD_STAT0);
 	revision = readb_relaxed(data->base + HDMI_REVISION_ID);


-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
  2018-04-25 17:44                                           ` Alex Deucher
  2018-04-25 18:38                                             ` Dan Williams
@ 2018-05-04 12:45                                             ` Lucas Stach
  1 sibling, 0 replies; 83+ messages in thread
From: Lucas Stach @ 2018-05-04 12:45 UTC (permalink / raw)
  To: Alex Deucher, Christoph Hellwig
  Cc: Linux Kernel Mailing List, dri-devel,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Jerome Glisse,
	amd-gfx list, Dan Williams, Logan Gunthorpe,
	Christian König, open list:DMA BUFFER SHARING FRAMEWORK

Am Mittwoch, den 25.04.2018, 13:44 -0400 schrieb Alex Deucher:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig <hch@infradead.org
> > wrote:
> > On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
> > > > It has a non-coherent transaction mode (which the chipset can opt to
> > > > not implement and still flush), to make sure the AGP horror show
> > > > doesn't happen again and GPU folks are happy with PCIe. That's at
> > > > least my understanding from digging around in amd the last time we had
> > > > coherency issues between intel and amd gpus. GPUs have some bits
> > > > somewhere (in the pagetables, or in the buffer object description
> > > > table created by userspace) to control that stuff.
> > > 
> > > Right.  We have a bit in the GPU page table entries that determines
> > > whether we snoop the CPU's cache or not.
> > 
> > I can see how that works with the GPU on the same SOC or SOC set as the
> > CPU.  But how is that going to work for a GPU that is a plain old PCIe
> > card?  The cache snooping in that case is happening in the PCIe root
> > complex.
> 
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Using non-snooped transactions gives lower latency and faster
> throughput compared to snooped transactions.

PCI-X (as in the thing which was all the rage before PCIe) added an
attribute phase to each transaction where the requestor can enable
relaxed ordering and/or no-snoop on a per-transaction basis. As those are
strictly performance optimizations the root-complex isn't required to
honor those attributes, but implementations that care about performance
usually will.

Regards,
Lucas


end of thread, other threads:[~2018-05-04 12:45 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-25 10:59 [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christian König
2018-03-25 10:59 ` [PATCH 2/8] PCI: Add pci_find_common_upstream_dev() Christian König
2018-03-28 12:38   ` Christoph Hellwig
2018-03-28 15:07     ` Christian König
2018-03-28 15:47       ` Logan Gunthorpe
2018-03-28 16:02         ` Christian König
2018-03-28 16:25           ` Logan Gunthorpe
2018-03-28 18:28             ` Christian König
2018-03-28 18:57               ` Logan Gunthorpe
2018-03-28 19:44                 ` Christian König
2018-03-28 19:53                   ` Logan Gunthorpe
2018-03-29 11:44                     ` Christian König
2018-03-29 15:45                       ` Logan Gunthorpe
2018-03-29 16:10                         ` Christian König
2018-03-29 16:25                           ` Logan Gunthorpe
2018-03-29 18:15                             ` Christian König
2018-03-30  1:58                             ` Jerome Glisse
2018-03-30  6:33                               ` Christoph Hellwig
2018-03-30 15:25                                 ` Jerome Glisse
2018-03-30 18:46                               ` Logan Gunthorpe
2018-03-30 19:45                                 ` Jerome Glisse
2018-04-02 17:02                                   ` Logan Gunthorpe
2018-04-02 17:20                                     ` Jerome Glisse
2018-04-02 17:37                                       ` Logan Gunthorpe
2018-04-02 19:16                                         ` Jerome Glisse
2018-04-02 19:32                                           ` Logan Gunthorpe
2018-04-02 19:45                                             ` Jerome Glisse
     [not found]                     ` <CADnq5_P-z=Noos_jaME9_CERri3C-m2hPPvx2bArr36O=1FnrA@mail.gmail.com>
2018-03-29 14:37                       ` Alex Deucher
2018-03-25 10:59 ` [PATCH 3/8] PCI: Add pci_peer_traffic_supported() Christian König
2018-03-25 10:59 ` [PATCH 4/8] dma-buf: add peer2peer flag Christian König
2018-03-29  6:57   ` Daniel Vetter
2018-03-29 11:34     ` Christian König
2018-04-03  9:09       ` Daniel Vetter
2018-04-03 17:06         ` Jerome Glisse
2018-04-03 18:08           ` Daniel Vetter
2018-04-16 12:39             ` Christoph Hellwig
2018-04-16 13:38               ` Daniel Vetter
2018-04-19  8:16                 ` Christoph Hellwig
2018-04-20  7:13                   ` Daniel Vetter
2018-04-20  8:58                     ` Christian König
2018-04-20 10:17                       ` Christoph Hellwig
2018-04-20 10:44                         ` Christian König
2018-04-20 12:46                           ` Christoph Hellwig
2018-04-20 15:21                             ` [Linaro-mm-sig] " Daniel Vetter
2018-04-24 18:48                               ` Christoph Hellwig
2018-04-24 19:32                                 ` Daniel Vetter
2018-04-25  5:48                                   ` Christoph Hellwig
2018-04-25  6:10                                     ` Alex Deucher
2018-04-25  6:13                                     ` Daniel Vetter
2018-04-25  6:23                                       ` Daniel Vetter
2018-04-25  6:43                                         ` Christoph Hellwig
2018-04-25  7:02                                           ` Daniel Vetter
2018-04-25  7:09                                             ` Christoph Hellwig
2018-04-25  7:30                                               ` Daniel Vetter
2018-04-25  7:56                                                 ` Thierry Reding
2018-04-25  8:55                                                   ` Christoph Hellwig
2018-04-25  7:43                                               ` Thierry Reding
2018-04-25  7:41                                           ` Thierry Reding
2018-04-25  8:54                                             ` noveau vs arm dma ops Christoph Hellwig
2018-04-25  9:25                                               ` Russell King - ARM Linux
2018-04-25 10:04                                               ` Daniel Vetter
2018-04-25 15:33                                                 ` Christoph Hellwig
2018-04-25 21:35                                                   ` Daniel Vetter
2018-04-25 23:26                                                     ` Russell King - ARM Linux
2018-04-26  9:17                                                       ` Daniel Vetter
2018-04-26  9:09                                                     ` Christoph Hellwig
2018-04-26  9:45                                                       ` Daniel Vetter
2018-04-26 11:12                                                       ` Russell King - ARM Linux
2018-04-25 22:54                                                   ` Russell King - ARM Linux
2018-04-26  9:13                                                     ` Christoph Hellwig
2018-04-26  9:20                                                     ` [Linaro-mm-sig] " Daniel Vetter
2018-04-26  9:24                                                       ` Christoph Hellwig
2018-04-26  9:39                                                         ` Daniel Vetter
2018-04-25  6:24                                       ` [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag Alex Deucher
2018-04-25  6:41                                         ` Christoph Hellwig
2018-04-25 17:44                                           ` Alex Deucher
2018-04-25 18:38                                             ` Dan Williams
2018-05-04 12:45                                             ` Lucas Stach
2018-03-25 10:59 ` [PATCH 5/8] drm/amdgpu: print DMA-buf status in debugfs Christian König
2018-03-25 10:59 ` [PATCH 6/8] drm/amdgpu: note that we can handle peer2peer DMA-buf Christian König
2018-03-25 10:59 ` [PATCH 7/8] drm/amdgpu: add amdgpu_gem_attach Christian König
2018-03-25 11:00 ` [PATCH 8/8] drm/amdgpu: add support for exporting VRAM using DMA-buf Christian König
2018-03-28 12:37 ` [PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper Christoph Hellwig
