linux-arm-kernel.lists.infradead.org archive mirror
* [RFC/PATCH 0/7] Add MSM SMMUv1 support
@ 2014-06-30 16:51 Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 1/7] iommu: msm: Rename iommu driver files Olav Haugan
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

These patches add support for Qualcomm MSM SMMUv1 hardware. The first patch
renames the files of the existing MSM IOMMU driver to align with the SMMU
hardware revision (v1 follows the ARM SMMUv1 spec). The second patch adds back
the map_range/unmap_range APIs. These APIs allow SMMU driver implementations
to optimize the mapping of scatter-gather lists of physically contiguous
chunks of memory. The third patch adds common macros that allow device drivers
to poll memory-mapped registers. The fourth and fifth patches contain the
actual MSM SMMUv1 driver, which supports the following:

	- ARM V7S and V7L page table format independent of ARM CPU page table
	  format
	- 4K/64K/1M/16M mappings (V7S)
	- 4K/64K/2M/32M/1G mappings (V7L)
	- ATOS used for unit testing of driver
	- Sharing of page tables among SMMUs
	- Verbose context bank fault reporting
	- Verbose global fault reporting
	- Support for clocks and GDSC
	- map/unmap range
	- Domain specific enabling of coherent Hardware Table Walk (HTW)

The last patch adds a new IOMMU domain attribute that lets us control whether
hardware table walks are cache coherent or not.

Matt Wagantall (1):
  iopoll: Introduce memory-mapped IO polling macros

Olav Haugan (6):
  iommu: msm: Rename iommu driver files
  iommu-api: Add map_range/unmap_range functions
  iommu: msm: Add MSM IOMMUv1 driver
  iommu: msm: Add support for V7L page table format
  defconfig: msm: Enable Qualcomm SMMUv1 driver
  iommu-api: Add domain attribute to enable coherent HTW

 .../devicetree/bindings/iommu/msm,iommu_v1.txt     |   60 +
 arch/arm/configs/qcom_defconfig                    |    3 +-
 drivers/iommu/Kconfig                              |   57 +-
 drivers/iommu/Makefile                             |    8 +-
 drivers/iommu/iommu.c                              |   24 +
 drivers/iommu/{msm_iommu.c => msm_iommu-v0.c}      |    2 +-
 drivers/iommu/msm_iommu-v1.c                       | 1529 +++++++++++++
 drivers/iommu/msm_iommu.c                          |  771 +------
 .../iommu/{msm_iommu_dev.c => msm_iommu_dev-v0.c}  |    2 +-
 drivers/iommu/msm_iommu_dev-v1.c                   |  345 +++
 .../{msm_iommu_hw-8xxx.h => msm_iommu_hw-v0.h}     |    0
 drivers/iommu/msm_iommu_hw-v1.h                    | 2322 ++++++++++++++++++++
 drivers/iommu/msm_iommu_pagetable.c                |  600 +++++
 drivers/iommu/msm_iommu_pagetable.h                |   33 +
 drivers/iommu/msm_iommu_pagetable_lpae.c           |  717 ++++++
 drivers/iommu/msm_iommu_priv.h                     |   65 +
 include/linux/iommu.h                              |   25 +
 include/linux/iopoll.h                             |  114 +
 include/linux/qcom_iommu.h                         |  221 ++
 19 files changed, 6236 insertions(+), 662 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/iommu/msm,iommu_v1.txt
 copy drivers/iommu/{msm_iommu.c => msm_iommu-v0.c} (99%)
 create mode 100644 drivers/iommu/msm_iommu-v1.c
 rename drivers/iommu/{msm_iommu_dev.c => msm_iommu_dev-v0.c} (99%)
 create mode 100644 drivers/iommu/msm_iommu_dev-v1.c
 rename drivers/iommu/{msm_iommu_hw-8xxx.h => msm_iommu_hw-v0.h} (100%)
 create mode 100644 drivers/iommu/msm_iommu_hw-v1.h
 create mode 100644 drivers/iommu/msm_iommu_pagetable.c
 create mode 100644 drivers/iommu/msm_iommu_pagetable.h
 create mode 100644 drivers/iommu/msm_iommu_pagetable_lpae.c
 create mode 100644 drivers/iommu/msm_iommu_priv.h
 create mode 100644 include/linux/iopoll.h
 create mode 100644 include/linux/qcom_iommu.h

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [RFC/PATCH 1/7] iommu: msm: Rename iommu driver files
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

Rename the MSM IOMMU driver files for the MSM8960 SoC with a "-v0" suffix to
align with the hardware version numbering used by the next-generation MSM
IOMMU (v1).

Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
---
 arch/arm/configs/qcom_defconfig                          |  2 +-
 drivers/iommu/Kconfig                                    | 11 +++++++++--
 drivers/iommu/Makefile                                   |  2 +-
 drivers/iommu/{msm_iommu.c => msm_iommu-v0.c}            |  2 +-
 drivers/iommu/{msm_iommu_dev.c => msm_iommu_dev-v0.c}    |  2 +-
 drivers/iommu/{msm_iommu_hw-8xxx.h => msm_iommu_hw-v0.h} |  0
 6 files changed, 13 insertions(+), 6 deletions(-)
 rename drivers/iommu/{msm_iommu.c => msm_iommu-v0.c} (99%)
 rename drivers/iommu/{msm_iommu_dev.c => msm_iommu_dev-v0.c} (99%)
 rename drivers/iommu/{msm_iommu_hw-8xxx.h => msm_iommu_hw-v0.h} (100%)

diff --git a/arch/arm/configs/qcom_defconfig b/arch/arm/configs/qcom_defconfig
index 42ebd72..0414889 100644
--- a/arch/arm/configs/qcom_defconfig
+++ b/arch/arm/configs/qcom_defconfig
@@ -136,7 +136,7 @@ CONFIG_COMMON_CLK_QCOM=y
 CONFIG_MSM_GCC_8660=y
 CONFIG_MSM_MMCC_8960=y
 CONFIG_MSM_MMCC_8974=y
-CONFIG_MSM_IOMMU=y
+CONFIG_MSM_IOMMU_V0=y
 CONFIG_GENERIC_PHY=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT2_FS_XATTR=y
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index d260605..705a257 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -28,12 +28,19 @@ config FSL_PAMU
 	  transaction types.
 
 # MSM IOMMU support
+
+# MSM_IOMMU always gets selected by whoever wants it.
 config MSM_IOMMU
-	bool "MSM IOMMU Support"
+	bool
+
+# MSM IOMMUv0 support
+config MSM_IOMMU_V0
+	bool "MSM IOMMUv0 Support"
 	depends on ARCH_MSM8X60 || ARCH_MSM8960
 	select IOMMU_API
+	select MSM_IOMMU
 	help
-	  Support for the IOMMUs found on certain Qualcomm SOCs.
+	  Support for the IOMMUs (v0) found on certain Qualcomm SOCs.
 	  These IOMMUs allow virtualization of the address space used by most
 	  cores within the multimedia subsystem.
 
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 8893bad..894ced9 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,7 +1,7 @@
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
-obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
+obj-$(CONFIG_MSM_IOMMU_V0) += msm_iommu-v0.o msm_iommu_dev-v0.o
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
 obj-$(CONFIG_AMD_IOMMU_V2) += amd_iommu_v2.o
 obj-$(CONFIG_ARM_SMMU) += arm-smmu.o
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu-v0.c
similarity index 99%
rename from drivers/iommu/msm_iommu.c
rename to drivers/iommu/msm_iommu-v0.c
index f5ff657..17731061 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu-v0.c
@@ -31,7 +31,7 @@
 #include <asm/cacheflush.h>
 #include <asm/sizes.h>
 
-#include "msm_iommu_hw-8xxx.h"
+#include "msm_iommu_hw-v0.h"
 #include "msm_iommu.h"
 
 #define MRC(reg, processor, op1, crn, crm, op2)				\
diff --git a/drivers/iommu/msm_iommu_dev.c b/drivers/iommu/msm_iommu_dev-v0.c
similarity index 99%
rename from drivers/iommu/msm_iommu_dev.c
rename to drivers/iommu/msm_iommu_dev-v0.c
index 61def7cb..2f86e46 100644
--- a/drivers/iommu/msm_iommu_dev.c
+++ b/drivers/iommu/msm_iommu_dev-v0.c
@@ -27,7 +27,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 
-#include "msm_iommu_hw-8xxx.h"
+#include "msm_iommu_hw-v0.h"
 #include "msm_iommu.h"
 
 struct iommu_ctx_iter_data {
diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-v0.h
similarity index 100%
rename from drivers/iommu/msm_iommu_hw-8xxx.h
rename to drivers/iommu/msm_iommu_hw-v0.h
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 1/7] iommu: msm: Rename iommu driver files Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-06-30 19:42   ` Thierry Reding
                     ` (3 more replies)
  2014-06-30 16:51 ` [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros Olav Haugan
                   ` (4 subsequent siblings)
  6 siblings, 4 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

Mapping and unmapping are more often than not in the critical path.
map_range and unmap_range allow SMMU driver implementations to optimize
the process of mapping and unmapping buffers into the SMMU page tables.
Instead of mapping one physical chunk, doing a TLB operation (expensive),
mapping the next chunk, doing another TLB operation, and so on, the
driver can map a scatter-gather list of physically contiguous chunks
into one virtually contiguous region and then do a single TLB operation
at the end.

Additionally, the mapping operation is faster in general since clients
do not have to keep calling the map API over and over again for each
physically contiguous chunk of memory that needs to be mapped to a
virtually contiguous region.
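
As an illustration of the intended usage (this sketch is not part of the
patch; the buffer and its scatterlist are hypothetical), a client could map
and unmap a whole scatter-gather list with a single call each:

/*
 * Hypothetical usage sketch (not part of this patch): map a buffer
 * described by a scatterlist with one call instead of looping over
 * iommu_map() for every physically contiguous chunk.
 */
static int example_map_buffer(struct iommu_domain *domain, unsigned int iova,
			      struct scatterlist *sg, unsigned int len)
{
	int ret;

	/* iova and len are assumed to be page aligned */
	ret = iommu_map_range(domain, iova, sg, len,
			      IOMMU_READ | IOMMU_WRITE);
	if (ret)
		return ret;

	/* ... device uses the mapping here ... */

	return iommu_unmap_range(domain, iova, len);
}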

Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
---
 drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
 include/linux/iommu.h | 24 ++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e5555fc..f2a6b80 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
 
+int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
+		    struct scatterlist *sg, unsigned int len, int prot)
+{
+	if (unlikely(domain->ops->map_range == NULL))
+		return -ENODEV;
+
+	BUG_ON(iova & (~PAGE_MASK));
+
+	return domain->ops->map_range(domain, iova, sg, len, prot);
+}
+EXPORT_SYMBOL_GPL(iommu_map_range);
+
+int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
+		      unsigned int len)
+{
+	if (unlikely(domain->ops->unmap_range == NULL))
+		return -ENODEV;
+
+	BUG_ON(iova & (~PAGE_MASK));
+
+	return domain->ops->unmap_range(domain, iova, len);
+}
+EXPORT_SYMBOL_GPL(iommu_unmap_range);
+
 int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
 			       phys_addr_t paddr, u64 size, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b96a5b2..63dca6d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -22,6 +22,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/types.h>
+#include <linux/scatterlist.h>
 #include <trace/events/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
@@ -93,6 +94,8 @@ enum iommu_attr {
  * @detach_dev: detach device from an iommu domain
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
+ * @map_range: map a scatter-gather list of physically contiguous memory chunks to an iommu domain
+ * @unmap_range: unmap a scatter-gather list of physically contiguous memory chunks from an iommu domain
  * @iova_to_phys: translate iova to physical address
  * @domain_has_cap: domain capabilities query
  * @add_device: add device to iommu grouping
@@ -110,6 +113,10 @@ struct iommu_ops {
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
 		     size_t size);
+	int (*map_range)(struct iommu_domain *domain, unsigned int iova,
+		    struct scatterlist *sg, unsigned int len, int prot);
+	int (*unmap_range)(struct iommu_domain *domain, unsigned int iova,
+		      unsigned int len);
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
 	int (*domain_has_cap)(struct iommu_domain *domain,
 			      unsigned long cap);
@@ -153,6 +160,10 @@ extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 		       size_t size);
+extern int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
+		    struct scatterlist *sg, unsigned int len, int prot);
+extern int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
+		      unsigned int len);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova);
 extern int iommu_domain_has_cap(struct iommu_domain *domain,
 				unsigned long cap);
@@ -280,6 +291,19 @@ static inline int iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	return -ENODEV;
 }
 
+static inline int iommu_map_range(struct iommu_domain *domain,
+				  unsigned int iova, struct scatterlist *sg,
+				  unsigned int len, int prot)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unmap_range(struct iommu_domain *domain,
+				    unsigned int iova, unsigned int len)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_domain_window_enable(struct iommu_domain *domain,
 					     u32 wnd_nr, phys_addr_t paddr,
 					     u64 size, int prot)
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 1/7] iommu: msm: Rename iommu driver files Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-06-30 19:46   ` Thierry Reding
  2014-07-01  9:40   ` Will Deacon
  2014-06-30 16:51 ` [RFC/PATCH 5/7] iommu: msm: Add support for V7L page table format Olav Haugan
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

From: Matt Wagantall <mattw@codeaurora.org>

It is sometimes necessary to poll a memory-mapped register until its
value satisfies some condition. Introduce a family of convenience macros
that do this. Tight-loop and sleeping versions are provided with and
without timeouts.
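
For illustration (not part of this patch; the register offset and status bit
are placeholders), waiting for a busy bit to clear could look like:

/*
 * Hypothetical usage sketch: poll a made-up STATUS register until its
 * BUSY bit clears, sleeping up to 100 us between reads and giving up
 * after 10 ms. EXAMPLE_STATUS and EXAMPLE_STATUS_BUSY are placeholders.
 */
static int example_wait_for_idle(void __iomem *base)
{
	u32 status;

	return readl_poll_timeout(base + EXAMPLE_STATUS, status,
				  !(status & EXAMPLE_STATUS_BUSY),
				  100, 10000);
}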

Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
---
 include/linux/iopoll.h | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 include/linux/iopoll.h

diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
new file mode 100644
index 0000000..d085e03
--- /dev/null
+++ b/include/linux/iopoll.h
@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2012-2014 The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _LINUX_IOPOLL_H
+#define _LINUX_IOPOLL_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/hrtimer.h>
+#include <linux/delay.h>
+#include <asm-generic/errno.h>
+#include <asm/io.h>
+
+/**
+ * readl_poll_timeout - Periodically poll an address until a condition is met or a timeout occurs
+ * @addr: Address to poll
+ * @val: Variable to read the value into
+ * @cond: Break condition (usually involving @val)
+ * @sleep_us: Maximum time to sleep between reads in uS (0 tight-loops)
+ * @timeout_us: Timeout in uS, 0 means never timeout
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context if sleep_us or timeout_us are used.
+ */
+#define readl_poll_timeout(addr, val, cond, sleep_us, timeout_us) \
+({ \
+	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
+	might_sleep_if(timeout_us); \
+	for (;;) { \
+		(val) = readl(addr); \
+		if (cond) \
+			break; \
+		if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
+			(val) = readl(addr); \
+			break; \
+		} \
+		if (sleep_us) \
+			usleep_range(DIV_ROUND_UP(sleep_us, 4), sleep_us); \
+	} \
+	(cond) ? 0 : -ETIMEDOUT; \
+})
+
+/**
+ * readl_poll_timeout_noirq - Periodically poll an address until a condition is met or a timeout occurs
+ * @addr: Address to poll
+ * @val: Variable to read the value into
+ * @cond: Break condition (usually involving @val)
+ * @max_reads: Maximum number of reads before giving up
+ * @time_between_us: Time to udelay() between successive reads
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout.
+ */
+#define readl_poll_timeout_noirq(addr, val, cond, max_reads, time_between_us) \
+({ \
+	int count; \
+	for (count = (max_reads); count > 0; count--) { \
+		(val) = readl(addr); \
+		if (cond) \
+			break; \
+		udelay(time_between_us); \
+	} \
+	(cond) ? 0 : -ETIMEDOUT; \
+})
+
+/**
+ * readl_poll - Periodically poll an address until a condition is met
+ * @addr: Address to poll
+ * @val: Variable to read the value into
+ * @cond: Break condition (usually involving @val)
+ * @sleep_us: Maximum time to sleep between reads in uS (0 tight-loops)
+ *
+ * Must not be called from atomic context if sleep_us is used.
+ */
+#define readl_poll(addr, val, cond, sleep_us) \
+	readl_poll_timeout(addr, val, cond, sleep_us, 0)
+
+/**
+ * readl_tight_poll_timeout - Tight-loop on an address until a condition is met or a timeout occurs
+ * @addr: Address to poll
+ * @val: Variable to read the value into
+ * @cond: Break condition (usually involving @val)
+ * @timeout_us: Timeout in uS, 0 means never timeout
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context if timeout_us is used.
+ */
+#define readl_tight_poll_timeout(addr, val, cond, timeout_us) \
+	readl_poll_timeout(addr, val, cond, 0, timeout_us)
+
+/**
+ * readl_tight_poll - Tight-loop on an address until a condition is met
+ * @addr: Address to poll
+ * @val: Variable to read the value into
+ * @cond: Break condition (usually involving @val)
+ *
+ * May be called from atomic context.
+ */
+#define readl_tight_poll(addr, val, cond) \
+	readl_poll_timeout(addr, val, cond, 0, 0)
+
+#endif /* _LINUX_IOPOLL_H */
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [RFC/PATCH 5/7] iommu: msm: Add support for V7L page table format
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
                   ` (2 preceding siblings ...)
  2014-06-30 16:51 ` [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 6/7] defconfig: msm: Enable Qualcomm SMMUv1 driver Olav Haugan
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

Add support for the VMSA long-descriptor page table format (V7L), providing
the following features (an illustrative address-split sketch follows the
list):

    - ARM V7L page table format independent of ARM CPU page table format
    - 4K/64K/2M/32M/1G mappings (V7L)
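
For reference, a sketch of how a 32-bit IOVA is split for the three-level
V7L walk. This mirrors the FL/SL/TL_OFFSET macros added in
msm_iommu_pagetable_lpae.c; the helper itself is only an illustration:

/*
 * Illustration only: decompose an IOVA the same way the new LPAE
 * page table code does.
 *
 *   bits 31:30 -> first-level index  (4 entries, 1GB each)
 *   bits 29:21 -> second-level index (512 entries, 2MB each)
 *   bits 20:12 -> third-level index  (512 entries, 4KB each)
 */
static inline void example_split_iova(u32 iova)
{
	u32 fl = (iova & 0xC0000000) >> 30;
	u32 sl = (iova & 0x3FE00000) >> 21;
	u32 tl = (iova & 0x1FF000) >> 12;

	pr_info("iova 0x%08x -> fl %u sl %u tl %u\n", iova, fl, sl, tl);
}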

Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
---
 .../devicetree/bindings/iommu/msm,iommu_v1.txt     |   4 +
 drivers/iommu/Kconfig                              |  10 +
 drivers/iommu/Makefile                             |   4 +
 drivers/iommu/msm_iommu-v1.c                       |  65 ++
 drivers/iommu/msm_iommu.c                          |  47 ++
 drivers/iommu/msm_iommu_dev-v1.c                   |   5 +
 drivers/iommu/msm_iommu_hw-v1.h                    |  86 +++
 drivers/iommu/msm_iommu_pagetable_lpae.c           | 717 +++++++++++++++++++++
 drivers/iommu/msm_iommu_priv.h                     |  12 +-
 9 files changed, 949 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/msm_iommu_pagetable_lpae.c

diff --git a/Documentation/devicetree/bindings/iommu/msm,iommu_v1.txt b/Documentation/devicetree/bindings/iommu/msm,iommu_v1.txt
index 412ed44..c0a8f6c 100644
--- a/Documentation/devicetree/bindings/iommu/msm,iommu_v1.txt
+++ b/Documentation/devicetree/bindings/iommu/msm,iommu_v1.txt
@@ -38,6 +38,10 @@ Optional properties:
   qcom,iommu-bfb-regs property. If this property is present, the
   qcom,iommu-bfb-regs property shall also be present, and the lengths of both
   properties shall be the same.
+- qcom,iommu-lpae-bfb-regs : See description for qcom,iommu-bfb-regs. This is
+  the same property except this is for IOMMU with LPAE enabled.
+- qcom,iommu-lpae-bfb-data : See description for qcom,iommu-bfb-data. This is
+  the same property except this is for IOMMU with LPAE enabled.
 
 Example:
 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e972127..9053908 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -63,6 +63,16 @@ config MSM_IOMMU_V1
 
 	  If unsure, say N here.
 
+config MSM_IOMMU_LPAE
+	bool "Enable support for LPAE in IOMMU"
+	depends on MSM_IOMMU
+	help
+	  Enables Large Physical Address Extension (LPAE) for IOMMU. This allows
+	  clients of IOMMU to access physical addresses that are greater than
+	  32 bits.
+
+	  If unsure, say N here.
+
 config MSM_IOMMU_VBIF_CHECK
 	bool "Enable support for VBIF check when IOMMU gets stuck"
 	depends on MSM_IOMMU
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 1f98fcc..debb251 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -3,7 +3,11 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU_V0) += msm_iommu-v0.o msm_iommu_dev-v0.o
 obj-$(CONFIG_MSM_IOMMU_V1) += msm_iommu-v1.o msm_iommu_dev-v1.o msm_iommu.o
+ifdef CONFIG_MSM_IOMMU_LPAE
+obj-$(CONFIG_MSM_IOMMU_V1) += msm_iommu_pagetable_lpae.o
+else
 obj-$(CONFIG_MSM_IOMMU_V1) += msm_iommu_pagetable.o
+endif
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
 obj-$(CONFIG_AMD_IOMMU_V2) += amd_iommu_v2.o
 obj-$(CONFIG_ARM_SMMU) += arm-smmu.o
diff --git a/drivers/iommu/msm_iommu-v1.c b/drivers/iommu/msm_iommu-v1.c
index 046c3cf..2c574ef 100644
--- a/drivers/iommu/msm_iommu-v1.c
+++ b/drivers/iommu/msm_iommu-v1.c
@@ -35,8 +35,13 @@
 #include "msm_iommu_priv.h"
 #include "msm_iommu_pagetable.h"
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+/* bitmap of the page sizes currently supported */
+#define MSM_IOMMU_PGSIZES	(SZ_4K | SZ_64K | SZ_2M | SZ_32M | SZ_1G)
+#else
 /* bitmap of the page sizes currently supported */
 #define MSM_IOMMU_PGSIZES	(SZ_4K | SZ_64K | SZ_1M | SZ_16M)
+#endif
 
 #define IOMMU_MSEC_STEP		10
 #define IOMMU_MSEC_TIMEOUT	5000
@@ -461,11 +466,19 @@ static void __release_SMT(u32 cb_num, void __iomem *base)
 	}
 }
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+static void msm_iommu_set_ASID(void __iomem *base, unsigned int ctx_num,
+			       unsigned int asid)
+{
+	SET_CB_TTBR0_ASID(base, ctx_num, asid);
+}
+#else
 static void msm_iommu_set_ASID(void __iomem *base, unsigned int ctx_num,
 			       unsigned int asid)
 {
 	SET_CB_CONTEXTIDR_ASID(base, ctx_num, asid);
 }
+#endif
 
 static void msm_iommu_assign_ASID(const struct msm_iommu_drvdata *iommu_drvdata,
 				  struct msm_iommu_master *master,
@@ -503,6 +516,38 @@ static void msm_iommu_assign_ASID(const struct msm_iommu_drvdata *iommu_drvdata,
 	msm_iommu_set_ASID(cb_base, master->cb_num, master->asid);
 }
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+static void msm_iommu_setup_ctx(void __iomem *base, unsigned int ctx)
+{
+	SET_CB_TTBCR_EAE(base, ctx, 1); /* Extended Address Enable (EAE) */
+}
+
+static void msm_iommu_setup_memory_remap(void __iomem *base, unsigned int ctx)
+{
+	SET_CB_MAIR0(base, ctx, msm_iommu_get_mair0());
+	SET_CB_MAIR1(base, ctx, msm_iommu_get_mair1());
+}
+
+static void msm_iommu_setup_pg_l2_redirect(void __iomem *base, unsigned int ctx)
+{
+	/*
+	 * Configure page tables as inner-cacheable and shareable to reduce
+	 * the TLB miss penalty.
+	 */
+	SET_CB_TTBCR_SH0(base, ctx, 3); /* Inner shareable */
+	SET_CB_TTBCR_ORGN0(base, ctx, 1); /* outer cacheable */
+	SET_CB_TTBCR_IRGN0(base, ctx, 1); /* inner cacheable */
+	SET_CB_TTBCR_T0SZ(base, ctx, 0); /* 0GB-4GB */
+
+
+	SET_CB_TTBCR_SH1(base, ctx, 3); /* Inner shareable */
+	SET_CB_TTBCR_ORGN1(base, ctx, 1); /* outer cacheable */
+	SET_CB_TTBCR_IRGN1(base, ctx, 1); /* inner cacheable */
+	SET_CB_TTBCR_T1SZ(base, ctx, 0); /* TTBR1 not used */
+}
+
+#else
+
 static void msm_iommu_setup_ctx(void __iomem *base, unsigned int ctx)
 {
 	/* Turn on TEX Remap */
@@ -527,6 +572,8 @@ static void msm_iommu_setup_pg_l2_redirect(void __iomem *base, unsigned int ctx)
 	SET_CB_TTBR0_RGN(base, ctx, 1);   /* WB, WA */
 }
 
+#endif
+
 static int program_SMT(struct msm_iommu_master *master, void __iomem *base)
 {
 	u32 *sids = master->sids;
@@ -915,6 +962,15 @@ static int msm_iommu_unmap_range(struct iommu_domain *domain, unsigned int va,
 	return 0;
 }
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+static phys_addr_t msm_iommu_get_phy_from_PAR(unsigned long va, u64 par)
+{
+	phys_addr_t phy;
+	/* Upper 28 bits from PAR, lower 12 from VA */
+	phy = (par & 0xFFFFFFF000ULL) | (va & 0x00000FFF);
+	return phy;
+}
+#else
 static phys_addr_t msm_iommu_get_phy_from_PAR(unsigned long va, u64 par)
 {
 	phys_addr_t phy;
@@ -927,6 +983,7 @@ static phys_addr_t msm_iommu_get_phy_from_PAR(unsigned long va, u64 par)
 
 	return phy;
 }
+#endif
 
 static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
 					  phys_addr_t va)
@@ -1013,11 +1070,19 @@ static int msm_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+static inline void print_ctx_mem_attr_regs(struct msm_iommu_context_reg regs[])
+{
+	pr_err("MAIR0   = %08x    MAIR1   = %08x\n",
+		 regs[DUMP_REG_MAIR0].val, regs[DUMP_REG_MAIR1].val);
+}
+#else
 static inline void print_ctx_mem_attr_regs(struct msm_iommu_context_reg regs[])
 {
 	pr_err("PRRR   = %08x    NMRR   = %08x\n",
 		 regs[DUMP_REG_PRRR].val, regs[DUMP_REG_NMRR].val);
 }
+#endif
 
 void print_ctx_regs(struct msm_iommu_context_reg regs[])
 {
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 5c7981e..34fe73a 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -105,7 +105,49 @@ struct msm_iommu_master *msm_iommu_find_master(struct msm_iommu_drvdata *drv,
 }
 
 #ifdef CONFIG_ARM
+#ifdef CONFIG_MSM_IOMMU_LPAE
 #ifdef CONFIG_ARM_LPAE
+/*
+ * If CONFIG_ARM_LPAE AND CONFIG_MSM_IOMMU_LPAE are enabled we can use the MAIR
+ * register directly
+ */
+u32 msm_iommu_get_mair0(void)
+{
+	unsigned int mair0;
+
+	RCP15_MAIR0(mair0);
+	return mair0;
+}
+
+u32 msm_iommu_get_mair1(void)
+{
+	unsigned int mair1;
+
+	RCP15_MAIR1(mair1);
+	return mair1;
+}
+#else
+/*
+ * However, if CONFIG_ARM_LPAE is not enabled but CONFIG_MSM_IOMMU_LPAE is
+ * enabled we'll just use the hard-coded values directly.
+ */
+u32 msm_iommu_get_mair0(void)
+{
+	return MAIR0_VALUE;
+}
+
+u32 msm_iommu_get_mair1(void)
+{
+	return MAIR1_VALUE;
+}
+#endif
+
+#else
+#ifdef CONFIG_ARM_LPAE
+/*
+ * If CONFIG_ARM_LPAE is enabled AND CONFIG_MSM_IOMMU_LPAE is disabled
+ * we must use the hardcoded values.
+ */
 u32 msm_iommu_get_prrr(void)
 {
 	return PRRR_VALUE;
@@ -116,6 +158,10 @@ u32 msm_iommu_get_nmrr(void)
 	return NMRR_VALUE;
 }
 #else
+/*
+ * If both CONFIG_ARM_LPAE AND CONFIG_MSM_IOMMU_LPAE are disabled
+ * we can use the registers directly.
+ */
 #define RCP15_PRRR(reg)		MRC(reg, p15, 0, c10, c2, 0)
 #define RCP15_NMRR(reg)		MRC(reg, p15, 0, c10, c2, 1)
 
@@ -136,6 +182,7 @@ u32 msm_iommu_get_nmrr(void)
 }
 #endif
 #endif
+#endif
 #ifdef CONFIG_ARM64
 u32 msm_iommu_get_prrr(void)
 {
diff --git a/drivers/iommu/msm_iommu_dev-v1.c b/drivers/iommu/msm_iommu_dev-v1.c
index c1fa732..30f6b07 100644
--- a/drivers/iommu/msm_iommu_dev-v1.c
+++ b/drivers/iommu/msm_iommu_dev-v1.c
@@ -28,8 +28,13 @@
 
 #include "msm_iommu_hw-v1.h"
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+static const char *BFB_REG_NODE_NAME = "qcom,iommu-lpae-bfb-regs";
+static const char *BFB_DATA_NODE_NAME = "qcom,iommu-lpae-bfb-data";
+#else
 static const char *BFB_REG_NODE_NAME = "qcom,iommu-bfb-regs";
 static const char *BFB_DATA_NODE_NAME = "qcom,iommu-bfb-data";
+#endif
 
 static int msm_iommu_parse_bfb_settings(struct platform_device *pdev,
 				    struct msm_iommu_drvdata *drvdata)
diff --git a/drivers/iommu/msm_iommu_hw-v1.h b/drivers/iommu/msm_iommu_hw-v1.h
index f26ca7c..64e951e 100644
--- a/drivers/iommu/msm_iommu_hw-v1.h
+++ b/drivers/iommu/msm_iommu_hw-v1.h
@@ -924,6 +924,7 @@ do { \
 			GET_CONTEXT_FIELD(b, c, CB_TLBSTATUS, SACTIVE)
 
 /* Translation Table Base Control Register: CB_TTBCR */
+/* These are shared between VMSA and LPAE */
 #define GET_CB_TTBCR_EAE(b, c)       GET_CONTEXT_FIELD(b, c, CB_TTBCR, EAE)
 #define SET_CB_TTBCR_EAE(b, c, v)    SET_CONTEXT_FIELD(b, c, CB_TTBCR, EAE, v)
 
@@ -937,6 +938,54 @@ do { \
 #define GET_CB_TTBCR_NSCFG1(b, c)    \
 			GET_CONTEXT_FIELD(b, c, CB_TTBCR, NSCFG1)
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+
+/* LPAE format */
+
+/* Translation Table Base Register 0: CB_TTBR */
+#define SET_TTBR0(b, c, v)       SET_CTX_REG_Q(CB_TTBR0, (b), (c), (v))
+#define SET_TTBR1(b, c, v)       SET_CTX_REG_Q(CB_TTBR1, (b), (c), (v))
+
+#define SET_CB_TTBR0_ASID(b, c, v)  SET_CONTEXT_FIELD_Q(b, c, CB_TTBR0, ASID, v)
+#define SET_CB_TTBR0_ADDR(b, c, v)  SET_CONTEXT_FIELD_Q(b, c, CB_TTBR0, ADDR, v)
+
+#define GET_CB_TTBR0_ASID(b, c)     GET_CONTEXT_FIELD_Q(b, c, CB_TTBR0, ASID)
+#define GET_CB_TTBR0_ADDR(b, c)     GET_CONTEXT_FIELD_Q(b, c, CB_TTBR0, ADDR)
+#define GET_CB_TTBR0(b, c)          GET_CTX_REG_Q(CB_TTBR0, (b), (c))
+
+/* Translation Table Base Control Register: CB_TTBCR */
+#define SET_CB_TTBCR_T0SZ(b, c, v)   SET_CONTEXT_FIELD(b, c, CB_TTBCR, T0SZ, v)
+#define SET_CB_TTBCR_T1SZ(b, c, v)   SET_CONTEXT_FIELD(b, c, CB_TTBCR, T1SZ, v)
+#define SET_CB_TTBCR_EPD0(b, c, v)   SET_CONTEXT_FIELD(b, c, CB_TTBCR, EPD0, v)
+#define SET_CB_TTBCR_EPD1(b, c, v)   SET_CONTEXT_FIELD(b, c, CB_TTBCR, EPD1, v)
+#define SET_CB_TTBCR_IRGN0(b, c, v)  SET_CONTEXT_FIELD(b, c, CB_TTBCR, IRGN0, v)
+#define SET_CB_TTBCR_IRGN1(b, c, v)  SET_CONTEXT_FIELD(b, c, CB_TTBCR, IRGN1, v)
+#define SET_CB_TTBCR_ORGN0(b, c, v)  SET_CONTEXT_FIELD(b, c, CB_TTBCR, ORGN0, v)
+#define SET_CB_TTBCR_ORGN1(b, c, v)  SET_CONTEXT_FIELD(b, c, CB_TTBCR, ORGN1, v)
+#define SET_CB_TTBCR_NSCFG0(b, c, v) \
+				SET_CONTEXT_FIELD(b, c, CB_TTBCR, NSCFG0, v)
+#define SET_CB_TTBCR_NSCFG1(b, c, v) \
+				SET_CONTEXT_FIELD(b, c, CB_TTBCR, NSCFG1, v)
+
+#define SET_CB_TTBCR_SH0(b, c, v)    SET_CONTEXT_FIELD(b, c, CB_TTBCR, SH0, v)
+#define SET_CB_TTBCR_SH1(b, c, v)    SET_CONTEXT_FIELD(b, c, CB_TTBCR, SH1, v)
+#define SET_CB_TTBCR_A1(b, c, v)     SET_CONTEXT_FIELD(b, c, CB_TTBCR, A1, v)
+
+#define GET_CB_TTBCR_T0SZ(b, c)      GET_CONTEXT_FIELD(b, c, CB_TTBCR, T0SZ)
+#define GET_CB_TTBCR_T1SZ(b, c)      GET_CONTEXT_FIELD(b, c, CB_TTBCR, T1SZ)
+#define GET_CB_TTBCR_EPD0(b, c)      GET_CONTEXT_FIELD(b, c, CB_TTBCR, EPD0)
+#define GET_CB_TTBCR_EPD1(b, c)      GET_CONTEXT_FIELD(b, c, CB_TTBCR, EPD1)
+#define GET_CB_TTBCR_IRGN0(b, c, v)  GET_CONTEXT_FIELD(b, c, CB_TTBCR, IRGN0)
+#define GET_CB_TTBCR_IRGN1(b, c, v)  GET_CONTEXT_FIELD(b, c, CB_TTBCR, IRGN1)
+#define GET_CB_TTBCR_ORGN0(b, c, v)  GET_CONTEXT_FIELD(b, c, CB_TTBCR, ORGN0)
+#define GET_CB_TTBCR_ORGN1(b, c, v) GET_CONTEXT_FIELD(b, c, CB_TTBCR, ORGN1)
+
+#define SET_CB_MAIR0(b, c, v)        SET_CTX_REG(CB_MAIR0, (b), (c), (v))
+#define SET_CB_MAIR1(b, c, v)        SET_CTX_REG(CB_MAIR1, (b), (c), (v))
+
+#define GET_CB_MAIR0(b, c)           GET_CTX_REG(CB_MAIR0, (b), (c))
+#define GET_CB_MAIR1(b, c)           GET_CTX_REG(CB_MAIR1, (b), (c))
+#else
 #define SET_TTBR0(b, c, v)           SET_CTX_REG(CB_TTBR0, (b), (c), (v))
 #define SET_TTBR1(b, c, v)           SET_CTX_REG(CB_TTBR1, (b), (c), (v))
 
@@ -956,6 +1005,7 @@ do { \
 #define GET_CB_TTBR0_NOS(b, c)      GET_CONTEXT_FIELD(b, c, CB_TTBR0, NOS)
 #define GET_CB_TTBR0_IRGN0(b, c)    GET_CONTEXT_FIELD(b, c, CB_TTBR0, IRGN0)
 #define GET_CB_TTBR0_ADDR(b, c)     GET_CONTEXT_FIELD(b, c, CB_TTBR0, ADDR)
+#endif
 
 /* Translation Table Base Register 1: CB_TTBR1 */
 #define SET_CB_TTBR1_IRGN1(b, c, v) SET_CONTEXT_FIELD(b, c, CB_TTBR1, IRGN1, v)
@@ -1439,6 +1489,28 @@ do { \
 
 #define CB_TTBR0_ADDR        (CB_TTBR0_ADDR_MASK    << CB_TTBR0_ADDR_SHIFT)
 
+#ifdef CONFIG_MSM_IOMMU_LPAE
+/* Translation Table Base Register: CB_TTBR */
+#define CB_TTBR0_ASID        (CB_TTBR0_ASID_MASK    << CB_TTBR0_ASID_SHIFT)
+#define CB_TTBR1_ASID        (CB_TTBR1_ASID_MASK    << CB_TTBR1_ASID_SHIFT)
+
+/* Translation Table Base Control Register: CB_TTBCR */
+#define CB_TTBCR_T0SZ        (CB_TTBCR_T0SZ_MASK    << CB_TTBCR_T0SZ_SHIFT)
+#define CB_TTBCR_T1SZ        (CB_TTBCR_T1SZ_MASK    << CB_TTBCR_T1SZ_SHIFT)
+#define CB_TTBCR_EPD0        (CB_TTBCR_EPD0_MASK    << CB_TTBCR_EPD0_SHIFT)
+#define CB_TTBCR_EPD1        (CB_TTBCR_EPD1_MASK    << CB_TTBCR_EPD1_SHIFT)
+#define CB_TTBCR_IRGN0       (CB_TTBCR_IRGN0_MASK   << CB_TTBCR_IRGN0_SHIFT)
+#define CB_TTBCR_IRGN1       (CB_TTBCR_IRGN1_MASK   << CB_TTBCR_IRGN1_SHIFT)
+#define CB_TTBCR_ORGN0       (CB_TTBCR_ORGN0_MASK   << CB_TTBCR_ORGN0_SHIFT)
+#define CB_TTBCR_ORGN1       (CB_TTBCR_ORGN1_MASK   << CB_TTBCR_ORGN1_SHIFT)
+#define CB_TTBCR_NSCFG0      (CB_TTBCR_NSCFG0_MASK  << CB_TTBCR_NSCFG0_SHIFT)
+#define CB_TTBCR_NSCFG1      (CB_TTBCR_NSCFG1_MASK  << CB_TTBCR_NSCFG1_SHIFT)
+#define CB_TTBCR_SH0         (CB_TTBCR_SH0_MASK     << CB_TTBCR_SH0_SHIFT)
+#define CB_TTBCR_SH1         (CB_TTBCR_SH1_MASK     << CB_TTBCR_SH1_SHIFT)
+#define CB_TTBCR_A1          (CB_TTBCR_A1_MASK      << CB_TTBCR_A1_SHIFT)
+
+#else
+
 /* Translation Table Base Register 0: CB_TTBR0 */
 #define CB_TTBR0_IRGN1       (CB_TTBR0_IRGN1_MASK   << CB_TTBR0_IRGN1_SHIFT)
 #define CB_TTBR0_S           (CB_TTBR0_S_MASK       << CB_TTBR0_S_SHIFT)
@@ -1452,6 +1524,7 @@ do { \
 #define CB_TTBR1_RGN         (CB_TTBR1_RGN_MASK     << CB_TTBR1_RGN_SHIFT)
 #define CB_TTBR1_NOS         (CB_TTBR1_NOS_MASK     << CB_TTBR1_NOS_SHIFT)
 #define CB_TTBR1_IRGN0       (CB_TTBR1_IRGN0_MASK   << CB_TTBR1_IRGN0_SHIFT)
+#endif
 
 /* Global Register Masks */
 /* Configuration Register 0 */
@@ -1830,6 +1903,12 @@ do { \
 #define CB_TTBCR_A1_MASK           0x01
 #define CB_TTBCR_EAE_MASK          0x01
 
+/* Translation Table Base Register 0/1: CB_TTBR */
+#ifdef CONFIG_MSM_IOMMU_LPAE
+#define CB_TTBR0_ADDR_MASK         0x7FFFFFFFFULL
+#define CB_TTBR0_ASID_MASK         0xFF
+#define CB_TTBR1_ASID_MASK         0xFF
+#else
 #define CB_TTBR0_IRGN1_MASK        0x01
 #define CB_TTBR0_S_MASK            0x01
 #define CB_TTBR0_RGN_MASK          0x01
@@ -1842,6 +1921,7 @@ do { \
 #define CB_TTBR1_RGN_MASK          0x1
 #define CB_TTBR1_NOS_MASK          0X1
 #define CB_TTBR1_IRGN0_MASK        0X1
+#endif
 
 /* Global Register Shifts */
 /* Configuration Register: CR0 */
@@ -2219,6 +2299,11 @@ do { \
 #define CB_TTBCR_SH1_SHIFT          28
 
 /* Translation Table Base Register 0/1: CB_TTBR */
+#ifdef CONFIG_MSM_IOMMU_LPAE
+#define CB_TTBR0_ADDR_SHIFT         5
+#define CB_TTBR0_ASID_SHIFT         48
+#define CB_TTBR1_ASID_SHIFT         48
+#else
 #define CB_TTBR0_IRGN1_SHIFT        0
 #define CB_TTBR0_S_SHIFT            1
 #define CB_TTBR0_RGN_SHIFT          3
@@ -2232,5 +2317,6 @@ do { \
 #define CB_TTBR1_NOS_SHIFT          5
 #define CB_TTBR1_IRGN0_SHIFT        6
 #define CB_TTBR1_ADDR_SHIFT         14
+#endif
 
 #endif
diff --git a/drivers/iommu/msm_iommu_pagetable_lpae.c b/drivers/iommu/msm_iommu_pagetable_lpae.c
new file mode 100644
index 0000000..60908a8
--- /dev/null
+++ b/drivers/iommu/msm_iommu_pagetable_lpae.c
@@ -0,0 +1,717 @@
+/* Copyright (c) 2013-2014, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/errno.h>
+#include <linux/iommu.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+
+#include <asm/cacheflush.h>
+
+#include "msm_iommu_priv.h"
+#include "msm_iommu_pagetable.h"
+
+#define NUM_FL_PTE      4   /* First level */
+#define NUM_SL_PTE      512 /* Second level */
+#define NUM_TL_PTE      512 /* Third level */
+
+#define PTE_SIZE	8
+
+#define FL_ALIGN	0x20
+
+/* First-level/second-level page table bits */
+#define FL_OFFSET(va)           (((va) & 0xC0000000) >> 30)
+
+#define FLSL_BASE_MASK            (0xFFFFFFF000ULL)
+#define FLSL_1G_BLOCK_MASK        (0xFFC0000000ULL)
+#define FLSL_BLOCK_MASK           (0xFFFFE00000ULL)
+#define FLSL_TYPE_BLOCK           (1 << 0)
+#define FLSL_TYPE_TABLE           (3 << 0)
+#define FLSL_PTE_TYPE_MASK        (3 << 0)
+#define FLSL_APTABLE_RO           (2 << 61)
+#define FLSL_APTABLE_RW           (0 << 61)
+
+#define FL_TYPE_SECT              (2 << 0)
+#define FL_SUPERSECTION           (1 << 18)
+#define FL_AP0                    (1 << 10)
+#define FL_AP1                    (1 << 11)
+#define FL_AP2                    (1 << 15)
+#define FL_SHARED                 (1 << 16)
+#define FL_BUFFERABLE             (1 << 2)
+#define FL_CACHEABLE              (1 << 3)
+#define FL_TEX0                   (1 << 12)
+#define FL_NG                     (1 << 17)
+
+/* Second-level page table bits */
+#define SL_OFFSET(va)             (((va) & 0x3FE00000) >> 21)
+
+/* Third-level page table bits */
+#define TL_OFFSET(va)             (((va) & 0x1FF000) >> 12)
+
+#define TL_TYPE_PAGE              (3 << 0)
+#define TL_PAGE_MASK              (0xFFFFFFF000ULL)
+#define TL_ATTR_INDEX_MASK        (0x7)
+#define TL_ATTR_INDEX_SHIFT       (0x2)
+#define TL_NS                     (0x1 << 5)
+#define TL_AP_RO                  (0x3 << 6) /* Access Permission: R */
+#define TL_AP_RW                  (0x1 << 6) /* Access Permission: RW */
+#define TL_SH_ISH                 (0x3 << 8) /* Inner shareable */
+#define TL_SH_OSH                 (0x2 << 8) /* Outer shareable */
+#define TL_SH_NSH                 (0x0 << 8) /* Non-shareable */
+#define TL_AF                     (0x1 << 10)  /* Access Flag */
+#define TL_NG                     (0x1 << 11) /* Non-Global */
+#define TL_CH                     (0x1ULL << 52) /* Contiguous hint */
+#define TL_PXN                    (0x1ULL << 53) /* Privilege Execute Never */
+#define TL_XN                     (0x1ULL << 54) /* Execute Never */
+
+/* normal non-cacheable */
+#define PTE_MT_BUFFERABLE         (1 << 2)
+/* normal inner write-alloc */
+#define PTE_MT_WRITEALLOC         (7 << 2)
+
+#define PTE_MT_MASK               (7 << 2)
+
+#define FOLLOW_TO_NEXT_TABLE(pte) ((u64 *) __va(((*pte) & FLSL_BASE_MASK)))
+
+static void __msm_iommu_pagetable_unmap_range(struct msm_iommu_pt *pt, u32 va,
+					      u32 len, u32 silent);
+
+static inline void clean_pte(u64 *start, u64 *end,
+				s32 redirect)
+{
+	if (!redirect)
+		dmac_flush_range(start, end);
+}
+
+s32 msm_iommu_pagetable_alloc(struct msm_iommu_pt *pt)
+{
+	u32 size = PTE_SIZE * NUM_FL_PTE + FL_ALIGN;
+	phys_addr_t fl_table_phys;
+
+	pt->unaligned_fl_table = kzalloc(size, GFP_KERNEL);
+	if (!pt->unaligned_fl_table)
+		return -ENOMEM;
+
+
+	fl_table_phys = virt_to_phys(pt->unaligned_fl_table);
+	fl_table_phys = ALIGN(fl_table_phys, FL_ALIGN);
+	pt->fl_table = phys_to_virt(fl_table_phys);
+
+	pt->sl_table_shadow = kcalloc(NUM_FL_PTE, sizeof(u64 *), GFP_KERNEL);
+	if (!pt->sl_table_shadow) {
+		kfree(pt->unaligned_fl_table);
+		return -ENOMEM;
+	}
+	clean_pte(pt->fl_table, pt->fl_table + NUM_FL_PTE, pt->redirect);
+	return 0;
+}
+
+void msm_iommu_pagetable_free(struct msm_iommu_pt *pt)
+{
+	s32 i;
+	u64 *fl_table = pt->fl_table;
+
+	for (i = 0; i < NUM_FL_PTE; ++i) {
+		if ((fl_table[i] & FLSL_TYPE_TABLE) == FLSL_TYPE_TABLE) {
+			u64 p = fl_table[i] & FLSL_BASE_MASK;
+
+			free_page((u32)phys_to_virt(p));
+		}
+		if ((pt->sl_table_shadow[i]))
+			free_page((u32)pt->sl_table_shadow[i]);
+	}
+	kfree(pt->unaligned_fl_table);
+
+	pt->unaligned_fl_table = 0;
+	pt->fl_table = 0;
+
+	kfree(pt->sl_table_shadow);
+}
+
+void msm_iommu_pagetable_free_tables(struct msm_iommu_pt *pt, unsigned long va,
+				 size_t len)
+{
+	/*
+	 * Adding 2 for worst case. We could be spanning 3 second level pages
+	 * if we unmapped just over 2MB.
+	 */
+	u32 n_entries = len / SZ_2M + 2;
+	u32 fl_offset = FL_OFFSET(va);
+	u32 sl_offset = SL_OFFSET(va);
+	u32 i;
+
+	for (i = 0; i < n_entries && fl_offset < NUM_FL_PTE; ++i) {
+		void *tl_table_va;
+		u32 entry;
+		u64 *sl_pte_shadow;
+
+		sl_pte_shadow = pt->sl_table_shadow[fl_offset];
+		if (!sl_pte_shadow)
+			break;
+		sl_pte_shadow += sl_offset;
+		entry = *sl_pte_shadow;
+		tl_table_va = __va(((*sl_pte_shadow) & ~0xFFF));
+
+		if (entry && !(entry & 0xFFF)) {
+			free_page((unsigned long)tl_table_va);
+			*sl_pte_shadow = 0;
+		}
+		++sl_offset;
+		if (sl_offset >= NUM_TL_PTE) {
+			sl_offset = 0;
+			++fl_offset;
+		}
+	}
+}
+
+
+#ifdef CONFIG_ARM_LPAE
+/*
+ * If LPAE is enabled in the ARM processor then just use the same
+ * cache policy as the kernel for the SMMU cached mappings.
+ */
+static inline u32 __get_cache_attr(void)
+{
+	return pgprot_kernel & PTE_MT_MASK;
+}
+#else
+/*
+ * If LPAE is NOT enabled in the ARM processor then hard code the policy.
+ * This is mostly for debugging so that we can enable SMMU LPAE without
+ * ARM CPU LPAE.
+ */
+static inline u32 __get_cache_attr(void)
+{
+	return PTE_MT_WRITEALLOC;
+}
+
+#endif
+
+/*
+ * Get the IOMMU attributes for the ARM LPAE long descriptor format page
+ * table entry bits. The only upper attribute bits we currently use is the
+ * contiguous bit which is set when we actually have a contiguous mapping.
+ * Lower attribute bits specify memory attributes and the protection
+ * (Read/Write/Execute).
+ */
+static inline void __get_attr(s32 prot, u64 *upper_attr, u64 *lower_attr)
+{
+	u32 attr_idx = PTE_MT_BUFFERABLE;
+
+	*upper_attr = 0;
+	*lower_attr = 0;
+
+	if (!(prot & (IOMMU_READ | IOMMU_WRITE))) {
+		prot |= IOMMU_READ | IOMMU_WRITE;
+		WARN_ONCE(1, "No attributes in iommu mapping; assuming RW\n");
+	}
+
+	if ((prot & IOMMU_WRITE) && !(prot & IOMMU_READ)) {
+		prot |= IOMMU_READ;
+		WARN_ONCE(1, "Write-only unsupported; falling back to RW\n");
+	}
+
+	if (prot & IOMMU_CACHE)
+		attr_idx = __get_cache_attr();
+
+	*lower_attr |= attr_idx;
+	*lower_attr |= TL_NG | TL_AF;
+	*lower_attr |= (prot & IOMMU_CACHE) ? TL_SH_ISH : TL_SH_NSH;
+	*lower_attr |= (prot & IOMMU_WRITE) ? TL_AP_RW : TL_AP_RO;
+}
+
+static inline u64 *make_second_level_tbl(struct msm_iommu_pt *pt, u32 offset)
+{
+	u64 *sl = (u64 *) __get_free_page(GFP_KERNEL);
+	u64 *fl_pte = pt->fl_table + offset;
+
+	if (!sl) {
+		pr_err("Could not allocate second level table\n");
+		goto fail;
+	}
+
+	pt->sl_table_shadow[offset] = (u64 *) __get_free_page(GFP_KERNEL);
+	if (!pt->sl_table_shadow[offset]) {
+		free_page((u32) sl);
+		pr_err("Could not allocate second level shadow table\n");
+		goto fail;
+	}
+
+	memset(sl, 0, SZ_4K);
+	memset(pt->sl_table_shadow[offset], 0, SZ_4K);
+	clean_pte(sl, sl + NUM_SL_PTE, pt->redirect);
+
+	/* Leave APTable bits 0 to let next level decide access permissions */
+	*fl_pte = (((phys_addr_t)__pa(sl)) & FLSL_BASE_MASK) | FLSL_TYPE_TABLE;
+	clean_pte(fl_pte, fl_pte + 1, pt->redirect);
+fail:
+	return sl;
+}
+
+static inline u64 *make_third_level_tbl(s32 redirect, u64 *sl_pte,
+					u64 *sl_pte_shadow)
+{
+	u64 *tl = (u64 *) __get_free_page(GFP_KERNEL);
+
+	if (!tl) {
+		pr_err("Could not allocate third level table\n");
+		goto fail;
+	}
+	memset(tl, 0, SZ_4K);
+	clean_pte(tl, tl + NUM_TL_PTE, redirect);
+
+	/* Leave APTable bits 0 to let next level decide access permissions */
+	*sl_pte = (((phys_addr_t)__pa(tl)) & FLSL_BASE_MASK) | FLSL_TYPE_TABLE;
+	*sl_pte_shadow = *sl_pte & ~0xFFF;
+	clean_pte(sl_pte, sl_pte + 1, redirect);
+fail:
+	return tl;
+}
+
+static inline s32 tl_4k_map(u64 *tl_pte, phys_addr_t pa,
+			    u64 upper_attr, u64 lower_attr, s32 redirect)
+{
+	s32 ret = 0;
+
+	if (*tl_pte) {
+		ret = -EBUSY;
+		goto fail;
+	}
+
+	*tl_pte = upper_attr | (pa & TL_PAGE_MASK) | lower_attr | TL_TYPE_PAGE;
+	clean_pte(tl_pte, tl_pte + 1, redirect);
+fail:
+	return ret;
+}
+
+static inline s32 tl_64k_map(u64 *tl_pte, phys_addr_t pa,
+			     u64 upper_attr, u64 lower_attr, s32 redirect)
+{
+	s32 ret = 0;
+	s32 i;
+
+	for (i = 0; i < 16; ++i)
+		if (*(tl_pte+i)) {
+			ret = -EBUSY;
+			goto fail;
+		}
+
+	/* Add Contiguous hint TL_CH */
+	upper_attr |= TL_CH;
+
+	for (i = 0; i < 16; ++i)
+		*(tl_pte+i) = upper_attr | (pa & TL_PAGE_MASK) |
+			      lower_attr | TL_TYPE_PAGE;
+	clean_pte(tl_pte, tl_pte + 16, redirect);
+fail:
+	return ret;
+}
+
+static inline s32 sl_2m_map(u64 *sl_pte, phys_addr_t pa,
+			    u64 upper_attr, u64 lower_attr, s32 redirect)
+{
+	s32 ret = 0;
+
+	if (*sl_pte) {
+		ret = -EBUSY;
+		goto fail;
+	}
+
+	*sl_pte = upper_attr | (pa & FLSL_BLOCK_MASK) |
+		  lower_attr | FLSL_TYPE_BLOCK;
+	clean_pte(sl_pte, sl_pte + 1, redirect);
+fail:
+	return ret;
+}
+
+static inline s32 sl_32m_map(u64 *sl_pte, phys_addr_t pa,
+			     u64 upper_attr, u64 lower_attr, s32 redirect)
+{
+	s32 i;
+	s32 ret = 0;
+
+	for (i = 0; i < 16; ++i) {
+		if (*(sl_pte+i)) {
+			ret = -EBUSY;
+			goto fail;
+		}
+	}
+
+	/* Add Contiguous hint TL_CH */
+	upper_attr |= TL_CH;
+
+	for (i = 0; i < 16; ++i)
+		*(sl_pte+i) = upper_attr | (pa & FLSL_BLOCK_MASK) |
+			      lower_attr | FLSL_TYPE_BLOCK;
+	clean_pte(sl_pte, sl_pte + 16, redirect);
+fail:
+	return ret;
+}
+
+static inline s32 fl_1G_map(u64 *fl_pte, phys_addr_t pa,
+			    u64 upper_attr, u64 lower_attr, s32 redirect)
+{
+	s32 ret = 0;
+
+	if (*fl_pte) {
+		ret = -EBUSY;
+		goto fail;
+	}
+
+	*fl_pte = upper_attr | (pa & FLSL_1G_BLOCK_MASK) |
+		  lower_attr | FLSL_TYPE_BLOCK;
+
+	clean_pte(fl_pte, fl_pte + 1, redirect);
+fail:
+	return ret;
+}
+
+static inline s32 common_error_check(size_t len, u64 const *fl_table)
+{
+	s32 ret = 0;
+
+	if (len != SZ_1G && len != SZ_32M && len != SZ_2M &&
+	    len != SZ_64K && len != SZ_4K) {
+		pr_err("Bad length: %d\n", len);
+		ret = -EINVAL;
+	} else if (!fl_table) {
+		pr_err("Null page table\n");
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+static inline s32 handle_1st_lvl(struct msm_iommu_pt *pt, u32 offset,
+				 phys_addr_t pa, size_t len, u64 upper_attr,
+				 u64 lower_attr)
+{
+	s32 ret = 0;
+	u64 *fl_pte = pt->fl_table + offset;
+
+	if (len == SZ_1G) {
+		ret = fl_1G_map(fl_pte, pa, upper_attr, lower_attr,
+				pt->redirect);
+	} else {
+		/* Need second level page table */
+		if (*fl_pte == 0) {
+			if (make_second_level_tbl(pt, offset) == NULL)
+				ret = -ENOMEM;
+		}
+		if (!ret) {
+			if ((*fl_pte & FLSL_TYPE_TABLE) != FLSL_TYPE_TABLE)
+				ret = -EBUSY;
+		}
+	}
+	return ret;
+}
+
+static inline s32 handle_3rd_lvl(u64 *sl_pte, u64 *sl_pte_shadow, u32 va,
+				 phys_addr_t pa, u64 upper_attr,
+				 u64 lower_attr, size_t len, s32 redirect)
+{
+	u64 *tl_table;
+	u64 *tl_pte;
+	u32 tl_offset;
+	s32 ret = 0;
+	u32 n_entries;
+
+	/* Need a 3rd level table */
+	if (*sl_pte == 0) {
+		if (make_third_level_tbl(redirect, sl_pte, sl_pte_shadow)
+					 == NULL) {
+			ret = -ENOMEM;
+			goto fail;
+		}
+	}
+
+	if ((*sl_pte & FLSL_TYPE_TABLE) != FLSL_TYPE_TABLE) {
+		ret = -EBUSY;
+		goto fail;
+	}
+
+	tl_table = FOLLOW_TO_NEXT_TABLE(sl_pte);
+	tl_offset = TL_OFFSET(va);
+	tl_pte = tl_table + tl_offset;
+
+	if (len == SZ_64K) {
+		ret = tl_64k_map(tl_pte, pa, upper_attr, lower_attr, redirect);
+		n_entries = 16;
+	} else {
+		ret = tl_4k_map(tl_pte, pa, upper_attr, lower_attr, redirect);
+		n_entries = 1;
+	}
+
+	/* Increment map count */
+	if (!ret)
+		*sl_pte_shadow += n_entries;
+
+fail:
+	return ret;
+}
+
+int msm_iommu_pagetable_map(struct msm_iommu_pt *pt, unsigned long va,
+			    phys_addr_t pa, size_t len, int prot)
+{
+	s32 ret;
+	struct scatterlist sg;
+
+	ret = common_error_check(len, pt->fl_table);
+	if (ret)
+		goto fail;
+
+	sg_init_table(&sg, 1);
+	sg_dma_address(&sg) = pa;
+	sg.length = len;
+
+	ret = msm_iommu_pagetable_map_range(pt, va, &sg, len, prot);
+
+fail:
+	return ret;
+}
+
+static void fl_1G_unmap(u64 *fl_pte, s32 redirect)
+{
+	*fl_pte = 0;
+	clean_pte(fl_pte, fl_pte + 1, redirect);
+}
+
+size_t msm_iommu_pagetable_unmap(struct msm_iommu_pt *pt, unsigned long va,
+				size_t len)
+{
+	msm_iommu_pagetable_unmap_range(pt, va, len);
+	return len;
+}
+
+static phys_addr_t get_phys_addr(struct scatterlist *sg)
+{
+	/*
+	 * Try sg_dma_address first so that we can
+	 * map carveout regions that do not have a
+	 * struct page associated with them.
+	 */
+	phys_addr_t pa = sg_dma_address(sg);
+
+	if (pa == 0)
+		pa = sg_phys(sg);
+	return pa;
+}
+
+#ifdef CONFIG_IOMMU_FORCE_4K_MAPPINGS
+static inline int is_fully_aligned(unsigned int va, phys_addr_t pa, size_t len,
+				   int align)
+{
+	if (align == SZ_4K)
+		return  IS_ALIGNED(va | pa, align) && (len >= align);
+	else
+		return 0;
+}
+#else
+static inline int is_fully_aligned(unsigned int va, phys_addr_t pa, size_t len,
+				   int align)
+{
+	return  IS_ALIGNED(va | pa, align) && (len >= align);
+}
+#endif
+
+s32 msm_iommu_pagetable_map_range(struct msm_iommu_pt *pt, u32 va,
+		       struct scatterlist *sg, u32 len, s32 prot)
+{
+	phys_addr_t pa;
+	u32 offset = 0;
+	u64 *fl_pte;
+	u64 *sl_pte;
+	u64 *sl_pte_shadow;
+	u32 fl_offset;
+	u32 sl_offset;
+	u64 *sl_table = NULL;
+	u32 chunk_size, chunk_offset = 0;
+	s32 ret = 0;
+	u64 up_at;
+	u64 lo_at;
+	u32 redirect = pt->redirect;
+	unsigned int start_va = va;
+
+	BUG_ON(len & (SZ_4K - 1));
+
+	if (!pt->fl_table) {
+		pr_err("Null page table\n");
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	__get_attr(prot, &up_at, &lo_at);
+
+	pa = get_phys_addr(sg);
+
+	while (offset < len) {
+		u32 chunk_left = sg->length - chunk_offset;
+
+		fl_offset = FL_OFFSET(va);
+		fl_pte = pt->fl_table + fl_offset;
+
+		chunk_size = SZ_4K;
+		if (is_fully_aligned(va, pa, chunk_left, SZ_1G))
+			chunk_size = SZ_1G;
+		else if (is_fully_aligned(va, pa, chunk_left, SZ_32M))
+			chunk_size = SZ_32M;
+		else if (is_fully_aligned(va, pa, chunk_left, SZ_2M))
+			chunk_size = SZ_2M;
+		else if (is_fully_aligned(va, pa, chunk_left, SZ_64K))
+			chunk_size = SZ_64K;
+
+		ret = handle_1st_lvl(pt, fl_offset, pa, chunk_size,
+				     up_at, lo_at);
+		if (ret)
+			goto fail;
+
+		sl_table = FOLLOW_TO_NEXT_TABLE(fl_pte);
+		sl_offset = SL_OFFSET(va);
+		sl_pte = sl_table + sl_offset;
+		sl_pte_shadow = pt->sl_table_shadow[fl_offset] + sl_offset;
+
+		if (chunk_size == SZ_32M)
+			ret = sl_32m_map(sl_pte, pa, up_at, lo_at, redirect);
+		else if (chunk_size == SZ_2M)
+			ret = sl_2m_map(sl_pte, pa, up_at, lo_at, redirect);
+		else if (chunk_size == SZ_64K || chunk_size == SZ_4K)
+			ret = handle_3rd_lvl(sl_pte, sl_pte_shadow, va, pa,
+					     up_at, lo_at, chunk_size,
+					     redirect);
+		if (ret)
+			goto fail;
+
+		offset += chunk_size;
+		chunk_offset += chunk_size;
+		va += chunk_size;
+		pa += chunk_size;
+
+		if (chunk_offset >= sg->length && offset < len) {
+			chunk_offset = 0;
+			sg = sg_next(sg);
+			pa = get_phys_addr(sg);
+		}
+	}
+fail:
+	if (ret && offset > 0)
+		__msm_iommu_pagetable_unmap_range(pt, start_va, offset, 1);
+	return ret;
+}
+
+void msm_iommu_pagetable_unmap_range(struct msm_iommu_pt *pt, u32 va, u32 len)
+{
+	__msm_iommu_pagetable_unmap_range(pt, va, len, 0);
+}
+
+static void __msm_iommu_pagetable_unmap_range(struct msm_iommu_pt *pt, u32 va,
+					      u32 len, u32 silent)
+{
+	u32 offset = 0;
+	u64 *fl_pte;
+	u64 *sl_pte;
+	u64 *tl_pte;
+	u32 fl_offset;
+	u32 sl_offset;
+	u64 *sl_table;
+	u64 *tl_table;
+	u32 tl_start, tl_end;
+	u32 redirect = pt->redirect;
+
+	BUG_ON(len & (SZ_4K - 1));
+
+	while (offset < len) {
+		u32 entries;
+		u32 left_to_unmap = len - offset;
+		u32 type;
+
+		fl_offset = FL_OFFSET(va);
+		fl_pte = pt->fl_table + fl_offset;
+
+		if (*fl_pte == 0) {
+			if (!silent)
+				pr_err("First level PTE is 0 at index 0x%x (offset: 0x%x)\n",
+					fl_offset, offset);
+			return;
+		}
+		type = *fl_pte & FLSL_PTE_TYPE_MASK;
+
+		if (type == FLSL_TYPE_BLOCK) {
+			fl_1G_unmap(fl_pte, redirect);
+			va += SZ_1G;
+			offset += SZ_1G;
+		} else if (type == FLSL_TYPE_TABLE) {
+			sl_table = FOLLOW_TO_NEXT_TABLE(fl_pte);
+			sl_offset = SL_OFFSET(va);
+			sl_pte = sl_table + sl_offset;
+			type = *sl_pte & FLSL_PTE_TYPE_MASK;
+
+			if (type == FLSL_TYPE_BLOCK) {
+				*sl_pte = 0;
+
+				clean_pte(sl_pte, sl_pte + 1, redirect);
+
+				offset += SZ_2M;
+				va += SZ_2M;
+			} else if (type == FLSL_TYPE_TABLE) {
+				u64 *sl_pte_shadow =
+				    pt->sl_table_shadow[fl_offset] + sl_offset;
+
+				tl_start = TL_OFFSET(va);
+				tl_table =  FOLLOW_TO_NEXT_TABLE(sl_pte);
+				tl_end = (left_to_unmap / SZ_4K) + tl_start;
+
+				if (tl_end > NUM_TL_PTE)
+					tl_end = NUM_TL_PTE;
+
+				entries = tl_end - tl_start;
+
+				memset(tl_table + tl_start, 0,
+				       entries * sizeof(*tl_pte));
+
+				clean_pte(tl_table + tl_start,
+					  tl_table + tl_end, redirect);
+
+				BUG_ON((*sl_pte_shadow & 0xFFF) < entries);
+
+				/* Decrement map count */
+				*sl_pte_shadow -= entries;
+
+				if (!(*sl_pte_shadow & 0xFFF)) {
+					*sl_pte = 0;
+					clean_pte(sl_pte, sl_pte + 1,
+						  pt->redirect);
+				}
+
+				offset += entries * SZ_4K;
+				va += entries * SZ_4K;
+			} else {
+				if (!silent)
+					pr_err("Second level PTE (0x%llx) is invalid at index 0x%x (offset: 0x%x)\n",
+						*sl_pte, sl_offset, offset);
+			}
+		} else {
+			if (!silent)
+				pr_err("First level PTE (0x%llx) is invalid at index 0x%x (offset: 0x%x)\n",
+					*fl_pte, fl_offset, offset);
+		}
+	}
+}
+
+phys_addr_t msm_iommu_iova_to_phys_soft(struct iommu_domain *domain,
+							phys_addr_t va)
+{
+	pr_err("iova_to_phys is not implemented for LPAE\n");
+	return 0;
+}
+
+void __init msm_iommu_pagetable_init(void)
+{
+}
diff --git a/drivers/iommu/msm_iommu_priv.h b/drivers/iommu/msm_iommu_priv.h
index 031e6b4..1064d89 100644
--- a/drivers/iommu/msm_iommu_priv.h
+++ b/drivers/iommu/msm_iommu_priv.h
@@ -31,13 +31,23 @@
  * clients trying to unmap an address that is being used.
  * fl_table_shadow will use the lower 9 bits for the use count and the upper
  * bits for the second level page table address.
+ * sl_table_shadow uses the same concept as fl_table_shadow but for LPAE 2nd
+ * level page tables.
  */
+#ifdef CONFIG_MSM_IOMMU_LPAE
+struct msm_iommu_pt {
+	u64 *fl_table;
+	u64 **sl_table_shadow;
+	int redirect;
+	u64 *unaligned_fl_table;
+};
+#else
 struct msm_iommu_pt {
 	u32 *fl_table;
 	int redirect;
 	u32 *fl_table_shadow;
 };
-
+#endif
 /**
  * struct msm_iommu_priv - Container for page table attributes and other
  * private iommu domain information.
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [RFC/PATCH 6/7] defconfig: msm: Enable Qualcomm SMMUv1 driver
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
                   ` (3 preceding siblings ...)
  2014-06-30 16:51 ` [RFC/PATCH 5/7] iommu: msm: Add support for V7L page table format Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-06-30 16:51 ` [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW Olav Haugan
       [not found] ` <1404147116-4598-5-git-send-email-ohaugan@codeaurora.org>
  6 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

Enable the Qualcomm SMMUv1 driver, allowing bus masters to operate on
physically discontiguous memory.

Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
---
 arch/arm/configs/qcom_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/configs/qcom_defconfig b/arch/arm/configs/qcom_defconfig
index 0414889..12838bb 100644
--- a/arch/arm/configs/qcom_defconfig
+++ b/arch/arm/configs/qcom_defconfig
@@ -137,6 +137,7 @@ CONFIG_MSM_GCC_8660=y
 CONFIG_MSM_MMCC_8960=y
 CONFIG_MSM_MMCC_8974=y
 CONFIG_MSM_IOMMU_V0=y
+CONFIG_MSM_IOMMU_V1=y
 CONFIG_GENERIC_PHY=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT2_FS_XATTR=y
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW
  2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
                   ` (4 preceding siblings ...)
  2014-06-30 16:51 ` [RFC/PATCH 6/7] defconfig: msm: Enable Qualcomm SMMUv1 driver Olav Haugan
@ 2014-06-30 16:51 ` Olav Haugan
  2014-07-01  8:49   ` Varun Sethi
       [not found] ` <1404147116-4598-5-git-send-email-ohaugan@codeaurora.org>
  6 siblings, 1 reply; 29+ messages in thread
From: Olav Haugan @ 2014-06-30 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

Add a new iommu domain attribute that can be used to enable cache
coherent hardware table walks (HTW) by the SMMU. HTW might be supported
by the SMMU HW, but depending on the use case and how the SMMU is used
in the SoC it is not always beneficial to turn on coherent HTW for all
domains/IOMMUs.
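
For illustration, a bus master driver could opt in to coherent walks for
its domain roughly as follows (a minimal sketch; only the attribute name
comes from this patch, the helper name and call site are made up):

	#include <linux/iommu.h>

	static int example_enable_coherent_htw(struct iommu_domain *domain)
	{
		s32 htw = 1;	/* non-zero: route table walks via the cache */

		return iommu_domain_set_attr(domain, DOMAIN_ATTR_COHERENT_HTW,
					     &htw);
	}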

Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
---
 drivers/iommu/msm_iommu-v1.c | 16 ++++++++++++++++
 include/linux/iommu.h        |  1 +
 2 files changed, 17 insertions(+)

diff --git a/drivers/iommu/msm_iommu-v1.c b/drivers/iommu/msm_iommu-v1.c
index 2c574ef..e163ffc 100644
--- a/drivers/iommu/msm_iommu-v1.c
+++ b/drivers/iommu/msm_iommu-v1.c
@@ -1456,8 +1456,16 @@ static int msm_domain_get_attr(struct iommu_domain *domain,
 			       enum iommu_attr attr, void *data)
 {
 	s32 ret = 0;
+	struct msm_iommu_priv *priv = domain->priv;
 
 	switch (attr) {
+	case DOMAIN_ATTR_COHERENT_HTW:
+	{
+		s32 *int_ptr = (s32 *) data;
+
+		*int_ptr = priv->pt.redirect;
+		break;
+	}
 	default:
 		pr_err("Unsupported attribute type\n");
 		ret = -EINVAL;
@@ -1471,8 +1479,16 @@ static int msm_domain_set_attr(struct iommu_domain *domain,
 			       enum iommu_attr attr, void *data)
 {
 	s32 ret = 0;
+	struct msm_iommu_priv *priv = domain->priv;
 
 	switch (attr) {
+	case DOMAIN_ATTR_COHERENT_HTW:
+	{
+		s32 *int_ptr = (s32 *) data;
+
+		priv->pt.redirect = *int_ptr;
+		break;
+	}
 	default:
 		pr_err("Unsupported attribute type\n");
 		ret = -EINVAL;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 63dca6d..6d9596d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -81,6 +81,7 @@ enum iommu_attr {
 	DOMAIN_ATTR_FSL_PAMU_STASH,
 	DOMAIN_ATTR_FSL_PAMU_ENABLE,
 	DOMAIN_ATTR_FSL_PAMUV1,
+	DOMAIN_ATTR_COHERENT_HTW,
 	DOMAIN_ATTR_MAX,
 };
 
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC/PATCH 4/7] iommu: msm: Add MSM IOMMUv1 driver
       [not found] ` <1404147116-4598-5-git-send-email-ohaugan@codeaurora.org>
@ 2014-06-30 17:02   ` Will Deacon
  2014-07-02 22:32     ` Olav Haugan
  0 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2014-06-30 17:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Olav,

On Mon, Jun 30, 2014 at 05:51:53PM +0100, Olav Haugan wrote:
> MSM IOMMUv1 driver supports Qualcomm SoC MSM8974 and
> MSM8084.
> 
> The IOMMU driver supports the following features:
> 
>     - ARM V7S page table format independent of ARM CPU page table format
>     - 4K/64K/1M/16M mappings (V7S)
>     - ATOS used for unit testing of driver
>     - Sharing of page tables among SMMUs
>     - Verbose context bank fault reporting
>     - Verbose global fault reporting
>     - Support for clocks and GDSC
>     - map/unmap range
>     - Domain specific enabling of coherent Hardware Table Walk (HTW)
> 
> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
> ---
>  .../devicetree/bindings/iommu/msm,iommu_v1.txt     |   56 +
>  drivers/iommu/Kconfig                              |   36 +
>  drivers/iommu/Makefile                             |    2 +
>  drivers/iommu/msm_iommu-v1.c                       | 1448 +++++++++++++
>  drivers/iommu/msm_iommu.c                          |  149 ++
>  drivers/iommu/msm_iommu_dev-v1.c                   |  340 +++
>  drivers/iommu/msm_iommu_hw-v1.h                    | 2236 ++++++++++++++++++++
>  drivers/iommu/msm_iommu_pagetable.c                |  600 ++++++
>  drivers/iommu/msm_iommu_pagetable.h                |   33 +
>  drivers/iommu/msm_iommu_priv.h                     |   55 +
>  include/linux/qcom_iommu.h                         |  221 ++
>  11 files changed, 5176 insertions(+)

This patch is *huge*! It may get bounced from some lists (I think the
linux-arm-kernel list has a ~100k limit), so it might be worth trying to do
this incrementally.

That said, a quick glance at your code indicates that this IOMMU is
compliant with the ARM SMMU architecture, and we already have a driver for
that. Please can you rework this series to build on top of the code in
mainline already, rather than simply duplicating it? We need fewer IOMMU
drivers, not more!

It's also worth talking to Varun Sethi, as he was already looking at
implementing block mappings in the existing driver.

Thanks,

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
@ 2014-06-30 19:42   ` Thierry Reding
  2014-07-01  9:33   ` Will Deacon
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Thierry Reding @ 2014-06-30 19:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 30, 2014 at 09:51:51AM -0700, Olav Haugan wrote:
[...]
> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
> +		    struct scatterlist *sg, unsigned int len, int prot)
> +{
> +	if (unlikely(domain->ops->map_range == NULL))
> +		return -ENODEV;

Should we perhaps make this mandatory? For drivers that don't provide it
we could implement a generic helper that wraps iommu_{map,unmap}().

> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->map_range(domain, iova, sg, len, prot);
> +}
> +EXPORT_SYMBOL_GPL(iommu_map_range);
> +
> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
> +		      unsigned int len)
> +{
> +	if (unlikely(domain->ops->unmap_range == NULL))
> +		return -ENODEV;
> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->unmap_range(domain, iova, len);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unmap_range);

Could these be renamed iommu_{map,unmap}_sg() instead to make it more
obvious what exactly they map? And perhaps this could take an sg_table
instead, which already provides a count and is a very common structure
used in drivers (and the DMA mapping API).

Thierry

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros
  2014-06-30 16:51 ` [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros Olav Haugan
@ 2014-06-30 19:46   ` Thierry Reding
  2014-07-01  9:40   ` Will Deacon
  1 sibling, 0 replies; 29+ messages in thread
From: Thierry Reding @ 2014-06-30 19:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 30, 2014 at 09:51:52AM -0700, Olav Haugan wrote:
[...]
> diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
[...]
> +/**
> + * readl_poll_timeout - Periodically poll an address until a condition is met or a timeout occurs
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + * @sleep_us: Maximum time to sleep between reads in uS (0 tight-loops)

s/uS/us/ here and elsewhere. S is the symbol for Siemens.

> + * @timeout_us: Timeout in uS, 0 means never timeout
> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> + * case, the last read value at @addr is stored in @val. Must not
> + * be called from atomic context if sleep_us or timeout_us are used.
> + */
> +#define readl_poll_timeout(addr, val, cond, sleep_us, timeout_us) \
> +({ \
> +	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
> +	might_sleep_if(timeout_us); \
> +	for (;;) { \
> +		(val) = readl(addr); \
> +		if (cond) \
> +			break; \
> +		if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
> +			(val) = readl(addr); \
> +			break; \
> +		} \
> +		if (sleep_us) \
> +			usleep_range(DIV_ROUND_UP(sleep_us, 4), sleep_us); \
> +	} \
> +	(cond) ? 0 : -ETIMEDOUT; \
> +})

Why can't these be functions?

Thierry

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW
  2014-06-30 16:51 ` [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW Olav Haugan
@ 2014-07-01  8:49   ` Varun Sethi
  2014-07-02 22:11     ` Olav Haugan
  0 siblings, 1 reply; 29+ messages in thread
From: Varun Sethi @ 2014-07-01  8:49 UTC (permalink / raw)
  To: linux-arm-kernel



> -----Original Message-----
> From: iommu-bounces at lists.linux-foundation.org [mailto:iommu-
> bounces at lists.linux-foundation.org] On Behalf Of Olav Haugan
> Sent: Monday, June 30, 2014 10:22 PM
> To: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
> foundation.org
> Cc: linux-arm-msm at vger.kernel.org; will.deacon at arm.com;
> thierry.reding at gmail.com; vgandhi at codeaurora.org
> Subject: [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable
> coherent HTW
> 
> Add a new iommu domain attribute that can be used to enable cache
> coherent hardware table walks (HTW) by the SMMU. HTW might be supported
> by the SMMU HW but depending on the use case and the usage of the SMMU in
> the SoC it might not be always beneficial to always turn on coherent HTW
> for all domains/iommu's.
> 
[Sethi Varun-B16395] Why won't you want to use the coherent table walk feature?

> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
> ---
>  drivers/iommu/msm_iommu-v1.c | 16 ++++++++++++++++
>  include/linux/iommu.h        |  1 +
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/iommu/msm_iommu-v1.c b/drivers/iommu/msm_iommu-v1.c
> index 2c574ef..e163ffc 100644
> --- a/drivers/iommu/msm_iommu-v1.c
> +++ b/drivers/iommu/msm_iommu-v1.c
> @@ -1456,8 +1456,16 @@ static int msm_domain_get_attr(struct iommu_domain
> *domain,
>  			       enum iommu_attr attr, void *data)  {
>  	s32 ret = 0;
> +	struct msm_iommu_priv *priv = domain->priv;
> 
>  	switch (attr) {
> +	case DOMAIN_ATTR_COHERENT_HTW:
> +	{
> +		s32 *int_ptr = (s32 *) data;
> +
> +		*int_ptr = priv->pt.redirect;
> +		break;
> +	}
>  	default:
>  		pr_err("Unsupported attribute type\n");
>  		ret = -EINVAL;
> @@ -1471,8 +1479,16 @@ static int msm_domain_set_attr(struct iommu_domain
> *domain,
>  			       enum iommu_attr attr, void *data)  {
>  	s32 ret = 0;
> +	struct msm_iommu_priv *priv = domain->priv;
> 
>  	switch (attr) {
> +	case DOMAIN_ATTR_COHERENT_HTW:
> +	{
> +		s32 *int_ptr = (s32 *) data;
> +
> +		priv->pt.redirect = *int_ptr;
> +		break;
> +	}
>  	default:
>  		pr_err("Unsupported attribute type\n");
>  		ret = -EINVAL;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> 63dca6d..6d9596d 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -81,6 +81,7 @@ enum iommu_attr {
>  	DOMAIN_ATTR_FSL_PAMU_STASH,
>  	DOMAIN_ATTR_FSL_PAMU_ENABLE,
>  	DOMAIN_ATTR_FSL_PAMUV1,
> +	DOMAIN_ATTR_COHERENT_HTW,
[Sethi Varun-B16395] Would it make sense to represent this as DOMAIN_ATTR_SMMU_COHERENT_HTW? I believe this is specific to SMMU.

-Varun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
  2014-06-30 19:42   ` Thierry Reding
@ 2014-07-01  9:33   ` Will Deacon
  2014-07-01  9:58     ` Varun Sethi
  2014-07-04  4:29   ` Hiroshi Doyu
  2014-07-11 10:20   ` Joerg Roedel
  3 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2014-07-01  9:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Olav,

On Mon, Jun 30, 2014 at 05:51:51PM +0100, Olav Haugan wrote:
> Mapping and unmapping are more often than not in the critical path.
> map_range and unmap_range allows SMMU driver implementations to optimize
> the process of mapping and unmapping buffers into the SMMU page tables.
> Instead of mapping one physical address, do TLB operation (expensive),
> mapping, do TLB operation, mapping, do TLB operation the driver can map
> a scatter-gatherlist of physically contiguous pages into one virtual
> address space and then at the end do one TLB operation.
> 
> Additionally, the mapping operation would be faster in general since
> clients does not have to keep calling map API over and over again for
> each physically contiguous chunk of memory that needs to be mapped to a
> virtually contiguous region.

I like the idea of this, although it does mean that drivers implementing the
range mapping functions need more featureful page-table manipulation code
than currently required.

For example, iommu_map uses iommu_pgsize to guarantee that mappings are
created in blocks of the largest supported page size. This can be used to
simplify iterating in the SMMU driver (although the ARM SMMU driver doesn't
yet make use of this, I think Varun would add this when he adds support for
sections).

Given that we're really trying to kill the TLBI here, why not implement
something like iommu_unmap_nosync (unmap without DSB; TLBI) and iommu_sync
(DSB; TLBI) instead? If we guarantee that ranges must be unmapped before
being remapped, then there shouldn't be a TLBI on the map path anyway.
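
(For concreteness, such a split might look roughly like the prototypes
below; these are purely hypothetical and just follow the names suggested
above:)

	size_t iommu_unmap_nosync(struct iommu_domain *domain,
				  unsigned long iova, size_t size);
	void iommu_sync(struct iommu_domain *domain);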

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros
  2014-06-30 16:51 ` [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros Olav Haugan
  2014-06-30 19:46   ` Thierry Reding
@ 2014-07-01  9:40   ` Will Deacon
  1 sibling, 0 replies; 29+ messages in thread
From: Will Deacon @ 2014-07-01  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Matt,

On Mon, Jun 30, 2014 at 05:51:52PM +0100, Olav Haugan wrote:
> From: Matt Wagantall <mattw@codeaurora.org>
> 
> It is sometimes necessary to poll a memory-mapped register until its
> value satisfies some condition. Introduce a family of convenience macros
> that do this. Tight-loop and sleeping versions are provided with and
> without timeouts.

We could certainly use something like this in the SMMU and GICv3 drivers, so
I agree that it makes sense for this to be in generic code.
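
(For illustration, a caller in one of those drivers would end up with
something roughly like this; the register and bit names are made up:)

	u32 val;
	int ret;

	/* Poll STATUS until IDLE is set, sleeping up to 10us between reads
	 * and giving up after 1000us. */
	ret = readl_poll_timeout(base + REG_STATUS, val, val & STATUS_IDLE,
				 10, 1000);
	if (ret)
		dev_err(dev, "timed out waiting for idle\n");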

> +/**
> + * readl_poll_timeout - Periodically poll an address until a condition is met or a timeout occurs
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + * @sleep_us: Maximum time to sleep between reads in uS (0 tight-loops)
> + * @timeout_us: Timeout in uS, 0 means never timeout

I think 0 should actually mean `use the default timeout', which could be
something daft like 1s. Removing the timeout is asking for kernel lock-ups.
We could also have a version without the timeout parameter at all, which
acts like a timeout of 0.

> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> + * case, the last read value at @addr is stored in @val. Must not
> + * be called from atomic context if sleep_us or timeout_us are used.
> + */
> +#define readl_poll_timeout(addr, val, cond, sleep_us, timeout_us) \
> +({ \
> +	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
> +	might_sleep_if(timeout_us); \
> +	for (;;) { \
> +		(val) = readl(addr); \
> +		if (cond) \
> +			break; \
> +		if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
> +			(val) = readl(addr); \
> +			break; \
> +		} \
> +		if (sleep_us) \
> +			usleep_range(DIV_ROUND_UP(sleep_us, 4), sleep_us); \
> +	} \
> +	(cond) ? 0 : -ETIMEDOUT; \
> +})
> +
> +/**
> + * readl_poll_timeout_noirq - Periodically poll an address until a condition is met or a timeout occurs
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + * @max_reads: Maximum number of reads before giving up

I don't think max_reads is a useful tunable.

> + * @time_between_us: Time to udelay() between successive reads
> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout.
> + */
> +#define readl_poll_timeout_noirq(addr, val, cond, max_reads, time_between_us) \

Maybe readl_poll_[timeout_]atomic is a better name?

> +({ \
> +	int count; \
> +	for (count = (max_reads); count > 0; count--) { \
> +		(val) = readl(addr); \
> +		if (cond) \
> +			break; \
> +		udelay(time_between_us); \
> +	} \
> +	(cond) ? 0 : -ETIMEDOUT; \
> +})
> +
> +/**
> + * readl_poll - Periodically poll an address until a condition is met
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + * @sleep_us: Maximum time to sleep between reads in uS (0 tight-loops)
> + *
> + * Must not be called from atomic context if sleep_us is used.
> + */
> +#define readl_poll(addr, val, cond, sleep_us) \
> +	readl_poll_timeout(addr, val, cond, sleep_us, 0)

Good idea ;)

> +/**
> + * readl_tight_poll_timeout - Tight-loop on an address until a condition is met or a timeout occurs
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + * @timeout_us: Timeout in uS, 0 means never timeout
> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> + * case, the last read value at @addr is stored in @val. Must not
> + * be called from atomic context if timeout_us is used.
> + */
> +#define readl_tight_poll_timeout(addr, val, cond, timeout_us) \
> +	readl_poll_timeout(addr, val, cond, 0, timeout_us)
> +
> +/**
> + * readl_tight_poll - Tight-loop on an address until a condition is met
> + * @addr: Address to poll
> + * @val: Variable to read the value into
> + * @cond: Break condition (usually involving @val)
> + *
> + * May be called from atomic context.
> + */
> +#define readl_tight_poll(addr, val, cond) \
> +	readl_poll_timeout(addr, val, cond, 0, 0)

This would be readl_poll_timeout_atomic if you went with my suggestion (i.e.
readl_poll_timeout would have an unconditional might_sleep).

What do you reckon?

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-01  9:33   ` Will Deacon
@ 2014-07-01  9:58     ` Varun Sethi
  0 siblings, 0 replies; 29+ messages in thread
From: Varun Sethi @ 2014-07-01  9:58 UTC (permalink / raw)
  To: linux-arm-kernel



> -----Original Message-----
> From: iommu-bounces at lists.linux-foundation.org [mailto:iommu-
> bounces at lists.linux-foundation.org] On Behalf Of Will Deacon
> Sent: Tuesday, July 01, 2014 3:04 PM
> To: Olav Haugan
> Cc: linux-arm-msm at vger.kernel.org; iommu at lists.linux-foundation.org;
> thierry.reding at gmail.com; vgandhi at codeaurora.org; linux-arm-
> kernel at lists.infradead.org
> Subject: Re: [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range
> functions
> 
> Hi Olav,
> 
> On Mon, Jun 30, 2014 at 05:51:51PM +0100, Olav Haugan wrote:
> > Mapping and unmapping are more often than not in the critical path.
> > map_range and unmap_range allows SMMU driver implementations to
> > optimize the process of mapping and unmapping buffers into the SMMU
> page tables.
> > Instead of mapping one physical address, do TLB operation (expensive),
> > mapping, do TLB operation, mapping, do TLB operation the driver can
> > map a scatter-gatherlist of physically contiguous pages into one
> > virtual address space and then at the end do one TLB operation.
> >
> > Additionally, the mapping operation would be faster in general since
> > clients does not have to keep calling map API over and over again for
> > each physically contiguous chunk of memory that needs to be mapped to
> > a virtually contiguous region.
> 
> I like the idea of this, although it does mean that drivers implementing
> the range mapping functions need more featureful page-table manipulation
> code than currently required.
> 
> For example, iommu_map uses iommu_pgsize to guarantee that mappings are
> created in blocks of the largest support page size. This can be used to
> simplify iterating in the SMMU driver (although the ARM SMMU driver
> doesn't yet make use of this, I think Varun would add this when he adds
> support for sections).
Yes, this would be supported.

-Varun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW
  2014-07-01  8:49   ` Varun Sethi
@ 2014-07-02 22:11     ` Olav Haugan
  2014-07-03 17:43       ` Will Deacon
  0 siblings, 1 reply; 29+ messages in thread
From: Olav Haugan @ 2014-07-02 22:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/1/2014 1:49 AM, Varun Sethi wrote:
> 
> 
>> -----Original Message-----
>> From: iommu-bounces at lists.linux-foundation.org [mailto:iommu-
>> bounces at lists.linux-foundation.org] On Behalf Of Olav Haugan
>> Sent: Monday, June 30, 2014 10:22 PM
>> To: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
>> foundation.org
>> Cc: linux-arm-msm at vger.kernel.org; will.deacon at arm.com;
>> thierry.reding at gmail.com; vgandhi at codeaurora.org
>> Subject: [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable
>> coherent HTW
>>
>> Add a new iommu domain attribute that can be used to enable cache
>> coherent hardware table walks (HTW) by the SMMU. HTW might be supported
>> by the SMMU HW but depending on the use case and the usage of the SMMU in
>> the SoC it might not be always beneficial to always turn on coherent HTW
>> for all domains/iommu's.
>>
> [Sethi Varun-B16395] Why won't you want to use the coherent table walk feature?

Very good question. We have found that turning on IOMMU coherent HTW is
not always beneficial to performance (performance is either the same or
slightly worse in some cases). Even if the performance is the same, we
would like to avoid using precious L2 cache for no benefit to the IOMMU.
Although our HW supports this feature, we don't always want to turn it
on for a given use case/domain (bus master).

>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>> ---
>>  drivers/iommu/msm_iommu-v1.c | 16 ++++++++++++++++
>>  include/linux/iommu.h        |  1 +
>>  2 files changed, 17 insertions(+)
>>
>> diff --git a/drivers/iommu/msm_iommu-v1.c b/drivers/iommu/msm_iommu-v1.c
>> index 2c574ef..e163ffc 100644
>> --- a/drivers/iommu/msm_iommu-v1.c
>> +++ b/drivers/iommu/msm_iommu-v1.c
>> @@ -1456,8 +1456,16 @@ static int msm_domain_get_attr(struct iommu_domain
>> *domain,
>>  			       enum iommu_attr attr, void *data)  {
>>  	s32 ret = 0;
>> +	struct msm_iommu_priv *priv = domain->priv;
>>
>>  	switch (attr) {
>> +	case DOMAIN_ATTR_COHERENT_HTW:
>> +	{
>> +		s32 *int_ptr = (s32 *) data;
>> +
>> +		*int_ptr = priv->pt.redirect;
>> +		break;
>> +	}
>>  	default:
>>  		pr_err("Unsupported attribute type\n");
>>  		ret = -EINVAL;
>> @@ -1471,8 +1479,16 @@ static int msm_domain_set_attr(struct iommu_domain
>> *domain,
>>  			       enum iommu_attr attr, void *data)  {
>>  	s32 ret = 0;
>> +	struct msm_iommu_priv *priv = domain->priv;
>>
>>  	switch (attr) {
>> +	case DOMAIN_ATTR_COHERENT_HTW:
>> +	{
>> +		s32 *int_ptr = (s32 *) data;
>> +
>> +		priv->pt.redirect = *int_ptr;
>> +		break;
>> +	}
>>  	default:
>>  		pr_err("Unsupported attribute type\n");
>>  		ret = -EINVAL;
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
>> 63dca6d..6d9596d 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -81,6 +81,7 @@ enum iommu_attr {
>>  	DOMAIN_ATTR_FSL_PAMU_STASH,
>>  	DOMAIN_ATTR_FSL_PAMU_ENABLE,
>>  	DOMAIN_ATTR_FSL_PAMUV1,
>> +	DOMAIN_ATTR_COHERENT_HTW,
> [Sethi Varun-B16395] Would it make sense to represent this as DOMAIN_ATTR_SMMU_COHERENT_HTW? I believe this is specific to SMMU.

Yes, it does.

Thanks,

Olav Haugan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 4/7] iommu: msm: Add MSM IOMMUv1 driver
  2014-06-30 17:02   ` [RFC/PATCH 4/7] iommu: msm: Add MSM IOMMUv1 driver Will Deacon
@ 2014-07-02 22:32     ` Olav Haugan
  0 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-07-02 22:32 UTC (permalink / raw)
  To: linux-arm-kernel

On 6/30/2014 10:02 AM, Will Deacon wrote:
> Hi Olav,
> 
> On Mon, Jun 30, 2014 at 05:51:53PM +0100, Olav Haugan wrote:
>> MSM IOMMUv1 driver supports Qualcomm SoC MSM8974 and
>> MSM8084.
>>
>> The IOMMU driver supports the following features:
>>
>>     - ARM V7S page table format independent of ARM CPU page table format
>>     - 4K/64K/1M/16M mappings (V7S)
>>     - ATOS used for unit testing of driver
>>     - Sharing of page tables among SMMUs
>>     - Verbose context bank fault reporting
>>     - Verbose global fault reporting
>>     - Support for clocks and GDSC
>>     - map/unmap range
>>     - Domain specific enabling of coherent Hardware Table Walk (HTW)
>>
>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>> ---
>>  .../devicetree/bindings/iommu/msm,iommu_v1.txt     |   56 +
>>  drivers/iommu/Kconfig                              |   36 +
>>  drivers/iommu/Makefile                             |    2 +
>>  drivers/iommu/msm_iommu-v1.c                       | 1448 +++++++++++++
>>  drivers/iommu/msm_iommu.c                          |  149 ++
>>  drivers/iommu/msm_iommu_dev-v1.c                   |  340 +++
>>  drivers/iommu/msm_iommu_hw-v1.h                    | 2236 ++++++++++++++++++++
>>  drivers/iommu/msm_iommu_pagetable.c                |  600 ++++++
>>  drivers/iommu/msm_iommu_pagetable.h                |   33 +
>>  drivers/iommu/msm_iommu_priv.h                     |   55 +
>>  include/linux/qcom_iommu.h                         |  221 ++
>>  11 files changed, 5176 insertions(+)
> 
> This patch is *huge*! It may get bounced from some lists (I think the
> linux-arm-kernel lists has a ~100k limit), so it might be worth trying to do
> this incrementally.

Yes, I noticed. Sorry about that.

> That said, a quick glance at your code indicates that this IOMMU is
> compliant with the ARM SMMU architecture, and we already have a driver for
> that. Please can you rework this series to build on top of the code in
> mainline already, rather than simply duplicating it? We need fewer IOMMU
> drivers, not more!

Ok, I will rework.

Thanks,

Olav Haugan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW
  2014-07-02 22:11     ` Olav Haugan
@ 2014-07-03 17:43       ` Will Deacon
  2014-07-08 22:24         ` Olav Haugan
  0 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2014-07-03 17:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 02, 2014 at 11:11:13PM +0100, Olav Haugan wrote:
> On 7/1/2014 1:49 AM, Varun Sethi wrote:
> > 
> > 
> >> -----Original Message-----
> >> From: iommu-bounces at lists.linux-foundation.org [mailto:iommu-
> >> bounces at lists.linux-foundation.org] On Behalf Of Olav Haugan
> >> Sent: Monday, June 30, 2014 10:22 PM
> >> To: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
> >> foundation.org
> >> Cc: linux-arm-msm at vger.kernel.org; will.deacon at arm.com;
> >> thierry.reding at gmail.com; vgandhi at codeaurora.org
> >> Subject: [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable
> >> coherent HTW
> >>
> >> Add a new iommu domain attribute that can be used to enable cache
> >> coherent hardware table walks (HTW) by the SMMU. HTW might be supported
> >> by the SMMU HW but depending on the use case and the usage of the SMMU in
> >> the SoC it might not be always beneficial to always turn on coherent HTW
> >> for all domains/iommu's.
> >>
> > [Sethi Varun-B16395] Why won't you want to use the coherent table walk feature?
> 
> Very good question. We have found that turning on IOMMU coherent HTW is
> not always beneficial to performance (performance either the same or
> slightly worse in some cases). Even if the perf. is the same we would
> like to avoid using precious L2 cache for no benefit to the IOMMU.
> Although our HW supports this feature we don't always want to turn this
> on for a specific use case/domain (bus master).

Could we at least invert the feature flag, please? i.e. you set an attribute
to *disable* coherent walks? I'd also be interested to see some performance
numbers, as the added cacheflushing overhead from non-coherent walks is
going to be non-trivial.

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
  2014-06-30 19:42   ` Thierry Reding
  2014-07-01  9:33   ` Will Deacon
@ 2014-07-04  4:29   ` Hiroshi Doyu
  2014-07-08 21:53     ` Olav Haugan
  2014-07-11 10:20   ` Joerg Roedel
  3 siblings, 1 reply; 29+ messages in thread
From: Hiroshi Doyu @ 2014-07-04  4:29 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Olav,

Olav Haugan <ohaugan@codeaurora.org> writes:

> Mapping and unmapping are more often than not in the critical path.
> map_range and unmap_range allows SMMU driver implementations to optimize
> the process of mapping and unmapping buffers into the SMMU page tables.
> Instead of mapping one physical address, do TLB operation (expensive),
> mapping, do TLB operation, mapping, do TLB operation the driver can map
> a scatter-gatherlist of physically contiguous pages into one virtual
> address space and then at the end do one TLB operation.
>
> Additionally, the mapping operation would be faster in general since
> clients does not have to keep calling map API over and over again for
> each physically contiguous chunk of memory that needs to be mapped to a
> virtually contiguous region.
>
> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
> ---
>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>  2 files changed, 48 insertions(+)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index e5555fc..f2a6b80 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>  EXPORT_SYMBOL_GPL(iommu_unmap);
>  
>  
> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
> +		    struct scatterlist *sg, unsigned int len, int prot)
> +{
> +	if (unlikely(domain->ops->map_range == NULL))
> +		return -ENODEV;
> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->map_range(domain, iova, sg, len, prot);
> +}
> +EXPORT_SYMBOL_GPL(iommu_map_range);

We have a similar one internally, named "iommu_map_sg()", which is
called from the DMA API.

> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
> +		      unsigned int len)
> +{
> +	if (unlikely(domain->ops->unmap_range == NULL))
> +		return -ENODEV;
> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->unmap_range(domain, iova, len);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unmap_range);

Can the existing iommu_unmap() do the same?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-04  4:29   ` Hiroshi Doyu
@ 2014-07-08 21:53     ` Olav Haugan
  2014-07-08 23:49       ` Rob Clark
  0 siblings, 1 reply; 29+ messages in thread
From: Olav Haugan @ 2014-07-08 21:53 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Hiroshi,

On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
> Hi Olav,
> 
> Olav Haugan <ohaugan@codeaurora.org> writes:
> 
>> Mapping and unmapping are more often than not in the critical path.
>> map_range and unmap_range allows SMMU driver implementations to optimize
>> the process of mapping and unmapping buffers into the SMMU page tables.
>> Instead of mapping one physical address, do TLB operation (expensive),
>> mapping, do TLB operation, mapping, do TLB operation the driver can map
>> a scatter-gatherlist of physically contiguous pages into one virtual
>> address space and then at the end do one TLB operation.
>>
>> Additionally, the mapping operation would be faster in general since
>> clients does not have to keep calling map API over and over again for
>> each physically contiguous chunk of memory that needs to be mapped to a
>> virtually contiguous region.
>>
>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>> ---
>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>  2 files changed, 48 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index e5555fc..f2a6b80 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>  
>>  
>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>> +		    struct scatterlist *sg, unsigned int len, int prot)
>> +{
>> +	if (unlikely(domain->ops->map_range == NULL))
>> +		return -ENODEV;
>> +
>> +	BUG_ON(iova & (~PAGE_MASK));
>> +
>> +	return domain->ops->map_range(domain, iova, sg, len, prot);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_map_range);
> 
> We have the similar one internally, which is named, "iommu_map_sg()",
> called from DMA API.

Great, so this new API will be useful to more people!

>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>> +		      unsigned int len)
>> +{
>> +	if (unlikely(domain->ops->unmap_range == NULL))
>> +		return -ENODEV;
>> +
>> +	BUG_ON(iova & (~PAGE_MASK));
>> +
>> +	return domain->ops->unmap_range(domain, iova, len);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
> 
> Can the existing iommu_unmap() do the same?

I believe iommu_unmap() behaves a bit differently because it will keep
on calling domain->ops->unmap() until everything is unmapped instead of
letting the iommu implementation take care of unmapping everything in
one call.
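
(Roughly, the existing iommu_unmap() loop looks like the simplified
sketch below, so any TLB maintenance policy is left to each individual
ops->unmap() call:)

	while (unmapped < size) {
		size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
		size_t len = domain->ops->unmap(domain, iova, pgsize);

		if (!len)
			break;

		iova += len;
		unmapped += len;
	}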

I am abandoning the patch series since our driver was not accepted.
However, if there are no objections I will resubmit this patch (PATCH
2/7) as an independent patch to add this new map_range API.

Thanks,

Olav Haugan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW
  2014-07-03 17:43       ` Will Deacon
@ 2014-07-08 22:24         ` Olav Haugan
  0 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-07-08 22:24 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/3/2014 10:43 AM, Will Deacon wrote:
> On Wed, Jul 02, 2014 at 11:11:13PM +0100, Olav Haugan wrote:
>> On 7/1/2014 1:49 AM, Varun Sethi wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: iommu-bounces at lists.linux-foundation.org [mailto:iommu-
>>>> bounces at lists.linux-foundation.org] On Behalf Of Olav Haugan
>>>> Sent: Monday, June 30, 2014 10:22 PM
>>>> To: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
>>>> foundation.org
>>>> Cc: linux-arm-msm at vger.kernel.org; will.deacon at arm.com;
>>>> thierry.reding at gmail.com; vgandhi at codeaurora.org
>>>> Subject: [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable
>>>> coherent HTW
>>>>
>>>> Add a new iommu domain attribute that can be used to enable cache
>>>> coherent hardware table walks (HTW) by the SMMU. HTW might be supported
>>>> by the SMMU HW but depending on the use case and the usage of the SMMU in
>>>> the SoC it might not be always beneficial to always turn on coherent HTW
>>>> for all domains/iommu's.
>>>>
>>> [Sethi Varun-B16395] Why won't you want to use the coherent table walk feature?
>>
>> Very good question. We have found that turning on IOMMU coherent HTW is
>> not always beneficial to performance (performance either the same or
>> slightly worse in some cases). Even if the perf. is the same we would
>> like to avoid using precious L2 cache for no benefit to the IOMMU.
>> Although our HW supports this feature we don't always want to turn this
>> on for a specific use case/domain (bus master).
> 
> Could we at least invert the feature flag, please? i.e. you set an attribute
> to *disable* coherent walks? I'd also be interested to see some performance
> numbers, as the added cacheflushing overhead from non-coherent walks is
> going to be non-trivial.
> 

Yes, I agree that we can do the inverse. On one SoC I saw about a 5%
degradation in performance with coherent table walk enabled for a
specific bus master. However, we have also seen improved performance
with other SMMUs/bus masters. It just depends on the SMMU/bus master and
how it is being used. Hence the need to be able to disable this on a
per-domain basis.

Thanks,

Olav Haugan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-08 21:53     ` Olav Haugan
@ 2014-07-08 23:49       ` Rob Clark
  2014-07-10  0:03         ` Olav Haugan
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Clark @ 2014-07-08 23:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
> Hi Hiroshi,
>
> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>> Hi Olav,
>>
>> Olav Haugan <ohaugan@codeaurora.org> writes:
>>
>>> Mapping and unmapping are more often than not in the critical path.
>>> map_range and unmap_range allows SMMU driver implementations to optimize
>>> the process of mapping and unmapping buffers into the SMMU page tables.
>>> Instead of mapping one physical address, do TLB operation (expensive),
>>> mapping, do TLB operation, mapping, do TLB operation the driver can map
>>> a scatter-gatherlist of physically contiguous pages into one virtual
>>> address space and then at the end do one TLB operation.
>>>
>>> Additionally, the mapping operation would be faster in general since
>>> clients does not have to keep calling map API over and over again for
>>> each physically contiguous chunk of memory that needs to be mapped to a
>>> virtually contiguous region.
>>>
>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>>> ---
>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>>  2 files changed, 48 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index e5555fc..f2a6b80 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>>
>>>
>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>>> +                struct scatterlist *sg, unsigned int len, int prot)
>>> +{
>>> +    if (unlikely(domain->ops->map_range == NULL))
>>> +            return -ENODEV;
>>> +
>>> +    BUG_ON(iova & (~PAGE_MASK));
>>> +
>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>>
>> We have the similar one internally, which is named, "iommu_map_sg()",
>> called from DMA API.
>
> Great, so this new API will be useful to more people!
>
>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>>> +                  unsigned int len)
>>> +{
>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>>> +            return -ENODEV;
>>> +
>>> +    BUG_ON(iova & (~PAGE_MASK));
>>> +
>>> +    return domain->ops->unmap_range(domain, iova, len);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>>
>> Can the existing iommu_unmap() do the same?
>
> I believe iommu_unmap() behaves a bit differently because it will keep
> on calling domain->ops->unmap() until everything is unmapped instead of
> letting the iommu implementation take care of unmapping everything in
> one call.
>
> I am abandoning the patch series since our driver was not accepted.
> However, if there are no objections I will resubmit this patch (PATCH
> 2/7) as an independent patch to add this new map_range API.

+1 for map_range().. I've seen that for gpu workloads, at least, the
downstream map_range() API is quite beneficial.  It was worth at
least a few fps in xonotic.

And, possibly getting off the subject a bit, but I was wondering about
the possibility of going one step further and batching up mapping
and/or unmapping multiple buffers (ranges) at once.  I have a pretty
convenient sync point in drm/msm to flush out multiple mappings before
kicking gpu.

BR,
-R

> Thanks,
>
> Olav Haugan
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-08 23:49       ` Rob Clark
@ 2014-07-10  0:03         ` Olav Haugan
  2014-07-10  0:40           ` Rob Clark
  0 siblings, 1 reply; 29+ messages in thread
From: Olav Haugan @ 2014-07-10  0:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/8/2014 4:49 PM, Rob Clark wrote:
> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>> Hi Hiroshi,
>>
>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>>> Hi Olav,
>>>
>>> Olav Haugan <ohaugan@codeaurora.org> writes:
>>>
>>>> Mapping and unmapping are more often than not in the critical path.
>>>> map_range and unmap_range allows SMMU driver implementations to optimize
>>>> the process of mapping and unmapping buffers into the SMMU page tables.
>>>> Instead of mapping one physical address, do TLB operation (expensive),
>>>> mapping, do TLB operation, mapping, do TLB operation the driver can map
>>>> a scatter-gatherlist of physically contiguous pages into one virtual
>>>> address space and then at the end do one TLB operation.
>>>>
>>>> Additionally, the mapping operation would be faster in general since
>>>> clients does not have to keep calling map API over and over again for
>>>> each physically contiguous chunk of memory that needs to be mapped to a
>>>> virtually contiguous region.
>>>>
>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>>>> ---
>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>>>  2 files changed, 48 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index e5555fc..f2a6b80 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>>>
>>>>
>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>>>> +                struct scatterlist *sg, unsigned int len, int prot)
>>>> +{
>>>> +    if (unlikely(domain->ops->map_range == NULL))
>>>> +            return -ENODEV;
>>>> +
>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>> +
>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>>>
>>> We have the similar one internally, which is named, "iommu_map_sg()",
>>> called from DMA API.
>>
>> Great, so this new API will be useful to more people!
>>
>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>>>> +                  unsigned int len)
>>>> +{
>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>>>> +            return -ENODEV;
>>>> +
>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>> +
>>>> +    return domain->ops->unmap_range(domain, iova, len);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>>>
>>> Can the existing iommu_unmap() do the same?
>>
>> I believe iommu_unmap() behaves a bit differently because it will keep
>> on calling domain->ops->unmap() until everything is unmapped instead of
>> letting the iommu implementation take care of unmapping everything in
>> one call.
>>
>> I am abandoning the patch series since our driver was not accepted.
>> However, if there are no objections I will resubmit this patch (PATCH
>> 2/7) as an independent patch to add this new map_range API.
> 
> +1 for map_range().. I've seen for gpu workloads, at least, it is the
> downstream map_range() API is quite beneficial.   It was worth at
> least a few fps in xonotic.
> 
> And, possibly getting off the subject a bit, but I was wondering about
> the possibility of going one step further and batching up mapping
> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
> convenient sync point in drm/msm to flush out multiple mappings before
> kicking gpu.

I think you should be able to do that with this API already - at least
the mapping part since we are passing in a sg list (this could be a
chained sglist).
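
E.g. something like this (sketch; the iova, length and prot values are
just placeholders for whatever the caller already has):

	/* map the whole (possibly chained) scatterlist in one call */
	ret = iommu_map_range(domain, iova, sgt->sgl, buf_len,
			      IOMMU_READ | IOMMU_WRITE);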

Thanks,

Olav

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-10  0:03         ` Olav Haugan
@ 2014-07-10  0:40           ` Rob Clark
  2014-07-10  7:10             ` Thierry Reding
  2014-07-10 22:43             ` Olav Haugan
  0 siblings, 2 replies; 29+ messages in thread
From: Rob Clark @ 2014-07-10  0:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 9, 2014 at 8:03 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
> On 7/8/2014 4:49 PM, Rob Clark wrote:
>> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>>> Hi Hiroshi,
>>>
>>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>>>> Hi Olav,
>>>>
>>>> Olav Haugan <ohaugan@codeaurora.org> writes:
>>>>
>>>>> Mapping and unmapping are more often than not in the critical path.
>>>>> map_range and unmap_range allows SMMU driver implementations to optimize
>>>>> the process of mapping and unmapping buffers into the SMMU page tables.
>>>>> Instead of mapping one physical address, do TLB operation (expensive),
>>>>> mapping, do TLB operation, mapping, do TLB operation the driver can map
>>>>> a scatter-gatherlist of physically contiguous pages into one virtual
>>>>> address space and then at the end do one TLB operation.
>>>>>
>>>>> Additionally, the mapping operation would be faster in general since
>>>>> clients does not have to keep calling map API over and over again for
>>>>> each physically contiguous chunk of memory that needs to be mapped to a
>>>>> virtually contiguous region.
>>>>>
>>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>>>>> ---
>>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>>>>  2 files changed, 48 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>>> index e5555fc..f2a6b80 100644
>>>>> --- a/drivers/iommu/iommu.c
>>>>> +++ b/drivers/iommu/iommu.c
>>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>>>>
>>>>>
>>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>>>>> +                struct scatterlist *sg, unsigned int len, int prot)
>>>>> +{
>>>>> +    if (unlikely(domain->ops->map_range == NULL))
>>>>> +            return -ENODEV;
>>>>> +
>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>> +
>>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>>>>
>>>> We have the similar one internally, which is named, "iommu_map_sg()",
>>>> called from DMA API.
>>>
>>> Great, so this new API will be useful to more people!
>>>
>>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>>>>> +                  unsigned int len)
>>>>> +{
>>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>>>>> +            return -ENODEV;
>>>>> +
>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>> +
>>>>> +    return domain->ops->unmap_range(domain, iova, len);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>>>>
>>>> Can the existing iommu_unmap() do the same?
>>>
>>> I believe iommu_unmap() behaves a bit differently because it will keep
>>> on calling domain->ops->unmap() until everything is unmapped instead of
>>> letting the iommu implementation take care of unmapping everything in
>>> one call.
>>>
>>> I am abandoning the patch series since our driver was not accepted.
>>> However, if there are no objections I will resubmit this patch (PATCH
>>> 2/7) as an independent patch to add this new map_range API.
>>
>> +1 for map_range().. I've seen for gpu workloads, at least, it is the
>> downstream map_range() API is quite beneficial.   It was worth at
>> least a few fps in xonotic.
>>
>> And, possibly getting off the subject a bit, but I was wondering about
>> the possibility of going one step further and batching up mapping
>> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
>> convenient sync point in drm/msm to flush out multiple mappings before
>> kicking gpu.
>
> I think you should be able to do that with this API already - at least
> the mapping part since we are passing in a sg list (this could be a
> chained sglist).

What I mean by batching up is mapping and unmapping multiple sglists,
each at different iovas, with minimal cpu cache and iommu tlb flushes..

Ideally we'd let the IOMMU driver be clever and build out all 2nd
level tables before inserting into first level tables (to minimize cpu
cache flushing).. also, there is probably a reasonable chance that
we'd be mapping a new buffer into existing location, so there might be
some potential to reuse existing 2nd level tables (and save a tiny bit
of free/alloc).  I've not thought too much about how that would look
in code.. might be kinda, umm, fun..

But at an API level, we should be able to do a bunch of
map/unmap_range's with one flush.

Maybe it could look like a sequence of iommu_{map,unmap}_range()
followed by iommu_flush()?
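
Something like this, completely hypothetical, just to illustrate the
shape of it:

	iommu_map_range(domain, iova_a, sg_a, len_a, prot);
	iommu_map_range(domain, iova_b, sg_b, len_b, prot);
	iommu_unmap_range(domain, iova_c, len_c);
	iommu_flush(domain);	/* single point of TLB maintenance */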

BR,
-R

> Thanks,
>
> Olav
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-10  0:40           ` Rob Clark
@ 2014-07-10  7:10             ` Thierry Reding
  2014-07-10 11:15               ` Rob Clark
  2014-07-10 22:43             ` Olav Haugan
  1 sibling, 1 reply; 29+ messages in thread
From: Thierry Reding @ 2014-07-10  7:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 09, 2014 at 08:40:21PM -0400, Rob Clark wrote:
> On Wed, Jul 9, 2014 at 8:03 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
> > On 7/8/2014 4:49 PM, Rob Clark wrote:
> >> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
> >>> Hi Hiroshi,
> >>>
> >>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
> >>>> Hi Olav,
> >>>>
> >>>> Olav Haugan <ohaugan@codeaurora.org> writes:
> >>>>
> >>>>> Mapping and unmapping are more often than not in the critical path.
> >>>>> map_range and unmap_range allows SMMU driver implementations to optimize
> >>>>> the process of mapping and unmapping buffers into the SMMU page tables.
> >>>>> Instead of mapping one physical address, do TLB operation (expensive),
> >>>>> mapping, do TLB operation, mapping, do TLB operation the driver can map
> >>>>> a scatter-gatherlist of physically contiguous pages into one virtual
> >>>>> address space and then at the end do one TLB operation.
> >>>>>
> >>>>> Additionally, the mapping operation would be faster in general since
> >>>>> clients does not have to keep calling map API over and over again for
> >>>>> each physically contiguous chunk of memory that needs to be mapped to a
> >>>>> virtually contiguous region.
> >>>>>
> >>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
> >>>>> ---
> >>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
> >>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
> >>>>>  2 files changed, 48 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>>>> index e5555fc..f2a6b80 100644
> >>>>> --- a/drivers/iommu/iommu.c
> >>>>> +++ b/drivers/iommu/iommu.c
> >>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
> >>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
> >>>>>
> >>>>>
> >>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
> >>>>> +                struct scatterlist *sg, unsigned int len, int prot)
> >>>>> +{
> >>>>> +    if (unlikely(domain->ops->map_range == NULL))
> >>>>> +            return -ENODEV;
> >>>>> +
> >>>>> +    BUG_ON(iova & (~PAGE_MASK));
> >>>>> +
> >>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
> >>>>> +}
> >>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
> >>>>
> >>>> We have the similar one internally, which is named, "iommu_map_sg()",
> >>>> called from DMA API.
> >>>
> >>> Great, so this new API will be useful to more people!
> >>>
> >>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
> >>>>> +                  unsigned int len)
> >>>>> +{
> >>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
> >>>>> +            return -ENODEV;
> >>>>> +
> >>>>> +    BUG_ON(iova & (~PAGE_MASK));
> >>>>> +
> >>>>> +    return domain->ops->unmap_range(domain, iova, len);
> >>>>> +}
> >>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
> >>>>
> >>>> Can the existing iommu_unmap() do the same?
> >>>
> >>> I believe iommu_unmap() behaves a bit differently because it will keep
> >>> on calling domain->ops->unmap() until everything is unmapped instead of
> >>> letting the iommu implementation take care of unmapping everything in
> >>> one call.
> >>>
> >>> I am abandoning the patch series since our driver was not accepted.
> >>> However, if there are no objections I will resubmit this patch (PATCH
> >>> 2/7) as an independent patch to add this new map_range API.
> >>
> >> +1 for map_range().. I've seen for gpu workloads, at least, it is the
> >> downstream map_range() API is quite beneficial.   It was worth at
> >> least a few fps in xonotic.
> >>
> >> And, possibly getting off the subject a bit, but I was wondering about
> >> the possibility of going one step further and batching up mapping
> >> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
> >> convenient sync point in drm/msm to flush out multiple mappings before
> >> kicking gpu.
> >
> > I think you should be able to do that with this API already - at least
> > the mapping part since we are passing in a sg list (this could be a
> > chained sglist).
> 
> What I mean by batching up is mapping and unmapping multiple sglists
> each at different iova's with minmal cpu cache and iommu tlb flushes..
> 
> Ideally we'd let the IOMMU driver be clever and build out all 2nd
> level tables before inserting into first level tables (to minimize cpu
> cache flushing).. also, there is probably a reasonable chance that
> we'd be mapping a new buffer into existing location, so there might be
> some potential to reuse existing 2nd level tables (and save a tiny bit
> of free/alloc).  I've not thought too much about how that would look
> in code.. might be kinda, umm, fun..
> 
> But at an API level, we should be able to do a bunch of
> map/unmap_range's with one flush.
> 
> Maybe it could look like a sequence of iommu_{map,unmap}_range()
> followed by iommu_flush()?

Doesn't that mean that the IOMMU driver would have to keep track of all
mappings until it sees an iommu_flush()? That sounds like it could be a
lot of work and complicated code.

Thierry
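
As an aside, a minimal usage sketch of the iommu_map_range()/iommu_unmap_range()
API being proposed in this patch, from a client driver's point of view.  Only
the two iommu_*_range() calls come from the patch; example_map(), the page
array and the iova value are made up purely for illustration:

#include <linux/iommu.h>
#include <linux/scatterlist.h>

static int example_map(struct iommu_domain *domain, struct page **pages,
		       int npages, unsigned int iova)
{
	struct sg_table sgt;
	unsigned int len = npages * PAGE_SIZE;
	int ret;

	ret = sg_alloc_table_from_pages(&sgt, pages, npages, 0, len,
					GFP_KERNEL);
	if (ret)
		return ret;

	/* One call for the whole buffer, one TLB operation in the driver. */
	ret = iommu_map_range(domain, iova, sgt.sgl, len,
			      IOMMU_READ | IOMMU_WRITE);
	if (ret)
		goto out;

	/* ... device works on the mapping ... */

	ret = iommu_unmap_range(domain, iova, len);
out:
	sg_free_table(&sgt);
	return ret;
}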

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-10  7:10             ` Thierry Reding
@ 2014-07-10 11:15               ` Rob Clark
  0 siblings, 0 replies; 29+ messages in thread
From: Rob Clark @ 2014-07-10 11:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 10, 2014 at 3:10 AM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Wed, Jul 09, 2014 at 08:40:21PM -0400, Rob Clark wrote:
>> On Wed, Jul 9, 2014 at 8:03 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>> > On 7/8/2014 4:49 PM, Rob Clark wrote:
>> >> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>> >>> Hi Hiroshi,
>> >>>
>> >>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>> >>>> Hi Olav,
>> >>>>
>> >>>> Olav Haugan <ohaugan@codeaurora.org> writes:
>> >>>>
>> >>>>> Mapping and unmapping are more often than not in the critical path.
>> >>>>> map_range and unmap_range allow SMMU driver implementations to optimize
>> >>>>> the process of mapping and unmapping buffers into the SMMU page tables.
>> >>>>> Instead of mapping one physical address, doing a TLB operation (expensive),
>> >>>>> mapping the next, doing another TLB operation, and so on, the driver can map
>> >>>>> a scatter-gather list of physically contiguous pages into one virtual
>> >>>>> address space and then do a single TLB operation at the end.
>> >>>>>
>> >>>>> Additionally, the mapping operation would be faster in general since
>> >>>>> clients do not have to keep calling the map API over and over again for
>> >>>>> each physically contiguous chunk of memory that needs to be mapped to a
>> >>>>> virtually contiguous region.
>> >>>>>
>> >>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>> >>>>> ---
>> >>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>> >>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>> >>>>>  2 files changed, 48 insertions(+)
>> >>>>>
>> >>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> >>>>> index e5555fc..f2a6b80 100644
>> >>>>> --- a/drivers/iommu/iommu.c
>> >>>>> +++ b/drivers/iommu/iommu.c
>> >>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>> >>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>> >>>>>
>> >>>>>
>> >>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>> >>>>> +                struct scatterlist *sg, unsigned int len, int prot)
>> >>>>> +{
>> >>>>> +    if (unlikely(domain->ops->map_range == NULL))
>> >>>>> +            return -ENODEV;
>> >>>>> +
>> >>>>> +    BUG_ON(iova & (~PAGE_MASK));
>> >>>>> +
>> >>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>> >>>>> +}
>> >>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>> >>>>
>> >>>> We have a similar one internally, which is named "iommu_map_sg()",
>> >>>> called from the DMA API.
>> >>>
>> >>> Great, so this new API will be useful to more people!
>> >>>
>> >>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>> >>>>> +                  unsigned int len)
>> >>>>> +{
>> >>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>> >>>>> +            return -ENODEV;
>> >>>>> +
>> >>>>> +    BUG_ON(iova & (~PAGE_MASK));
>> >>>>> +
>> >>>>> +    return domain->ops->unmap_range(domain, iova, len);
>> >>>>> +}
>> >>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>> >>>>
>> >>>> Can the existing iommu_unmap() do the same?
>> >>>
>> >>> I believe iommu_unmap() behaves a bit differently because it will keep
>> >>> on calling domain->ops->unmap() until everything is unmapped instead of
>> >>> letting the iommu implementation take care of unmapping everything in
>> >>> one call.
>> >>>
>> >>> I am abandoning the patch series since our driver was not accepted.
>> >>> However, if there are no objections I will resubmit this patch (PATCH
>> >>> 2/7) as an independent patch to add this new map_range API.
>> >>
>> >> +1 for map_range().. I've seen for gpu workloads, at least, that the
>> >> downstream map_range() API is quite beneficial.  It was worth at
>> >> least a few fps in xonotic.
>> >>
>> >> And, possibly getting off the subject a bit, but I was wondering about
>> >> the possibility of going one step further and batching up mapping
>> >> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
>> >> convenient sync point in drm/msm to flush out multiple mappings before
>> >> kicking gpu.
>> >
>> > I think you should be able to do that with this API already - at least
>> > the mapping part since we are passing in a sg list (this could be a
>> > chained sglist).
>>
>> What I mean by batching up is mapping and unmapping multiple sglists
>> each at different iovas with minimal cpu cache and iommu tlb flushes..
>>
>> Ideally we'd let the IOMMU driver be clever and build out all 2nd
>> level tables before inserting into first level tables (to minimize cpu
>> cache flushing).. also, there is probably a reasonable chance that
>> we'd be mapping a new buffer into an existing location, so there might be
>> some potential to reuse existing 2nd level tables (and save a tiny bit
>> of free/alloc).  I've not thought too much about how that would look
>> in code.. might be kinda, umm, fun..
>>
>> But at an API level, we should be able to do a bunch of
>> map/unmap_range's with one flush.
>>
>> Maybe it could look like a sequence of iommu_{map,unmap}_range()
>> followed by iommu_flush()?
>
> Doesn't that mean that the IOMMU driver would have to keep track of all
> mappings until it sees an iommu_flush()? That sounds like it could be a
> lot of work and complicated code.

Well, it depends on how elaborate you want to get.  If you don't want to
be too fancy, it may just be a matter of not doing a TLB flush until
iommu_flush().  If you want to get fancy and minimize cpu flushes too,
then the iommu driver would have to do some more tracking to build up a
transaction internally.  I'm not really sure how you avoid that.

I'm not quite sure how frequent it would be that separate buffers
touch the same 2nd level table, so it might be sufficient to treat it
like N map_range and unmap_range's followed by one TLB flush.  I
would, I think, need to implement a prototype or at least instrument
the iommu driver somehow to generate some statistics.

I've nearly got qcom-iommu-v0 working here on top of upstream + small
set of patches.. but once that is a bit more complete, experimenting
with some of this will be on my TODO list to see what amount of
crazy/complicated brings worthwhile performance benefits.
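
A very rough driver-side sketch of the "not too fancy" variant described
above.  All of the example_* names, the tlb_dirty flag and the flush hook
are hypothetical, not taken from any existing driver; the two helpers are
stand-ins for the real page-table update and hardware invalidate:

struct example_domain {
	bool	tlb_dirty;	/* TLB invalidate pending? */
	/* page table pointers etc. would live here */
};

static int example_map_range(struct iommu_domain *domain, unsigned int iova,
			     struct scatterlist *sg, unsigned int len,
			     int prot)
{
	struct example_domain *priv = domain->priv;
	int ret;

	/* example_update_pagetables() stands in for the real PTE setup. */
	ret = example_update_pagetables(priv, iova, sg, len, prot);
	if (ret)
		return ret;

	/* Defer the TLB invalidate instead of doing one per call. */
	priv->tlb_dirty = true;
	return 0;
}

static void example_flush(struct iommu_domain *domain)
{
	struct example_domain *priv = domain->priv;

	if (priv->tlb_dirty) {
		example_tlb_invalidate_all(priv);  /* stand-in for the hw op */
		priv->tlb_dirty = false;
	}
}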

BR,
-R

> Thierry

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-10  0:40           ` Rob Clark
  2014-07-10  7:10             ` Thierry Reding
@ 2014-07-10 22:43             ` Olav Haugan
  2014-07-10 23:42               ` Rob Clark
  1 sibling, 1 reply; 29+ messages in thread
From: Olav Haugan @ 2014-07-10 22:43 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/9/2014 5:40 PM, Rob Clark wrote:
> On Wed, Jul 9, 2014 at 8:03 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>> On 7/8/2014 4:49 PM, Rob Clark wrote:
>>> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>>>> Hi Hiroshi,
>>>>
>>>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>>>>> Hi Olav,
>>>>>
>>>>> Olav Haugan <ohaugan@codeaurora.org> writes:
>>>>>
>>>>>> Mapping and unmapping are more often than not in the critical path.
>>>>>> map_range and unmap_range allow SMMU driver implementations to optimize
>>>>>> the process of mapping and unmapping buffers into the SMMU page tables.
>>>>>> Instead of mapping one physical address, doing a TLB operation (expensive),
>>>>>> mapping the next, doing another TLB operation, and so on, the driver can map
>>>>>> a scatter-gather list of physically contiguous pages into one virtual
>>>>>> address space and then do a single TLB operation at the end.
>>>>>>
>>>>>> Additionally, the mapping operation would be faster in general since
>>>>>> clients do not have to keep calling the map API over and over again for
>>>>>> each physically contiguous chunk of memory that needs to be mapped to a
>>>>>> virtually contiguous region.
>>>>>>
>>>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>>>>>> ---
>>>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>>>>>  2 files changed, 48 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>>>> index e5555fc..f2a6b80 100644
>>>>>> --- a/drivers/iommu/iommu.c
>>>>>> +++ b/drivers/iommu/iommu.c
>>>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>>>>>
>>>>>>
>>>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>>>>>> +                struct scatterlist *sg, unsigned int len, int prot)
>>>>>> +{
>>>>>> +    if (unlikely(domain->ops->map_range == NULL))
>>>>>> +            return -ENODEV;
>>>>>> +
>>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>>> +
>>>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>>>>>
>>>>> We have a similar one internally, which is named "iommu_map_sg()",
>>>>> called from the DMA API.
>>>>
>>>> Great, so this new API will be useful to more people!
>>>>
>>>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>>>>>> +                  unsigned int len)
>>>>>> +{
>>>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>>>>>> +            return -ENODEV;
>>>>>> +
>>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>>> +
>>>>>> +    return domain->ops->unmap_range(domain, iova, len);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>>>>>
>>>>> Can the existing iommu_unmap() do the same?
>>>>
>>>> I believe iommu_unmap() behaves a bit differently because it will keep
>>>> on calling domain->ops->unmap() until everything is unmapped instead of
>>>> letting the iommu implementation take care of unmapping everything in
>>>> one call.
>>>>
>>>> I am abandoning the patch series since our driver was not accepted.
>>>> However, if there are no objections I will resubmit this patch (PATCH
>>>> 2/7) as an independent patch to add this new map_range API.
>>>
>>> +1 for map_range().. I've seen for gpu workloads, at least, that the
>>> downstream map_range() API is quite beneficial.  It was worth at
>>> least a few fps in xonotic.
>>>
>>> And, possibly getting off the subject a bit, but I was wondering about
>>> the possibility of going one step further and batching up mapping
>>> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
>>> convenient sync point in drm/msm to flush out multiple mappings before
>>> kicking gpu.
>>
>> I think you should be able to do that with this API already - at least
>> the mapping part since we are passing in a sg list (this could be a
>> chained sglist).
> 
> What I mean by batching up is mapping and unmapping multiple sglists
> each at different iovas with minimal cpu cache and iommu tlb flushes..
>
> Ideally we'd let the IOMMU driver be clever and build out all 2nd
> level tables before inserting into first level tables (to minimize cpu
> cache flushing).. also, there is probably a reasonable chance that
> we'd be mapping a new buffer into an existing location, so there might be
> some potential to reuse existing 2nd level tables (and save a tiny bit
> of free/alloc).  I've not thought too much about how that would look
> in code.. might be kinda, umm, fun..
> 
> But at an API level, we should be able to do a bunch of
> map/unmap_range's with one flush.
> 
> Maybe it could look like a sequence of iommu_{map,unmap}_range()
> followed by iommu_flush()?
> 

So we could add another argument ("options") to the range API that
allows you to indicate whether you want to invalidate the TLB or not.
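
Something like the following, perhaps.  The flag name and the extra
parameter are only a sketch of that suggestion, not an agreed interface:

/* Hypothetical option: skip the TLB invalidate in this call. */
#define IOMMU_RANGE_OPT_NO_TLB_FLUSH	(1 << 0)

int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
		    struct scatterlist *sg, unsigned int len, int prot,
		    unsigned int options);
int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
		      unsigned int len, unsigned int options);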

Thanks,

Olav

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-10 22:43             ` Olav Haugan
@ 2014-07-10 23:42               ` Rob Clark
  0 siblings, 0 replies; 29+ messages in thread
From: Rob Clark @ 2014-07-10 23:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 10, 2014 at 6:43 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
> On 7/9/2014 5:40 PM, Rob Clark wrote:
>> On Wed, Jul 9, 2014 at 8:03 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>>> On 7/8/2014 4:49 PM, Rob Clark wrote:
>>>> On Tue, Jul 8, 2014 at 5:53 PM, Olav Haugan <ohaugan@codeaurora.org> wrote:
>>>>> Hi Hiroshi,
>>>>>
>>>>> On 7/3/2014 9:29 PM, Hiroshi Doyu wrote:
>>>>>> Hi Olav,
>>>>>>
>>>>>> Olav Haugan <ohaugan@codeaurora.org> writes:
>>>>>>
>>>>>>> Mapping and unmapping are more often than not in the critical path.
>>>>>>> map_range and unmap_range allow SMMU driver implementations to optimize
>>>>>>> the process of mapping and unmapping buffers into the SMMU page tables.
>>>>>>> Instead of mapping one physical address, doing a TLB operation (expensive),
>>>>>>> mapping the next, doing another TLB operation, and so on, the driver can map
>>>>>>> a scatter-gather list of physically contiguous pages into one virtual
>>>>>>> address space and then do a single TLB operation at the end.
>>>>>>>
>>>>>>> Additionally, the mapping operation would be faster in general since
>>>>>>> clients do not have to keep calling the map API over and over again for
>>>>>>> each physically contiguous chunk of memory that needs to be mapped to a
>>>>>>> virtually contiguous region.
>>>>>>>
>>>>>>> Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
>>>>>>> ---
>>>>>>>  drivers/iommu/iommu.c | 24 ++++++++++++++++++++++++
>>>>>>>  include/linux/iommu.h | 24 ++++++++++++++++++++++++
>>>>>>>  2 files changed, 48 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>>>>> index e5555fc..f2a6b80 100644
>>>>>>> --- a/drivers/iommu/iommu.c
>>>>>>> +++ b/drivers/iommu/iommu.c
>>>>>>> @@ -898,6 +898,30 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>>>>>>  EXPORT_SYMBOL_GPL(iommu_unmap);
>>>>>>>
>>>>>>>
>>>>>>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>>>>>>> +                struct scatterlist *sg, unsigned int len, int prot)
>>>>>>> +{
>>>>>>> +    if (unlikely(domain->ops->map_range == NULL))
>>>>>>> +            return -ENODEV;
>>>>>>> +
>>>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>>>> +
>>>>>>> +    return domain->ops->map_range(domain, iova, sg, len, prot);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>>>>>>
>>>>>> We have a similar one internally, which is named "iommu_map_sg()",
>>>>>> called from the DMA API.
>>>>>
>>>>> Great, so this new API will be useful to more people!
>>>>>
>>>>>>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>>>>>>> +                  unsigned int len)
>>>>>>> +{
>>>>>>> +    if (unlikely(domain->ops->unmap_range == NULL))
>>>>>>> +            return -ENODEV;
>>>>>>> +
>>>>>>> +    BUG_ON(iova & (~PAGE_MASK));
>>>>>>> +
>>>>>>> +    return domain->ops->unmap_range(domain, iova, len);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
>>>>>>
>>>>>> Can the existing iommu_unmap() do the same?
>>>>>
>>>>> I believe iommu_unmap() behaves a bit differently because it will keep
>>>>> on calling domain->ops->unmap() until everything is unmapped instead of
>>>>> letting the iommu implementation take care of unmapping everything in
>>>>> one call.
>>>>>
>>>>> I am abandoning the patch series since our driver was not accepted.
>>>>> However, if there are no objections I will resubmit this patch (PATCH
>>>>> 2/7) as an independent patch to add this new map_range API.
>>>>
>>>> +1 for map_range().. I've seen for gpu workloads, at least, that the
>>>> downstream map_range() API is quite beneficial.  It was worth at
>>>> least a few fps in xonotic.
>>>>
>>>> And, possibly getting off the subject a bit, but I was wondering about
>>>> the possibility of going one step further and batching up mapping
>>>> and/or unmapping multiple buffers (ranges) at once.  I have a pretty
>>>> convenient sync point in drm/msm to flush out multiple mappings before
>>>> kicking gpu.
>>>
>>> I think you should be able to do that with this API already - at least
>>> the mapping part since we are passing in a sg list (this could be a
>>> chained sglist).
>>
>> What I mean by batching up is mapping and unmapping multiple sglists
>> each at different iovas with minimal cpu cache and iommu tlb flushes..
>>
>> Ideally we'd let the IOMMU driver be clever and build out all 2nd
>> level tables before inserting into first level tables (to minimize cpu
>> cache flushing).. also, there is probably a reasonable chance that
>> we'd be mapping a new buffer into an existing location, so there might be
>> some potential to reuse existing 2nd level tables (and save a tiny bit
>> of free/alloc).  I've not thought too much about how that would look
>> in code.. might be kinda, umm, fun..
>>
>> But at an API level, we should be able to do a bunch of
>> map/unmap_range's with one flush.
>>
>> Maybe it could look like a sequence of iommu_{map,unmap}_range()
>> followed by iommu_flush()?
>>
>
> So we could add another argument ("options") in the range api that
> allows you to indicate whether you want to invalidate TLB or not.

Sounds reasonable.. I'm pretty sure we want explicit-flush to be an
opt-in behaviour.
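
A caller-side sketch of how that opt-in might combine with an explicit
final flush.  IOMMU_RANGE_OPT_NO_TLB_FLUSH and iommu_flush() are both
hypothetical, as above, and assume map_range defers TLB maintenance when
the flag is set:

static int example_map_batch(struct iommu_domain *domain,
			     struct scatterlist **sgs, unsigned int *iovas,
			     unsigned int *lens, int n)
{
	int i, ret = 0;

	for (i = 0; i < n; i++) {
		ret = iommu_map_range(domain, iovas[i], sgs[i], lens[i],
				      IOMMU_READ | IOMMU_WRITE,
				      IOMMU_RANGE_OPT_NO_TLB_FLUSH);
		if (ret)
			break;
	}

	/* One explicit flush for the whole batch (hypothetical API). */
	iommu_flush(domain);
	return ret;
}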

BR,
-R

> Thanks,
>
> Olav
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
                     ` (2 preceding siblings ...)
  2014-07-04  4:29   ` Hiroshi Doyu
@ 2014-07-11 10:20   ` Joerg Roedel
  2014-07-15  1:13     ` Olav Haugan
  3 siblings, 1 reply; 29+ messages in thread
From: Joerg Roedel @ 2014-07-11 10:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 30, 2014 at 09:51:51AM -0700, Olav Haugan wrote:
> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
> +		    struct scatterlist *sg, unsigned int len, int prot)
> +{
> +	if (unlikely(domain->ops->map_range == NULL))
> +		return -ENODEV;
> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->map_range(domain, iova, sg, len, prot);
> +}
> +EXPORT_SYMBOL_GPL(iommu_map_range);
> +
> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
> +		      unsigned int len)
> +{
> +	if (unlikely(domain->ops->unmap_range == NULL))
> +		return -ENODEV;
> +
> +	BUG_ON(iova & (~PAGE_MASK));
> +
> +	return domain->ops->unmap_range(domain, iova, len);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unmap_range);

Before introducing these new API functions there should be a fall-back
for IOMMU drivers that do (not yet) implement the map_range and
unmap_range call-backs.

The last thing we want is this kind of functional partitioning between
different IOMMU drivers.


	Joerg

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions
  2014-07-11 10:20   ` Joerg Roedel
@ 2014-07-15  1:13     ` Olav Haugan
  0 siblings, 0 replies; 29+ messages in thread
From: Olav Haugan @ 2014-07-15  1:13 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/11/2014 3:20 AM, Joerg Roedel wrote:
> On Mon, Jun 30, 2014 at 09:51:51AM -0700, Olav Haugan wrote:
>> +int iommu_map_range(struct iommu_domain *domain, unsigned int iova,
>> +		    struct scatterlist *sg, unsigned int len, int prot)
>> +{
>> +	if (unlikely(domain->ops->map_range == NULL))
>> +		return -ENODEV;
>> +
>> +	BUG_ON(iova & (~PAGE_MASK));
>> +
>> +	return domain->ops->map_range(domain, iova, sg, len, prot);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_map_range);
>> +
>> +int iommu_unmap_range(struct iommu_domain *domain, unsigned int iova,
>> +		      unsigned int len)
>> +{
>> +	if (unlikely(domain->ops->unmap_range == NULL))
>> +		return -ENODEV;
>> +
>> +	BUG_ON(iova & (~PAGE_MASK));
>> +
>> +	return domain->ops->unmap_range(domain, iova, len);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_unmap_range);
> 
> Before introducing these new API functions there should be a fall-back
> for IOMMU drivers that do (not yet) implement the map_range and
> unmap_range call-backs.
> 
> The last thing we want is this kind of functional partitioning between
> different IOMMU drivers.

Yes, I can definitely add a fallback instead of returning -ENODEV.
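
A minimal sketch of what such a generic fallback could look like, assuming
it simply walks the scatterlist and calls the existing iommu_map()/iommu_unmap()
per chunk; this is only an illustration, not the actual follow-up patch:

/* Fallback used when the driver provides no map_range callback. */
static int iommu_map_range_default(struct iommu_domain *domain,
				   unsigned int iova,
				   struct scatterlist *sg,
				   unsigned int len, int prot)
{
	unsigned int offset = 0;
	int ret;

	while (sg && offset < len) {
		ret = iommu_map(domain, iova + offset, sg_phys(sg),
				sg->length, prot);
		if (ret) {
			/* Roll back whatever was already mapped. */
			iommu_unmap(domain, iova, offset);
			return ret;
		}
		offset += sg->length;
		sg = sg_next(sg);
	}

	return 0;
}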


Thanks,

Olav

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-07-15  1:13 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-30 16:51 [RFC/PATCH 0/7] Add MSM SMMUv1 support Olav Haugan
2014-06-30 16:51 ` [RFC/PATCH 1/7] iommu: msm: Rename iommu driver files Olav Haugan
2014-06-30 16:51 ` [RFC/PATCH 2/7] iommu-api: Add map_range/unmap_range functions Olav Haugan
2014-06-30 19:42   ` Thierry Reding
2014-07-01  9:33   ` Will Deacon
2014-07-01  9:58     ` Varun Sethi
2014-07-04  4:29   ` Hiroshi Doyu
2014-07-08 21:53     ` Olav Haugan
2014-07-08 23:49       ` Rob Clark
2014-07-10  0:03         ` Olav Haugan
2014-07-10  0:40           ` Rob Clark
2014-07-10  7:10             ` Thierry Reding
2014-07-10 11:15               ` Rob Clark
2014-07-10 22:43             ` Olav Haugan
2014-07-10 23:42               ` Rob Clark
2014-07-11 10:20   ` Joerg Roedel
2014-07-15  1:13     ` Olav Haugan
2014-06-30 16:51 ` [RFC/PATCH 3/7] iopoll: Introduce memory-mapped IO polling macros Olav Haugan
2014-06-30 19:46   ` Thierry Reding
2014-07-01  9:40   ` Will Deacon
2014-06-30 16:51 ` [RFC/PATCH 5/7] iommu: msm: Add support for V7L page table format Olav Haugan
2014-06-30 16:51 ` [RFC/PATCH 6/7] defconfig: msm: Enable Qualcomm SMMUv1 driver Olav Haugan
2014-06-30 16:51 ` [RFC/PATCH 7/7] iommu-api: Add domain attribute to enable coherent HTW Olav Haugan
2014-07-01  8:49   ` Varun Sethi
2014-07-02 22:11     ` Olav Haugan
2014-07-03 17:43       ` Will Deacon
2014-07-08 22:24         ` Olav Haugan
     [not found] ` <1404147116-4598-5-git-send-email-ohaugan@codeaurora.org>
2014-06-30 17:02   ` [RFC/PATCH 4/7] iommu: msm: Add MSM IOMMUv1 driver Will Deacon
2014-07-02 22:32     ` Olav Haugan
