All of lore.kernel.org
* [PATCH 0/4] Generic IOMMU page table framework
@ 2014-11-27 11:51 ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

Hi all,

This series introduces a generic IOMMU page table allocation framework,
implements support for ARM long-descriptors and then ports the arm-smmu
driver over to the new code.

There are a few reasons for doing this:

  - Page table code is hard, and I don't enjoy shopping

  - A number of IOMMUs actually use the same table format, but currently
    duplicate the code

  - It provides a CPU (and architecture) independent allocator, which
    may be useful for some systems where the CPU is using a different
    table format for its own mappings

As illustrated in the final patch, an IOMMU driver interacts with the
allocator by passing in a configuration structure describing the
input and output address ranges, the supported page sizes and a set of
ops for performing various TLB invalidation and PTE flushing routines.
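
To make this concrete, below is a minimal driver-side sketch (not taken from
the series; the my_* names and the example page-size/ias/oas values are
illustrative placeholders) of how a driver might wire itself up to the
allocator:

#include <linux/kernel.h>
#include <linux/sizes.h>

#include "io-pgtable.h"

static void my_tlb_flush_all(void *cookie)
{
	/* Invalidate the whole TLB context for this domain */
}

static void my_tlb_add_flush(unsigned long iova, size_t size, bool leaf,
			     void *cookie)
{
	/* Queue up an invalidation for [iova, iova + size) */
}

static void my_tlb_sync(void *cookie)
{
	/* Wait for any queued invalidations to complete */
}

static void my_flush_pgtable(void *ptr, size_t size, void *cookie)
{
	/* Make PTE updates visible to the walker, e.g. clean the CPU cache */
}

static struct iommu_gather_ops my_tlb_ops = {
	.tlb_flush_all	= my_tlb_flush_all,
	.tlb_add_flush	= my_tlb_add_flush,
	.tlb_sync	= my_tlb_sync,
	.flush_pgtable	= my_flush_pgtable,
};

static struct io_pgtable_ops *my_domain_init(void *cookie)
{
	struct io_pgtable_cfg cfg = {
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,	/* 4k granule */
		.ias		= 32,				/* input (IOVA) bits */
		.oas		= 40,				/* output (PA) bits */
		.tlb		= &my_tlb_ops,
	};
	struct io_pgtable_ops *ops;

	ops = alloc_io_pgtable_ops(ARM_LPAE_S1, &cfg, cookie);
	if (!ops)
		return NULL;

	/*
	 * cfg.arm_lpae_s1_cfg.{ttbr,tcr,mair} now hold the values to program
	 * into the hardware; mappings are then managed through ops->map(),
	 * ops->unmap() and ops->iova_to_phys().
	 */
	return ops;
}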

The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
mappings, but I decided not to implement the contiguous bit in the
interest of trying to keep the code semi-readable. This could always be
added later, if needed.

I also included some self-tests for the LPAE implementation. Ideally
we'd merge these, but I'm also happy to drop them if there are
objections.

Tested with the self-tests, as well as with VFIO + MMU-500 at stage-1 and
stage-2. Patches taken against my iommu/devel branch (queued by Joerg
for 3.19).

All feedback welcome.

Will

--->8

Will Deacon (4):
  iommu: introduce generic page table allocation framework
  iommu: add ARM LPAE page table allocator
  iommu: add self-consistency tests to ARM LPAE IO page table allocator
  iommu/arm-smmu: make use of generic LPAE allocator

 MAINTAINERS                    |   1 +
 arch/arm64/Kconfig             |   1 -
 drivers/iommu/Kconfig          |  32 +-
 drivers/iommu/Makefile         |   2 +
 drivers/iommu/arm-smmu.c       | 872 +++++++++++---------------------------
 drivers/iommu/io-pgtable-arm.c | 925 +++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable.c     |  78 ++++
 drivers/iommu/io-pgtable.h     |  77 ++++
 8 files changed, 1361 insertions(+), 627 deletions(-)
 create mode 100644 drivers/iommu/io-pgtable-arm.c
 create mode 100644 drivers/iommu/io-pgtable.c
 create mode 100644 drivers/iommu/io-pgtable.h

-- 
2.1.1

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-11-27 11:51 ` Will Deacon
@ 2014-11-27 11:51     ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

This patch introduces a generic framework for allocating page tables for
an IOMMU. There are a number of reasons we want to do this:

  - It avoids duplication of complex table management code in IOMMU
    drivers that use the same page table format

  - It removes any coupling with the CPU table format (and even the
    architecture!)

  - It defines an API for IOMMU TLB maintenance
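
As an illustration of the provider side (hypothetical: the my_fmt_* names are
placeholders and not part of this patch), a table format plugs in by supplying
an io_pgtable_init_fns pair, looked up through its entry in enum
io_pgtable_fmt and io_pgtable_init_table[]:

static struct io_pgtable *my_fmt_alloc(struct io_pgtable_cfg *cfg,
				       void *cookie)
{
	/*
	 * Allocate the top-level table, derive any register values for the
	 * caller (stashed in *cfg) and fill in iop.ops with the format's
	 * map/unmap/iova_to_phys implementations.
	 */
	return NULL;	/* placeholder */
}

static void my_fmt_free(struct io_pgtable *iop)
{
	/*
	 * Tear the tables down; the core has already called tlb_flush_all()
	 * by the time this runs.
	 */
}

struct io_pgtable_init_fns my_fmt_init_fns = {
	.alloc	= my_fmt_alloc,
	.free	= my_fmt_free,
};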

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/Kconfig      |  8 ++++++
 drivers/iommu/Makefile     |  1 +
 drivers/iommu/io-pgtable.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 145 insertions(+)
 create mode 100644 drivers/iommu/io-pgtable.c
 create mode 100644 drivers/iommu/io-pgtable.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dd5112265cc9..0f10554e7114 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -13,6 +13,14 @@ menuconfig IOMMU_SUPPORT
 
 if IOMMU_SUPPORT
 
+menu "Generic IOMMU Pagetable Support"
+
+# Selected by the actual pagetable implementations
+config IOMMU_IO_PGTABLE
+	bool
+
+endmenu
+
 config OF_IOMMU
        def_bool y
        depends on OF
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 16edef74b8ee..aff244c78181 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
new file mode 100644
index 000000000000..82e39a0db94b
--- /dev/null
+++ b/drivers/iommu/io-pgtable.c
@@ -0,0 +1,71 @@
+/*
+ * Generic page table allocator for IOMMUs.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2014 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+
+#include <linux/bug.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+#include "io-pgtable.h"
+
+static struct io_pgtable_init_fns *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
+{
+};
+
+struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
+					    struct io_pgtable_cfg *cfg,
+					    void *cookie)
+{
+	struct io_pgtable *iop;
+	struct io_pgtable_init_fns *fns;
+
+	if (fmt >= IO_PGTABLE_NUM_FMTS)
+		return NULL;
+
+	fns = io_pgtable_init_table[fmt];
+	if (!fns)
+		return NULL;
+
+	iop = fns->alloc(cfg, cookie);
+	if (!iop)
+		return NULL;
+
+	iop->fmt	= fmt;
+	iop->cookie	= cookie;
+	iop->cfg	= *cfg;
+
+	return &iop->ops;
+}
+
+/*
+ * It is the IOMMU driver's responsibility to ensure that the page table
+ * is no longer accessible to the walker by this point.
+ */
+void free_io_pgtable_ops(struct io_pgtable_ops *ops)
+{
+	struct io_pgtable *iop;
+
+	if (!ops)
+		return;
+
+	iop = container_of(ops, struct io_pgtable, ops);
+	iop->cfg.tlb->tlb_flush_all(iop->cookie);
+	io_pgtable_init_table[iop->fmt]->free(iop);
+}
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
new file mode 100644
index 000000000000..5ae75d9cae50
--- /dev/null
+++ b/drivers/iommu/io-pgtable.h
@@ -0,0 +1,65 @@
+#ifndef __IO_PGTABLE_H
+#define __IO_PGTABLE_H
+
+struct io_pgtable_ops {
+	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
+		   phys_addr_t paddr, size_t size, int prot);
+	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
+		     size_t size);
+	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
+				    unsigned long iova);
+};
+
+struct iommu_gather_ops {
+	/* Synchronously invalidate the entire TLB context */
+	void (*tlb_flush_all)(void *cookie);
+
+	/* Queue up a TLB invalidation for a virtual address range */
+	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
+			      void *cookie);
+	/* Ensure any queued TLB invalidation has taken effect */
+	void (*tlb_sync)(void *cookie);
+
+	/* Ensure page table updates are visible to the IOMMU */
+	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
+};
+
+struct io_pgtable_cfg {
+	int			quirks; /* IO_PGTABLE_QUIRK_* */
+	unsigned long		pgsize_bitmap;
+	unsigned int		ias;
+	unsigned int		oas;
+	struct iommu_gather_ops	*tlb;
+
+	/* Low-level data specific to the table format */
+	union {
+	};
+};
+
+enum io_pgtable_fmt {
+	IO_PGTABLE_NUM_FMTS,
+};
+
+struct io_pgtable {
+	enum io_pgtable_fmt	fmt;
+	void			*cookie;
+	struct io_pgtable_cfg	cfg;
+	struct io_pgtable_ops	ops;
+};
+
+struct io_pgtable_init_fns {
+	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
+	void (*free)(struct io_pgtable *iop);
+};
+
+struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
+					    struct io_pgtable_cfg *cfg,
+					    void *cookie);
+
+/*
+ * Free an io_pgtable_ops structure. The caller *must* ensure that the
+ * page table is no longer live, but the TLB can be dirty.
+ */
+void free_io_pgtable_ops(struct io_pgtable_ops *ops);
+
+#endif /* __IO_PGTABLE_H */
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-11-27 11:51 ` Will Deacon
@ 2014-11-27 11:51     ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

A number of IOMMUs found in ARM SoCs can walk architecture-compatible
page tables.

This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
long-descriptor page tables. 4k, 16k and 64k pages are supported, with
up to 4 levels of walk to cover a 48-bit address space.
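
For reference, the walk depth follows directly from the granule; a sketch of
the arithmetic performed by arm_lpae_alloc_pgtable() below (8-byte PTEs, as
in the code):

	bits_per_level = pg_shift - log2(sizeof(pte))
	levels         = DIV_ROUND_UP(ias - pg_shift, bits_per_level)

	4k  granule, 48-bit ias: (48 - 12) / 9  =  4 levels
	16k granule, 48-bit ias: (48 - 14) / 11 -> 4 levels (rounded up)
	64k granule, 48-bit ias: (48 - 16) / 13 -> 3 levels (rounded up)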

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 MAINTAINERS                    |   1 +
 drivers/iommu/Kconfig          |   9 +
 drivers/iommu/Makefile         |   1 +
 drivers/iommu/io-pgtable-arm.c | 735 +++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable.c     |   7 +
 drivers/iommu/io-pgtable.h     |  12 +
 6 files changed, 765 insertions(+)
 create mode 100644 drivers/iommu/io-pgtable-arm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0ff630de8a6d..d3ca31b7c960 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1562,6 +1562,7 @@ M:	Will Deacon <will.deacon@arm.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/iommu/arm-smmu.c
+F:	drivers/iommu/io-pgtable-arm.c
 
 ARM64 PORT (AARCH64 ARCHITECTURE)
 M:	Catalin Marinas <catalin.marinas@arm.com>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0f10554e7114..e1742a0146f8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -19,6 +19,15 @@ menu "Generic IOMMU Pagetable Support"
 config IOMMU_IO_PGTABLE
 	bool
 
+config IOMMU_IO_PGTABLE_LPAE
+	bool "ARMv7/v8 Long Descriptor Format"
+	select IOMMU_IO_PGTABLE
+	help
+	  Enable support for the ARM long descriptor pagetable format.
+	  This allocator supports 4K/2M/1G, 16K/32M and 64K/512M page
+	  sizes at both stage-1 and stage-2, as well as address spaces
+	  up to 48-bits in size.
+
 endmenu
 
 config OF_IOMMU
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index aff244c78181..269cdd82b672 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
new file mode 100644
index 000000000000..9dbaa2e48424
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -0,0 +1,735 @@
+/*
+ * CPU-agnostic ARM page table allocator.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (C) 2014 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+
+#define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
+
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "io-pgtable.h"
+
+#define ARM_LPAE_MAX_ADDR_BITS		48
+#define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
+#define ARM_LPAE_MAX_LEVELS		4
+
+/* Struct accessors */
+#define io_pgtable_to_data(x)						\
+	container_of((x), struct arm_lpae_io_pgtable, iop)
+
+#define io_pgtable_ops_to_pgtable(x)					\
+	container_of((x), struct io_pgtable, ops)
+
+#define io_pgtable_ops_to_data(x)					\
+	io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
+
+/*
+ * For consistency with the architecture, we always consider
+ * ARM_LPAE_MAX_LEVELS levels, with the walk starting at level n >=0
+ */
+#define ARM_LPAE_START_LVL(d)	(ARM_LPAE_MAX_LEVELS - (d)->levels)
+
+/*
+ * Calculate the right shift amount to get to the portion describing level l
+ * in a virtual address mapped by the pagetable in d.
+ */
+#define ARM_LPAE_LVL_SHIFT(l,d)						\
+	((((d)->levels - ((l) - ARM_LPAE_START_LVL(d) + 1))		\
+	  * (d)->bits_per_level) + (d)->pg_shift)
+
+/*
+ * Calculate the index at level l used to map virtual address a using the
+ * pagetable in d.
+ */
+#define ARM_LPAE_PGD_IDX(l,d)						\
+	((l) == ARM_LPAE_START_LVL(d) ? ilog2((d)->pages_per_pgd) : 0)
+
+#define ARM_LPAE_LVL_IDX(a,l,d)						\
+	(((a) >> ARM_LPAE_LVL_SHIFT(l,d)) &				\
+	 ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
+
+/* Calculate the block/page mapping size at level l for pagetable in d. */
+#define ARM_LPAE_BLOCK_SIZE(l,d)					\
+	(1 << (ilog2(sizeof(arm_lpae_iopte)) +				\
+		((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level)))
+
+/* Page table bits */
+#define ARM_LPAE_PTE_TYPE_SHIFT		0
+#define ARM_LPAE_PTE_TYPE_MASK		0x3
+
+#define ARM_LPAE_PTE_TYPE_BLOCK		1
+#define ARM_LPAE_PTE_TYPE_TABLE		3
+#define ARM_LPAE_PTE_TYPE_PAGE		3
+
+#define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
+#define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
+#define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
+#define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
+
+#define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
+/* Ignore the contiguous bit for block splitting */
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
+					 ARM_LPAE_PTE_ATTR_HI_MASK)
+
+/* Stage-1 PTE */
+#define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
+#define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
+
+/* Stage-2 PTE */
+#define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
+#define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
+#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
+#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
+
+/* Register bits */
+#define ARM_LPAE_TCR_EAE		(1 << 31)
+
+#define ARM_LPAE_TCR_TG0_4K		(0 << 14)
+#define ARM_LPAE_TCR_TG0_64K		(1 << 14)
+#define ARM_LPAE_TCR_TG0_16K		(2 << 14)
+
+#define ARM_LPAE_TCR_SH0_SHIFT		12
+#define ARM_LPAE_TCR_SH0_MASK		0x3
+#define ARM_LPAE_TCR_SH_NS		0
+#define ARM_LPAE_TCR_SH_OS		2
+#define ARM_LPAE_TCR_SH_IS		3
+
+#define ARM_LPAE_TCR_ORGN0_SHIFT	10
+#define ARM_LPAE_TCR_IRGN0_SHIFT	8
+#define ARM_LPAE_TCR_RGN_MASK		0x3
+#define ARM_LPAE_TCR_RGN_NC		0
+#define ARM_LPAE_TCR_RGN_WBWA		1
+#define ARM_LPAE_TCR_RGN_WT		2
+#define ARM_LPAE_TCR_RGN_WB		3
+
+#define ARM_LPAE_TCR_SL0_SHIFT		6
+#define ARM_LPAE_TCR_SL0_MASK		0x3
+
+#define ARM_LPAE_TCR_T0SZ_SHIFT		0
+#define ARM_LPAE_TCR_SZ_MASK		0xf
+
+#define ARM_LPAE_TCR_PS_SHIFT		16
+#define ARM_LPAE_TCR_PS_MASK		0x7
+
+#define ARM_LPAE_TCR_IPS_SHIFT		32
+#define ARM_LPAE_TCR_IPS_MASK		0x7
+
+#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
+#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
+#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
+#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
+#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
+#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
+
+#define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
+#define ARM_LPAE_MAIR_ATTR_MASK		0xff
+#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
+#define ARM_LPAE_MAIR_ATTR_NC		0x44
+#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
+#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
+#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
+#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
+
+/* IOPTE accessors */
+#define iopte_deref(pte,d)					\
+	(__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)	\
+	& ~((1ULL << (d)->pg_shift) - 1)))
+
+#define iopte_type(pte,l)					\
+	(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
+
+#define iopte_prot(pte)	((pte) & ARM_LPAE_PTE_ATTR_MASK)
+
+#define iopte_leaf(pte,l)					\
+	(l == (ARM_LPAE_MAX_LEVELS - 1) ?			\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) :	\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK))
+
+#define iopte_to_pfn(pte,d)					\
+	(((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift)
+
+#define pfn_to_iopte(pfn,d)					\
+	(((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1))
+
+struct arm_lpae_io_pgtable {
+	struct io_pgtable	iop;
+
+	int			levels;
+	int			pages_per_pgd;
+	unsigned long		pg_shift;
+	unsigned long		bits_per_level;
+
+	void			*pgd;
+};
+
+typedef u64 arm_lpae_iopte;
+
+static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+			     unsigned long iova, phys_addr_t paddr,
+			     arm_lpae_iopte prot, int lvl,
+			     arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte = prot;
+
+	/* We require an unmap first */
+	if (iopte_leaf(*ptep, lvl))
+		return -EEXIST;
+
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		pte |= ARM_LPAE_PTE_TYPE_PAGE;
+	else
+		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
+	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
+
+	*ptep = pte;
+	data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), data->iop.cookie);
+	return 0;
+}
+
+static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
+			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
+			  int lvl, arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *cptep, pte;
+	void *cookie = data->iop.cookie;
+	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	/* Find our entry at the current level */
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+	/* If we can install a leaf entry at this level, then do so */
+	if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
+		return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
+
+	/* We can't allocate tables at the final level */
+	if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
+		return -EINVAL;
+
+	/* Grab a pointer to the next level */
+	pte = *ptep;
+	if (!pte) {
+		cptep = alloc_pages_exact(1UL << data->pg_shift,
+					 GFP_ATOMIC | __GFP_ZERO);
+		if (!cptep)
+			return -ENOMEM;
+
+		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
+						 cookie);
+		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
+		*ptep = pte;
+		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	} else {
+		cptep = iopte_deref(pte, data);
+	}
+
+	/* Rinse, repeat */
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep);
+}
+
+static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
+					   int prot)
+{
+	arm_lpae_iopte pte;
+
+	if (data->iop.fmt == ARM_LPAE_S1) {
+		pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
+
+		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
+			pte |= ARM_LPAE_PTE_AP_RDONLY;
+
+		if (prot & IOMMU_CACHE)
+			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
+				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
+	} else {
+		pte = ARM_LPAE_PTE_HAP_FAULT;
+		if (prot & IOMMU_READ)
+			pte |= ARM_LPAE_PTE_HAP_READ;
+		if (prot & IOMMU_WRITE)
+			pte |= ARM_LPAE_PTE_HAP_WRITE;
+		if (prot & IOMMU_CACHE)
+			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+		else
+			pte |= ARM_LPAE_PTE_MEMATTR_NC;
+	}
+
+	if (prot & IOMMU_NOEXEC)
+		pte |= ARM_LPAE_PTE_XN;
+
+	return pte;
+}
+
+static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
+			phys_addr_t paddr, size_t size, int iommu_prot)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+	arm_lpae_iopte prot;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return 0;
+
+	prot = arm_lpae_prot_to_pte(data, iommu_prot);
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep);
+}
+
+static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
+				    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *start, *end;
+	unsigned long table_size;
+
+	/* Only leaf entries at the last level */
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		return;
+
+	table_size = 1UL << data->pg_shift;
+	if (lvl == ARM_LPAE_START_LVL(data))
+		table_size *= data->pages_per_pgd;
+
+	start = ptep;
+	end = (void *)ptep + table_size;
+
+	while (ptep != end) {
+		arm_lpae_iopte pte = *ptep++;
+
+		if (!pte || iopte_leaf(pte, lvl))
+			continue;
+
+		__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+	}
+
+	free_pages_exact(start, table_size);
+}
+
+static void arm_lpae_free_pgtable(struct io_pgtable *iop)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
+
+	__arm_lpae_free_pgtable(data, ARM_LPAE_START_LVL(data), data->pgd);
+	kfree(data);
+}
+
+static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
+				    unsigned long iova, size_t size,
+				    arm_lpae_iopte prot, int lvl,
+				    arm_lpae_iopte *ptep, size_t blk_size)
+{
+	unsigned long blk_start, blk_end;
+	phys_addr_t blk_paddr;
+	arm_lpae_iopte table = 0;
+	void *cookie = data->iop.cookie;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+
+	blk_start = iova & ~(blk_size - 1);
+	blk_end = blk_start + blk_size;
+	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
+
+	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
+		arm_lpae_iopte *tablep;
+
+		/* Unmap! */
+		if (blk_start == iova)
+			continue;
+
+		/* __arm_lpae_map expects a pointer to the start of the table */
+		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
+		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
+				   tablep) < 0) {
+			if (table) {
+				/* Free the table we allocated */
+				tablep = iopte_deref(table, data);
+				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
+			}
+			return 0; /* Bytes unmapped */
+		}
+	}
+
+	*ptep = table;
+	tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	iova &= ~(blk_size - 1);
+	tlb->tlb_add_flush(iova, blk_size, true, cookie);
+	return size;
+}
+
+static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
+			    unsigned long iova, size_t size, int lvl,
+			    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+	void *cookie = data->iop.cookie;
+	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = *ptep;
+
+	/* Something went horribly wrong and we ran out of page table */
+	if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
+		return 0;
+
+	/* If the size matches this level, we're in the right place */
+	if (size == blk_size) {
+		*ptep = 0;
+		tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+
+		if (!iopte_leaf(pte, lvl)) {
+			/* Also flush any partial walks */
+			tlb->tlb_add_flush(iova, size, false, cookie);
+			tlb->tlb_sync(data->iop.cookie);
+			ptep = iopte_deref(pte, data);
+			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
+		} else {
+			tlb->tlb_add_flush(iova, size, true, cookie);
+		}
+
+		return size;
+	} else if (iopte_leaf(pte, lvl)) {
+		/*
+		 * Insert a table at the next level to map the old region,
+		 * minus the part we want to unmap
+		 */
+		return arm_lpae_split_blk_unmap(data, iova, size,
+						iopte_prot(pte), lvl, ptep,
+						blk_size);
+	}
+
+	/* Keep on walkin' */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
+}
+
+static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
+			  size_t size)
+{
+	size_t unmapped;
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable *iop = &data->iop;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
+	if (unmapped)
+		iop->cfg.tlb->tlb_sync(iop->cookie);
+
+	return unmapped;
+}
+
+static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+					 unsigned long iova)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte pte, *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	do {
+		/* Valid IOPTE pointer? */
+		if (!ptep)
+			return 0;
+
+		/* Grab the IOPTE we're interested in */
+		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
+
+		/* Valid entry? */
+		if (!pte)
+			return 0;
+
+		/* Leaf entry? */
+		if (iopte_leaf(pte,lvl))
+			goto found_translation;
+
+		/* Take it to the next level */
+		ptep = iopte_deref(pte, data);
+	} while (++lvl < ARM_LPAE_MAX_LEVELS);
+
+	/* Ran out of page tables to walk */
+	return 0;
+
+found_translation:
+	iova &= ((1 << data->pg_shift) - 1);
+	return ((phys_addr_t)iopte_to_pfn(pte,data) << data->pg_shift) | iova;
+}
+
+static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
+{
+	unsigned long granule;
+
+	/*
+	 * We need to restrict the supported page sizes to match the
+	 * translation regime for a particular granule. Aim to match
+	 * the CPU page size if possible, otherwise prefer smaller sizes.
+	 * While we're at it, restrict the block sizes to match the
+	 * chosen granule.
+	 */
+	if (cfg->pgsize_bitmap & PAGE_SIZE)
+		granule = PAGE_SIZE;
+	else if (cfg->pgsize_bitmap & ~PAGE_MASK)
+		granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
+	else if (cfg->pgsize_bitmap & PAGE_MASK)
+		granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
+	else
+		granule = 0;
+
+	switch (granule) {
+	case SZ_4K:
+		cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
+		break;
+	case SZ_16K:
+		cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
+		break;
+	case SZ_64K:
+		cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
+		break;
+	default:
+		cfg->pgsize_bitmap = 0;
+	}
+}
+
+static struct arm_lpae_io_pgtable *
+arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
+{
+	unsigned long va_bits;
+	struct arm_lpae_io_pgtable *data;
+
+	arm_lpae_restrict_pgsizes(cfg);
+
+	if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
+		return NULL;
+
+	if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return NULL;
+
+	data->pages_per_pgd = 1;
+	data->pg_shift = __ffs(cfg->pgsize_bitmap);
+	data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
+
+	va_bits = cfg->ias - data->pg_shift;
+	data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
+
+	data->iop.ops = (struct io_pgtable_ops) {
+		.map		= arm_lpae_map,
+		.unmap		= arm_lpae_unmap,
+		.iova_to_phys	= arm_lpae_iova_to_phys,
+	};
+
+	return data;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/* TCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	cfg->arm_lpae_s1_cfg.tcr = reg;
+
+	/* MAIRs */
+	reg = (ARM_LPAE_MAIR_ATTR_NC
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
+	      (ARM_LPAE_MAIR_ATTR_WBRWA
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
+	      (ARM_LPAE_MAIR_ATTR_DEVICE
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
+
+	cfg->arm_lpae_s1_cfg.mair[0] = reg;
+	cfg->arm_lpae_s1_cfg.mair[1] = 0;
+
+	/* Looking good; allocate a pgd */
+	data->pgd = alloc_pages_exact(1UL << data->pg_shift,
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, (1UL << data->pg_shift), cookie);
+
+	/* TTBRs */
+	cfg->arm_lpae_s1_cfg.ttbr[0] = virt_to_phys(data->pgd);
+	cfg->arm_lpae_s1_cfg.ttbr[1] = 0;
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg, sl;
+	size_t pgd_size;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/*
+	 * Concatenate PGDs at level 1 if possible in order to reduce
+	 * the depth of the stage-2 walk.
+	 */
+	if (data->levels == ARM_LPAE_MAX_LEVELS) {
+		unsigned long pgd_bits, pgd_pages;
+		unsigned long va_bits = cfg->ias - data->pg_shift;
+
+		pgd_bits = data->bits_per_level * (data->levels - 1);
+		pgd_pages = 1 << (va_bits - pgd_bits);
+		if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
+			data->pages_per_pgd = pgd_pages;
+			data->levels--;
+		}
+	}
+
+	/* VTCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	sl = ARM_LPAE_START_LVL(data);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		sl++; /* SL0 format is different for 4K granule size */
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	reg |= (~sl & ARM_LPAE_TCR_SL0_MASK) << ARM_LPAE_TCR_SL0_SHIFT;
+	cfg->arm_lpae_s2_cfg.vtcr = reg;
+
+	/* Allocate pgd pages */
+	pgd_size = data->pages_per_pgd * (1UL << data->pg_shift);
+	data->pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, pgd_size, cookie);
+
+	/* VTTBR */
+	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s1,
+	.free	= arm_lpae_free_pgtable,
+};
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s2,
+	.free	= arm_lpae_free_pgtable,
+};
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 82e39a0db94b..d0a2016efcb4 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -25,8 +25,15 @@
 
 #include "io-pgtable.h"
 
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns;
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns;
+
 static struct io_pgtable_init_fns *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
 {
+#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE
+	[ARM_LPAE_S1] = &io_pgtable_arm_lpae_s1_init_fns,
+	[ARM_LPAE_S2] = &io_pgtable_arm_lpae_s2_init_fns,
+#endif
 };
 
 struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 5ae75d9cae50..c1cff3d045db 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -33,10 +33,22 @@ struct io_pgtable_cfg {
 
 	/* Low-level data specific to the table format */
 	union {
+		struct {
+			u64	ttbr[2];
+			u64	tcr;
+			u64	mair[2];
+		} arm_lpae_s1_cfg;
+
+		struct {
+			u64	vttbr;
+			u64	vtcr;
+		} arm_lpae_s2_cfg;
 	};
 };
 
 enum io_pgtable_fmt {
+	ARM_LPAE_S1,
+	ARM_LPAE_S2,
 	IO_PGTABLE_NUM_FMTS,
 };
 
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-11-27 11:51     ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel

A number of IOMMUs found in ARM SoCs can walk architecture-compatible
page tables.

This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
long-descriptor page tables. 4k, 16k and 64k pages are supported, with
up to 4 levels of walk to cover a 48-bit address space.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 MAINTAINERS                    |   1 +
 drivers/iommu/Kconfig          |   9 +
 drivers/iommu/Makefile         |   1 +
 drivers/iommu/io-pgtable-arm.c | 735 +++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable.c     |   7 +
 drivers/iommu/io-pgtable.h     |  12 +
 6 files changed, 765 insertions(+)
 create mode 100644 drivers/iommu/io-pgtable-arm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0ff630de8a6d..d3ca31b7c960 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1562,6 +1562,7 @@ M:	Will Deacon <will.deacon@arm.com>
 L:	linux-arm-kernel at lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/iommu/arm-smmu.c
+F:	drivers/iommu/io-pgtable-arm.c
 
 ARM64 PORT (AARCH64 ARCHITECTURE)
 M:	Catalin Marinas <catalin.marinas@arm.com>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0f10554e7114..e1742a0146f8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -19,6 +19,15 @@ menu "Generic IOMMU Pagetable Support"
 config IOMMU_IO_PGTABLE
 	bool
 
+config IOMMU_IO_PGTABLE_LPAE
+	bool "ARMv7/v8 Long Descriptor Format"
+	select IOMMU_IO_PGTABLE
+	help
+	  Enable support for the ARM long descriptor pagetable format.
+	  This allocator supports 4K/2M/1G, 16K/32M and 64K/512M page
+	  sizes at both stage-1 and stage-2, as well as address spaces
+	  up to 48-bits in size.
+
 endmenu
 
 config OF_IOMMU
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index aff244c78181..269cdd82b672 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
new file mode 100644
index 000000000000..9dbaa2e48424
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -0,0 +1,735 @@
+/*
+ * CPU-agnostic ARM page table allocator.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (C) 2014 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+
+#define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
+
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "io-pgtable.h"
+
+#define ARM_LPAE_MAX_ADDR_BITS		48
+#define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
+#define ARM_LPAE_MAX_LEVELS		4
+
+/* Struct accessors */
+#define io_pgtable_to_data(x)						\
+	container_of((x), struct arm_lpae_io_pgtable, iop)
+
+#define io_pgtable_ops_to_pgtable(x)					\
+	container_of((x), struct io_pgtable, ops)
+
+#define io_pgtable_ops_to_data(x)					\
+	io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
+
+/*
+ * For consistency with the architecture, we always consider
+ * ARM_LPAE_MAX_LEVELS levels, with the walk starting at level n >=0
+ */
+#define ARM_LPAE_START_LVL(d)	(ARM_LPAE_MAX_LEVELS - (d)->levels)
+
+/*
+ * Calculate the right shift amount to get to the portion describing level l
+ * in a virtual address mapped by the pagetable in d.
+ */
+#define ARM_LPAE_LVL_SHIFT(l,d)						\
+	((((d)->levels - ((l) - ARM_LPAE_START_LVL(d) + 1))		\
+	  * (d)->bits_per_level) + (d)->pg_shift)
+
+/*
+ * Calculate the index at level l used to map virtual address a using the
+ * pagetable in d.
+ */
+#define ARM_LPAE_PGD_IDX(l,d)						\
+	((l) == ARM_LPAE_START_LVL(d) ? ilog2((d)->pages_per_pgd) : 0)
+
+#define ARM_LPAE_LVL_IDX(a,l,d)						\
+	(((a) >> ARM_LPAE_LVL_SHIFT(l,d)) &				\
+	 ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
+
+/* Calculate the block/page mapping size at level l for pagetable in d. */
+#define ARM_LPAE_BLOCK_SIZE(l,d)					\
+	(1 << (ilog2(sizeof(arm_lpae_iopte)) +				\
+		((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level)))
+
+/* Page table bits */
+#define ARM_LPAE_PTE_TYPE_SHIFT		0
+#define ARM_LPAE_PTE_TYPE_MASK		0x3
+
+#define ARM_LPAE_PTE_TYPE_BLOCK		1
+#define ARM_LPAE_PTE_TYPE_TABLE		3
+#define ARM_LPAE_PTE_TYPE_PAGE		3
+
+#define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
+#define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
+#define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
+#define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
+
+#define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
+/* Ignore the contiguous bit for block splitting */
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
+					 ARM_LPAE_PTE_ATTR_HI_MASK)
+
+/* Stage-1 PTE */
+#define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
+#define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
+
+/* Stage-2 PTE */
+#define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
+#define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
+#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
+#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
+
+/* Register bits */
+#define ARM_LPAE_TCR_EAE		(1 << 31)
+
+#define ARM_LPAE_TCR_TG0_4K		(0 << 14)
+#define ARM_LPAE_TCR_TG0_64K		(1 << 14)
+#define ARM_LPAE_TCR_TG0_16K		(2 << 14)
+
+#define ARM_LPAE_TCR_SH0_SHIFT		12
+#define ARM_LPAE_TCR_SH0_MASK		0x3
+#define ARM_LPAE_TCR_SH_NS		0
+#define ARM_LPAE_TCR_SH_OS		2
+#define ARM_LPAE_TCR_SH_IS		3
+
+#define ARM_LPAE_TCR_ORGN0_SHIFT	10
+#define ARM_LPAE_TCR_IRGN0_SHIFT	8
+#define ARM_LPAE_TCR_RGN_MASK		0x3
+#define ARM_LPAE_TCR_RGN_NC		0
+#define ARM_LPAE_TCR_RGN_WBWA		1
+#define ARM_LPAE_TCR_RGN_WT		2
+#define ARM_LPAE_TCR_RGN_WB		3
+
+#define ARM_LPAE_TCR_SL0_SHIFT		6
+#define ARM_LPAE_TCR_SL0_MASK		0x3
+
+#define ARM_LPAE_TCR_T0SZ_SHIFT		0
+#define ARM_LPAE_TCR_SZ_MASK		0xf
+
+#define ARM_LPAE_TCR_PS_SHIFT		16
+#define ARM_LPAE_TCR_PS_MASK		0x7
+
+#define ARM_LPAE_TCR_IPS_SHIFT		32
+#define ARM_LPAE_TCR_IPS_MASK		0x7
+
+#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
+#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
+#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
+#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
+#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
+#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
+
+#define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
+#define ARM_LPAE_MAIR_ATTR_MASK		0xff
+#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
+#define ARM_LPAE_MAIR_ATTR_NC		0x44
+#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
+#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
+#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
+#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
+
+/* IOPTE accessors */
+#define iopte_deref(pte,d)					\
+	(__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)	\
+	& ~((1ULL << (d)->pg_shift) - 1)))
+
+#define iopte_type(pte,l)					\
+	(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
+
+#define iopte_prot(pte)	((pte) & ARM_LPAE_PTE_ATTR_MASK)
+
+#define iopte_leaf(pte,l)					\
+	(l == (ARM_LPAE_MAX_LEVELS - 1) ?			\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) :	\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK))
+
+#define iopte_to_pfn(pte,d)					\
+	(((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift)
+
+#define pfn_to_iopte(pfn,d)					\
+	(((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1))
+
+struct arm_lpae_io_pgtable {
+	struct io_pgtable	iop;
+
+	int			levels;
+	int			pages_per_pgd;
+	unsigned long		pg_shift;
+	unsigned long		bits_per_level;
+
+	void			*pgd;
+};
+
+typedef u64 arm_lpae_iopte;
+
+static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+			     unsigned long iova, phys_addr_t paddr,
+			     arm_lpae_iopte prot, int lvl,
+			     arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte = prot;
+
+	/* We require an unmap first */
+	if (iopte_leaf(*ptep, lvl))
+		return -EEXIST;
+
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		pte |= ARM_LPAE_PTE_TYPE_PAGE;
+	else
+		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
+	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
+
+	*ptep = pte;
+	data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), data->iop.cookie);
+	return 0;
+}
+
+static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
+			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
+			  int lvl, arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *cptep, pte;
+	void *cookie = data->iop.cookie;
+	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	/* Find our entry at the current level */
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+	/* If we can install a leaf entry at this level, then do so */
+	if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
+		return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
+
+	/* We can't allocate tables at the final level */
+	if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
+		return -EINVAL;
+
+	/* Grab a pointer to the next level */
+	pte = *ptep;
+	if (!pte) {
+		cptep = alloc_pages_exact(1UL << data->pg_shift,
+					 GFP_ATOMIC | __GFP_ZERO);
+		if (!cptep)
+			return -ENOMEM;
+
+		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
+						 cookie);
+		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
+		*ptep = pte;
+		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	} else {
+		cptep = iopte_deref(pte, data);
+	}
+
+	/* Rinse, repeat */
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep);
+}
+
+static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
+					   int prot)
+{
+	arm_lpae_iopte pte;
+
+	if (data->iop.fmt == ARM_LPAE_S1) {
+		pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
+
+		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
+			pte |= ARM_LPAE_PTE_AP_RDONLY;
+
+		if (prot & IOMMU_CACHE)
+			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
+				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
+	} else {
+		pte = ARM_LPAE_PTE_HAP_FAULT;
+		if (prot & IOMMU_READ)
+			pte |= ARM_LPAE_PTE_HAP_READ;
+		if (prot & IOMMU_WRITE)
+			pte |= ARM_LPAE_PTE_HAP_WRITE;
+		if (prot & IOMMU_CACHE)
+			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+		else
+			pte |= ARM_LPAE_PTE_MEMATTR_NC;
+	}
+
+	if (prot & IOMMU_NOEXEC)
+		pte |= ARM_LPAE_PTE_XN;
+
+	return pte;
+}
+
+static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
+			phys_addr_t paddr, size_t size, int iommu_prot)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+	arm_lpae_iopte prot;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return 0;
+
+	prot = arm_lpae_prot_to_pte(data, iommu_prot);
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep);
+}
+
+static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
+				    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *start, *end;
+	unsigned long table_size;
+
+	/* Only leaf entries at the last level */
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		return;
+
+	table_size = 1UL << data->pg_shift;
+	if (lvl == ARM_LPAE_START_LVL(data))
+		table_size *= data->pages_per_pgd;
+
+	start = ptep;
+	end = (void *)ptep + table_size;
+
+	while (ptep != end) {
+		arm_lpae_iopte pte = *ptep++;
+
+		if (!pte || iopte_leaf(pte, lvl))
+			continue;
+
+		__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+	}
+
+	free_pages_exact(start, table_size);
+}
+
+static void arm_lpae_free_pgtable(struct io_pgtable *iop)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
+
+	__arm_lpae_free_pgtable(data, ARM_LPAE_START_LVL(data), data->pgd);
+	kfree(data);
+}
+
+static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
+				    unsigned long iova, size_t size,
+				    arm_lpae_iopte prot, int lvl,
+				    arm_lpae_iopte *ptep, size_t blk_size)
+{
+	unsigned long blk_start, blk_end;
+	phys_addr_t blk_paddr;
+	arm_lpae_iopte table = 0;
+	void *cookie = data->iop.cookie;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+
+	blk_start = iova & ~(blk_size - 1);
+	blk_end = blk_start + blk_size;
+	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
+
+	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
+		arm_lpae_iopte *tablep;
+
+		/* Unmap! */
+		if (blk_start == iova)
+			continue;
+
+		/* __arm_lpae_map expects a pointer to the start of the table */
+		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
+		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
+				   tablep) < 0) {
+			if (table) {
+				/* Free the table we allocated */
+				tablep = iopte_deref(table, data);
+				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
+			}
+			return 0; /* Bytes unmapped */
+		}
+	}
+
+	*ptep = table;
+	tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	iova &= ~(blk_size - 1);
+	tlb->tlb_add_flush(iova, blk_size, true, cookie);
+	return size;
+}
+
+static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
+			    unsigned long iova, size_t size, int lvl,
+			    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+	void *cookie = data->iop.cookie;
+	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = *ptep;
+
+	/* Something went horribly wrong and we ran out of page table */
+	if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
+		return 0;
+
+	/* If the size matches this level, we're in the right place */
+	if (size == blk_size) {
+		*ptep = 0;
+		tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+
+		if (!iopte_leaf(pte, lvl)) {
+			/* Also flush any partial walks */
+			tlb->tlb_add_flush(iova, size, false, cookie);
+			tlb->tlb_sync(data->iop.cookie);
+			ptep = iopte_deref(pte, data);
+			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
+		} else {
+			tlb->tlb_add_flush(iova, size, true, cookie);
+		}
+
+		return size;
+	} else if (iopte_leaf(pte, lvl)) {
+		/*
+		 * Insert a table at the next level to map the old region,
+		 * minus the part we want to unmap
+		 */
+		return arm_lpae_split_blk_unmap(data, iova, size,
+						iopte_prot(pte), lvl, ptep,
+						blk_size);
+	}
+
+	/* Keep on walkin' */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
+}
+
+static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
+			  size_t size)
+{
+	size_t unmapped;
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable *iop = &data->iop;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
+	if (unmapped)
+		iop->cfg.tlb->tlb_sync(iop->cookie);
+
+	return unmapped;
+}
+
+static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+					 unsigned long iova)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte pte, *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	do {
+		/* Valid IOPTE pointer? */
+		if (!ptep)
+			return 0;
+
+		/* Grab the IOPTE we're interested in */
+		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
+
+		/* Valid entry? */
+		if (!pte)
+			return 0;
+
+		/* Leaf entry? */
+		if (iopte_leaf(pte, lvl))
+			goto found_translation;
+
+		/* Take it to the next level */
+		ptep = iopte_deref(pte, data);
+	} while (++lvl < ARM_LPAE_MAX_LEVELS);
+
+	/* Ran out of page tables to walk */
+	return 0;
+
+found_translation:
+	iova &= ((1 << data->pg_shift) - 1);
+	return ((phys_addr_t)iopte_to_pfn(pte, data) << data->pg_shift) | iova;
+}
+
+static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
+{
+	unsigned long granule;
+
+	/*
+	 * We need to restrict the supported page sizes to match the
+	 * translation regime for a particular granule. Aim to match
+	 * the CPU page size if possible, otherwise prefer smaller sizes.
+	 * While we're at it, restrict the block sizes to match the
+	 * chosen granule.
+	 */
+	if (cfg->pgsize_bitmap & PAGE_SIZE)
+		granule = PAGE_SIZE;
+	else if (cfg->pgsize_bitmap & ~PAGE_MASK)
+		granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
+	else if (cfg->pgsize_bitmap & PAGE_MASK)
+		granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
+	else
+		granule = 0;
+
+	switch (granule) {
+	case SZ_4K:
+		cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
+		break;
+	case SZ_16K:
+		cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
+		break;
+	case SZ_64K:
+		cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
+		break;
+	default:
+		cfg->pgsize_bitmap = 0;
+	}
+}
+
+static struct arm_lpae_io_pgtable *
+arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
+{
+	unsigned long va_bits;
+	struct arm_lpae_io_pgtable *data;
+
+	arm_lpae_restrict_pgsizes(cfg);
+
+	if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
+		return NULL;
+
+	if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return NULL;
+
+	data->pages_per_pgd = 1;
+	data->pg_shift = __ffs(cfg->pgsize_bitmap);
+	data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
+
+	va_bits = cfg->ias - data->pg_shift;
+	data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
+
+	data->iop.ops = (struct io_pgtable_ops) {
+		.map		= arm_lpae_map,
+		.unmap		= arm_lpae_unmap,
+		.iova_to_phys	= arm_lpae_iova_to_phys,
+	};
+
+	return data;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/* TCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	cfg->arm_lpae_s1_cfg.tcr = reg;
+
+	/* MAIRs */
+	reg = (ARM_LPAE_MAIR_ATTR_NC
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
+	      (ARM_LPAE_MAIR_ATTR_WBRWA
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
+	      (ARM_LPAE_MAIR_ATTR_DEVICE
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
+
+	cfg->arm_lpae_s1_cfg.mair[0] = reg;
+	cfg->arm_lpae_s1_cfg.mair[1] = 0;
+
+	/* Looking good; allocate a pgd */
+	data->pgd = alloc_pages_exact(1UL << data->pg_shift,
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, (1UL << data->pg_shift), cookie);
+
+	/* TTBRs */
+	cfg->arm_lpae_s1_cfg.ttbr[0] = virt_to_phys(data->pgd);
+	cfg->arm_lpae_s1_cfg.ttbr[1] = 0;
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg, sl;
+	size_t pgd_size;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/*
+	 * Concatenate PGDs at level 1 if possible in order to reduce
+	 * the depth of the stage-2 walk.
+	 */
+	if (data->levels == ARM_LPAE_MAX_LEVELS) {
+		unsigned long pgd_bits, pgd_pages;
+		unsigned long va_bits = cfg->ias - data->pg_shift;
+
+		pgd_bits = data->bits_per_level * (data->levels - 1);
+		pgd_pages = 1 << (va_bits - pgd_bits);
+		if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
+			data->pages_per_pgd = pgd_pages;
+			data->levels--;
+		}
+	}
+
+	/* VTCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	sl = ARM_LPAE_START_LVL(data);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		sl++; /* SL0 format is different for 4K granule size */
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	reg |= (~sl & ARM_LPAE_TCR_SL0_MASK) << ARM_LPAE_TCR_SL0_SHIFT;
+	cfg->arm_lpae_s2_cfg.vtcr = reg;
+
+	/* Allocate pgd pages */
+	pgd_size = data->pages_per_pgd * (1UL << data->pg_shift);
+	data->pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, pgd_size, cookie);
+
+	/* VTTBR */
+	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s1,
+	.free	= arm_lpae_free_pgtable,
+};
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s2,
+	.free	= arm_lpae_free_pgtable,
+};
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 82e39a0db94b..d0a2016efcb4 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -25,8 +25,15 @@
 
 #include "io-pgtable.h"
 
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns;
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns;
+
 static struct io_pgtable_init_fns *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
 {
+#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE
+	[ARM_LPAE_S1] = &io_pgtable_arm_lpae_s1_init_fns,
+	[ARM_LPAE_S2] = &io_pgtable_arm_lpae_s2_init_fns,
+#endif
 };
 
 struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 5ae75d9cae50..c1cff3d045db 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -33,10 +33,22 @@ struct io_pgtable_cfg {
 
 	/* Low-level data specific to the table format */
 	union {
+		struct {
+			u64	ttbr[2];
+			u64	tcr;
+			u64	mair[2];
+		} arm_lpae_s1_cfg;
+
+		struct {
+			u64	vttbr;
+			u64	vtcr;
+		} arm_lpae_s2_cfg;
 	};
 };
 
 enum io_pgtable_fmt {
+	ARM_LPAE_S1,
+	ARM_LPAE_S2,
 	IO_PGTABLE_NUM_FMTS,
 };
 
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread
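
For orientation, here is a minimal usage sketch of the interface introduced
above, along the lines of the selftests in the next patch. It is not part of
the posted series: the example_* names, the empty callback bodies and the cfg
values are illustrative only; the calls themselves are the ones defined by
io-pgtable.h in this patch.

#include <linux/bug.h>
#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/sizes.h>

#include "io-pgtable.h"

/* No-op TLB callbacks; a real driver issues the relevant invalidations */
static void example_tlb_flush_all(void *cookie) { }
static void example_tlb_add_flush(unsigned long iova, size_t size,
				  bool leaf, void *cookie) { }
static void example_tlb_sync(void *cookie) { }
static void example_flush_pgtable(void *ptr, size_t size, void *cookie) { }

static struct iommu_gather_ops example_tlb_ops = {
	.tlb_flush_all	= example_tlb_flush_all,
	.tlb_add_flush	= example_tlb_add_flush,
	.tlb_sync	= example_tlb_sync,
	.flush_pgtable	= example_flush_pgtable,
};

static int example_create_mapping(void *cookie)
{
	struct io_pgtable_cfg cfg = {
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 32,	/* input (IOVA) address bits */
		.oas		= 40,	/* output (PA) address bits */
		.tlb		= &example_tlb_ops,
	};
	struct io_pgtable_ops *ops;
	int ret;

	ops = alloc_io_pgtable_ops(ARM_LPAE_S1, &cfg, cookie);
	if (!ops)
		return -ENOMEM;

	/* Identity-map a 2MB block at 1GB, look it up, then remove it */
	ret = ops->map(ops, SZ_1G, SZ_1G, SZ_2M, IOMMU_READ | IOMMU_WRITE);
	if (!ret) {
		WARN_ON(ops->iova_to_phys(ops, SZ_1G + 42) != SZ_1G + 42);
		ops->unmap(ops, SZ_1G, SZ_2M);
	}

	free_io_pgtable_ops(ops);
	return ret;
}

A real driver keeps the returned ops in its domain structure and routes its
iommu_ops map/unmap/iova_to_phys callbacks through them, which is what the
arm-smmu conversion in patch 4 does.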

* [PATCH 3/4] iommu: add self-consistency tests to ARM LPAE IO page table allocator
  2014-11-27 11:51 ` Will Deacon
@ 2014-11-27 11:51     ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

This patch adds a series of basic self-consistency tests to the ARM LPAE
IO page table allocator that exercise corner cases in map/unmap, as well
as all valid configurations of pagesize, ias and stage.

Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/Kconfig          |   9 ++
 drivers/iommu/io-pgtable-arm.c | 190 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 199 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e1742a0146f8..dde72d0990b0 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -28,6 +28,15 @@ config IOMMU_IO_PGTABLE_LPAE
 	  sizes at both stage-1 and stage-2, as well as address spaces
 	  up to 48-bits in size.
 
+config IOMMU_IO_PGTABLE_LPAE_SELFTEST
+	bool "LPAE selftests"
+	depends on IOMMU_IO_PGTABLE_LPAE
+	help
+	  Enable self-tests for the LPAE page table allocator. This
+	  performs a series of page-table consistency checks during boot.
+
+	  If unsure, say N here.
+
 endmenu
 
 config OF_IOMMU
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 9dbaa2e48424..669e322a83a4 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -733,3 +733,193 @@ struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
 	.alloc	= arm_lpae_alloc_pgtable_s2,
 	.free	= arm_lpae_free_pgtable,
 };
+
+#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
+
+static struct io_pgtable_cfg *cfg_cookie;
+
+static void dummy_tlb_flush_all(void *cookie)
+{
+	WARN_ON(cookie != cfg_cookie);
+}
+
+static void dummy_tlb_add_flush(unsigned long iova, size_t size, bool leaf,
+				void *cookie)
+{
+	WARN_ON(cookie != cfg_cookie);
+	WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
+}
+
+static void dummy_tlb_sync(void *cookie)
+{
+	WARN_ON(cookie != cfg_cookie);
+}
+
+static void dummy_flush_pgtable(void *ptr, size_t size, void *cookie)
+{
+	WARN_ON(cookie != cfg_cookie);
+}
+
+static struct iommu_gather_ops dummy_tlb_ops __initdata = {
+	.tlb_flush_all	= dummy_tlb_flush_all,
+	.tlb_add_flush	= dummy_tlb_add_flush,
+	.tlb_sync	= dummy_tlb_sync,
+	.flush_pgtable	= dummy_flush_pgtable,
+};
+
+static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+	pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
+		cfg->pgsize_bitmap, cfg->ias);
+	pr_err("data: %d levels, %d pages_per_pgd, %lu pg_shift, %lu bits_per_level, pgd @ %p\n",
+		data->levels, data->pages_per_pgd, data->pg_shift,
+		data->bits_per_level, data->pgd);
+}
+
+#define __FAIL(ops, i)	({						\
+		WARN(1, "selftest: test failed for fmt idx %d\n", (i));	\
+		arm_lpae_dump_ops(ops);					\
+		-EFAULT;						\
+})
+
+static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
+{
+	static const enum io_pgtable_fmt fmts[] = {
+		ARM_LPAE_S1,
+		ARM_LPAE_S2,
+	};
+
+	int i, j;
+	unsigned long iova;
+	size_t size;
+	struct io_pgtable_ops *ops;
+
+	for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
+		cfg_cookie = cfg;
+		ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
+		if (!ops) {
+			pr_err("selftest: failed to allocate io pgtable ops\n");
+			return -ENOMEM;
+		}
+
+		/*
+		 * Initial sanity checks.
+		 * Empty page tables shouldn't provide any translations.
+		 */
+		if (ops->iova_to_phys(ops, 42))
+			return __FAIL(ops, i);
+
+		if (ops->iova_to_phys(ops, SZ_1G + 42))
+			return __FAIL(ops, i);
+
+		if (ops->iova_to_phys(ops, SZ_2G + 42))
+			return __FAIL(ops, i);
+
+		/*
+		 * Distinct mappings of different granule sizes.
+		 */
+		iova = 0;
+		j = find_first_bit(&cfg->pgsize_bitmap, BITS_PER_LONG);
+		while (j != BITS_PER_LONG) {
+			size = 1UL << j;
+
+			if (ops->map(ops, iova, iova, size, IOMMU_READ |
+							    IOMMU_WRITE |
+							    IOMMU_NOEXEC |
+							    IOMMU_CACHE))
+				return __FAIL(ops, i);
+
+			/* Overlapping mappings */
+			if (!ops->map(ops, iova, iova + size, size,
+				      IOMMU_READ | IOMMU_NOEXEC))
+				return __FAIL(ops, i);
+
+			if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
+				return __FAIL(ops, i);
+
+			iova += SZ_1G;
+			j++;
+			j = find_next_bit(&cfg->pgsize_bitmap, BITS_PER_LONG, j);
+		}
+
+		/* Partial unmap */
+		size = 1UL << __ffs(cfg->pgsize_bitmap);
+		if (ops->unmap(ops, SZ_1G + size, size) != size)
+			return __FAIL(ops, i);
+
+		/* Remap of partial unmap */
+		if (ops->map(ops, SZ_1G + size, size, size, IOMMU_READ))
+			return __FAIL(ops, i);
+
+		if (ops->iova_to_phys(ops, SZ_1G + size + 42) != (size + 42))
+			return __FAIL(ops, i);
+
+		/* Full unmap */
+		iova = 0;
+		j = find_first_bit(&cfg->pgsize_bitmap, BITS_PER_LONG);
+		while (j != BITS_PER_LONG) {
+			size = 1UL << j;
+
+			if (ops->unmap(ops, iova, size) != size)
+				return __FAIL(ops, i);
+
+			if (ops->iova_to_phys(ops, iova + 42))
+				return __FAIL(ops, i);
+
+			/* Remap full block */
+			if (ops->map(ops, iova, iova, size, IOMMU_WRITE))
+				return __FAIL(ops, i);
+
+			if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
+				return __FAIL(ops, i);
+
+			iova += SZ_1G;
+			j++;
+			j = find_next_bit(&cfg->pgsize_bitmap, BITS_PER_LONG, j);
+		}
+
+		free_io_pgtable_ops(ops);
+	}
+
+	return 0;
+}
+
+static int __init arm_lpae_do_selftests(void)
+{
+	static const unsigned long pgsize[] = {
+		SZ_4K | SZ_2M | SZ_1G,
+		SZ_16K | SZ_32M,
+		SZ_64K | SZ_512M,
+	};
+
+	static const unsigned int ias[] = {
+		32, 36, 40, 42, 44, 48,
+	};
+
+	int i, j, pass = 0, fail = 0;
+	struct io_pgtable_cfg cfg = {
+		.tlb = &dummy_tlb_ops,
+		.oas = 48,
+	};
+
+	for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
+		for (j = 0; j < ARRAY_SIZE(ias); ++j) {
+			cfg.pgsize_bitmap = pgsize[i];
+			cfg.ias = ias[j];
+			pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u\n",
+				pgsize[i], ias[j]);
+			if (arm_lpae_run_tests(&cfg))
+				fail++;
+			else
+				pass++;
+		}
+	}
+
+	pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
+	return fail ? -EFAULT : 0;
+}
+subsys_initcall(arm_lpae_do_selftests);
+#endif
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/4] iommu/arm-smmu: make use of generic LPAE allocator
  2014-11-27 11:51 ` Will Deacon
@ 2014-11-27 11:51     ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

The ARM SMMU can walk LPAE page tables, so make use of the generic
allocator.

Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
---
 arch/arm64/Kconfig       |   1 -
 drivers/iommu/Kconfig    |   6 +-
 drivers/iommu/arm-smmu.c | 872 ++++++++++++++---------------------------------
 3 files changed, 252 insertions(+), 627 deletions(-)
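
In outline, the conversion below deletes the driver's open-coded
pgd/pud/pmd/pte handling and replaces it with a set of TLB/flush callbacks
plus calls into the generic allocator. A heavily condensed sketch of the new
shape, taken from the diff that follows (error handling and locking omitted):

static struct iommu_gather_ops arm_smmu_gather_ops = {
	.tlb_flush_all	= arm_smmu_tlb_inv_context,
	.tlb_add_flush	= arm_smmu_tlb_inv_range_nosync,
	.tlb_sync	= arm_smmu_tlb_sync,
	.flush_pgtable	= arm_smmu_flush_pgtable,
};

/* arm_smmu_init_domain_context(): build a cfg and allocate the tables */
pgtbl_cfg = (struct io_pgtable_cfg) {
	.pgsize_bitmap	= arm_smmu_ops.pgsize_bitmap,
	.ias		= ias,		/* va_size or ipa_size, per stage */
	.oas		= oas,		/* ipa_size or pa_size, per stage */
	.tlb		= &arm_smmu_gather_ops,
};
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);

/* arm_smmu_map()/arm_smmu_unmap(): route through ops under pgtbl_lock */
ret = ops->map(ops, iova, paddr, size, prot);
ret = ops->unmap(ops, iova, size);

The TTBR/TCR/MAIR values that used to be computed by hand are now read back
from pgtbl_cfg.arm_lpae_s1_cfg (or arm_lpae_s2_cfg) when programming the
context bank.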

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9532f8d5857e..d2adb09c8f04 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -233,7 +233,6 @@ config ARM64_VA_BITS_42
 
 config ARM64_VA_BITS_48
 	bool "48-bit"
-	depends on !ARM_SMMU
 
 endchoice
 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dde72d0990b0..7e1bc3262663 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -317,13 +317,13 @@ config SPAPR_TCE_IOMMU
 
 config ARM_SMMU
 	bool "ARM Ltd. System MMU (SMMU) Support"
-	depends on ARM64 || (ARM_LPAE && OF)
+	depends on ARM64 || ARM
 	select IOMMU_API
+	select IOMMU_IO_PGTABLE_LPAE
 	select ARM_DMA_USE_IOMMU if ARM
 	help
 	  Support for implementations of the ARM System MMU architecture
-	  versions 1 and 2. The driver supports both v7l and v8l table
-	  formats with 4k and 64k page sizes.
+	  versions 1 and 2.
 
 	  Say Y here if your SoC includes an IOMMU device implementing
 	  the ARM SMMU architecture.
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7a80f710ba2d..f9cfac2d8ae9 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -23,8 +23,6 @@
  *	- Stream-matching and stream-indexing
  *	- v7/v8 long-descriptor format
  *	- Non-secure access to the SMMU
- *	- 4k and 64k pages, with contiguous pte hints.
- *	- Up to 48-bit addressing (dependent on VA_BITS)
  *	- Context fault reporting
  */
 
@@ -36,7 +34,6 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/iommu.h>
-#include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/of.h>
 #include <linux/pci.h>
@@ -46,7 +43,7 @@
 
 #include <linux/amba/bus.h>
 
-#include <asm/pgalloc.h>
+#include "io-pgtable.h"
 
 /* Maximum number of stream IDs assigned to a single device */
 #define MAX_MASTER_STREAMIDS		MAX_PHANDLE_ARGS
@@ -71,40 +68,6 @@
 		((smmu->options & ARM_SMMU_OPT_SECURE_CFG_ACCESS)	\
 			? 0x400 : 0))
 
-/* Page table bits */
-#define ARM_SMMU_PTE_XN			(((pteval_t)3) << 53)
-#define ARM_SMMU_PTE_CONT		(((pteval_t)1) << 52)
-#define ARM_SMMU_PTE_AF			(((pteval_t)1) << 10)
-#define ARM_SMMU_PTE_SH_NS		(((pteval_t)0) << 8)
-#define ARM_SMMU_PTE_SH_OS		(((pteval_t)2) << 8)
-#define ARM_SMMU_PTE_SH_IS		(((pteval_t)3) << 8)
-#define ARM_SMMU_PTE_PAGE		(((pteval_t)3) << 0)
-
-#if PAGE_SIZE == SZ_4K
-#define ARM_SMMU_PTE_CONT_ENTRIES	16
-#elif PAGE_SIZE == SZ_64K
-#define ARM_SMMU_PTE_CONT_ENTRIES	32
-#else
-#define ARM_SMMU_PTE_CONT_ENTRIES	1
-#endif
-
-#define ARM_SMMU_PTE_CONT_SIZE		(PAGE_SIZE * ARM_SMMU_PTE_CONT_ENTRIES)
-#define ARM_SMMU_PTE_CONT_MASK		(~(ARM_SMMU_PTE_CONT_SIZE - 1))
-
-/* Stage-1 PTE */
-#define ARM_SMMU_PTE_AP_UNPRIV		(((pteval_t)1) << 6)
-#define ARM_SMMU_PTE_AP_RDONLY		(((pteval_t)2) << 6)
-#define ARM_SMMU_PTE_ATTRINDX_SHIFT	2
-#define ARM_SMMU_PTE_nG			(((pteval_t)1) << 11)
-
-/* Stage-2 PTE */
-#define ARM_SMMU_PTE_HAP_FAULT		(((pteval_t)0) << 6)
-#define ARM_SMMU_PTE_HAP_READ		(((pteval_t)1) << 6)
-#define ARM_SMMU_PTE_HAP_WRITE		(((pteval_t)2) << 6)
-#define ARM_SMMU_PTE_MEMATTR_OIWB	(((pteval_t)0xf) << 2)
-#define ARM_SMMU_PTE_MEMATTR_NC		(((pteval_t)0x5) << 2)
-#define ARM_SMMU_PTE_MEMATTR_DEV	(((pteval_t)0x1) << 2)
-
 /* Configuration registers */
 #define ARM_SMMU_GR0_sCR0		0x0
 #define sCR0_CLIENTPD			(1 << 0)
@@ -132,17 +95,11 @@
 #define ARM_SMMU_GR0_sGFSYNR0		0x50
 #define ARM_SMMU_GR0_sGFSYNR1		0x54
 #define ARM_SMMU_GR0_sGFSYNR2		0x58
-#define ARM_SMMU_GR0_PIDR0		0xfe0
-#define ARM_SMMU_GR0_PIDR1		0xfe4
-#define ARM_SMMU_GR0_PIDR2		0xfe8
 
 #define ID0_S1TS			(1 << 30)
 #define ID0_S2TS			(1 << 29)
 #define ID0_NTS				(1 << 28)
 #define ID0_SMS				(1 << 27)
-#define ID0_PTFS_SHIFT			24
-#define ID0_PTFS_MASK			0x2
-#define ID0_PTFS_V8_ONLY		0x2
 #define ID0_CTTW			(1 << 14)
 #define ID0_NUMIRPT_SHIFT		16
 #define ID0_NUMIRPT_MASK		0xff
@@ -169,9 +126,6 @@
 #define ID2_PTFS_16K			(1 << 13)
 #define ID2_PTFS_64K			(1 << 14)
 
-#define PIDR2_ARCH_SHIFT		4
-#define PIDR2_ARCH_MASK			0xf
-
 /* Global TLB invalidation */
 #define ARM_SMMU_GR0_STLBIALL		0x60
 #define ARM_SMMU_GR0_TLBIVMID		0x64
@@ -231,13 +185,18 @@
 #define ARM_SMMU_CB_TTBCR2		0x10
 #define ARM_SMMU_CB_TTBR0_LO		0x20
 #define ARM_SMMU_CB_TTBR0_HI		0x24
+#define ARM_SMMU_CB_TTBR1_LO		0x28
+#define ARM_SMMU_CB_TTBR1_HI		0x2c
 #define ARM_SMMU_CB_TTBCR		0x30
 #define ARM_SMMU_CB_S1_MAIR0		0x38
+#define ARM_SMMU_CB_S1_MAIR1		0x3c
 #define ARM_SMMU_CB_FSR			0x58
 #define ARM_SMMU_CB_FAR_LO		0x60
 #define ARM_SMMU_CB_FAR_HI		0x64
 #define ARM_SMMU_CB_FSYNR0		0x68
+#define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
+#define ARM_SMMU_CB_S1_TLBIVAL		0x620
 
 #define SCTLR_S1_ASIDPNE		(1 << 12)
 #define SCTLR_CFCFG			(1 << 7)
@@ -252,44 +211,9 @@
 #define RESUME_RETRY			(0 << 0)
 #define RESUME_TERMINATE		(1 << 0)
 
-#define TTBCR_EAE			(1 << 31)
-
-#define TTBCR_PASIZE_SHIFT		16
-#define TTBCR_PASIZE_MASK		0x7
-
-#define TTBCR_TG0_4K			(0 << 14)
-#define TTBCR_TG0_64K			(1 << 14)
-
-#define TTBCR_SH0_SHIFT			12
-#define TTBCR_SH0_MASK			0x3
-#define TTBCR_SH_NS			0
-#define TTBCR_SH_OS			2
-#define TTBCR_SH_IS			3
-
-#define TTBCR_ORGN0_SHIFT		10
-#define TTBCR_IRGN0_SHIFT		8
-#define TTBCR_RGN_MASK			0x3
-#define TTBCR_RGN_NC			0
-#define TTBCR_RGN_WBWA			1
-#define TTBCR_RGN_WT			2
-#define TTBCR_RGN_WB			3
-
-#define TTBCR_SL0_SHIFT			6
-#define TTBCR_SL0_MASK			0x3
-#define TTBCR_SL0_LVL_2			0
-#define TTBCR_SL0_LVL_1			1
-
-#define TTBCR_T1SZ_SHIFT		16
-#define TTBCR_T0SZ_SHIFT		0
-#define TTBCR_SZ_MASK			0xf
-
 #define TTBCR2_SEP_SHIFT		15
 #define TTBCR2_SEP_MASK			0x7
 
-#define TTBCR2_PASIZE_SHIFT		0
-#define TTBCR2_PASIZE_MASK		0x7
-
-/* Common definitions for PASize and SEP fields */
 #define TTBCR2_ADDR_32			0
 #define TTBCR2_ADDR_36			1
 #define TTBCR2_ADDR_40			2
@@ -297,16 +221,7 @@
 #define TTBCR2_ADDR_44			4
 #define TTBCR2_ADDR_48			5
 
-#define TTBRn_HI_ASID_SHIFT		16
-
-#define MAIR_ATTR_SHIFT(n)		((n) << 3)
-#define MAIR_ATTR_MASK			0xff
-#define MAIR_ATTR_DEVICE		0x04
-#define MAIR_ATTR_NC			0x44
-#define MAIR_ATTR_WBRWA			0xff
-#define MAIR_ATTR_IDX_NC		0
-#define MAIR_ATTR_IDX_CACHE		1
-#define MAIR_ATTR_IDX_DEV		2
+#define TTBRn_HI_ASID_SHIFT            16
 
 #define FSR_MULTI			(1 << 31)
 #define FSR_SS				(1 << 30)
@@ -380,10 +295,9 @@ struct arm_smmu_device {
 	u32				num_mapping_groups;
 	DECLARE_BITMAP(smr_map, ARM_SMMU_MAX_SMRS);
 
-	unsigned long			s1_input_size;
-	unsigned long			s1_output_size;
-	unsigned long			s2_input_size;
-	unsigned long			s2_output_size;
+	unsigned long			va_size;
+	unsigned long			ipa_size;
+	unsigned long			pa_size;
 
 	u32				num_global_irqs;
 	u32				num_context_irqs;
@@ -397,7 +311,6 @@ struct arm_smmu_cfg {
 	u8				cbndx;
 	u8				irptndx;
 	u32				cbar;
-	pgd_t				*pgd;
 };
 #define INVALID_IRPTNDX			0xff
 
@@ -412,11 +325,15 @@ enum arm_smmu_domain_stage {
 
 struct arm_smmu_domain {
 	struct arm_smmu_device		*smmu;
+	struct io_pgtable_ops		*pgtbl_ops;
+	spinlock_t			pgtbl_lock;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
-	spinlock_t			lock;
+	struct mutex			init_mutex; /* Protects smmu pointer */
 };
 
+static struct iommu_ops arm_smmu_ops;
+
 static DEFINE_SPINLOCK(arm_smmu_devices_lock);
 static LIST_HEAD(arm_smmu_devices);
 
@@ -597,7 +514,7 @@ static void __arm_smmu_free_bitmap(unsigned long *map, int idx)
 }
 
 /* Wait for any pending TLB invalidations to complete */
-static void arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
+static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
 {
 	int count = 0;
 	void __iomem *gr0_base = ARM_SMMU_GR0(smmu);
@@ -615,12 +532,19 @@ static void arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
 	}
 }
 
-static void arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_tlb_sync(void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	__arm_smmu_tlb_sync(smmu_domain->smmu);
+}
+
+static void arm_smmu_tlb_inv_context(void *cookie)
 {
+	struct arm_smmu_domain *smmu_domain = cookie;
 	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
-	void __iomem *base = ARM_SMMU_GR0(smmu);
 	bool stage1 = cfg->cbar != CBAR_TYPE_S2_TRANS;
+	void __iomem *base;
 
 	if (stage1) {
 		base = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
@@ -632,9 +556,70 @@ static void arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain)
 			       base + ARM_SMMU_GR0_TLBIVMID);
 	}
 
-	arm_smmu_tlb_sync(smmu);
+	__arm_smmu_tlb_sync(smmu);
+}
+
+static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+					  bool leaf, void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	bool stage1 = cfg->cbar != CBAR_TYPE_S2_TRANS;
+	void __iomem *reg;
+
+	if (stage1) {
+		reg = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
+		reg += leaf ? ARM_SMMU_CB_S1_TLBIVAL : ARM_SMMU_CB_S1_TLBIVA;
+
+		if (!IS_ENABLED(CONFIG_64BIT) || smmu->version == ARM_SMMU_V1) {
+			iova &= ~12UL;
+			iova |= ARM_SMMU_CB_ASID(cfg);
+			writel_relaxed(iova, reg);
+#ifdef CONFIG_64BIT
+		} else {
+			iova >>= 12;
+			iova |= (u64)ARM_SMMU_CB_ASID(cfg) << 48;
+			writeq_relaxed(iova, reg);
+#endif
+		}
+	} else {
+		/* Invalidate by IPA is optional */
+		reg = ARM_SMMU_GR0(smmu) + ARM_SMMU_GR0_TLBIVMID;
+		writel_relaxed(ARM_SMMU_CB_VMID(cfg), reg);
+	}
 }
 
+static void arm_smmu_flush_pgtable(void *addr, size_t size, void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	unsigned long offset = (unsigned long)addr & ~PAGE_MASK;
+
+
+	/* Ensure new page tables are visible to the hardware walker */
+	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) {
+		dsb(ishst);
+	} else {
+		/*
+		 * If the SMMU can't walk tables in the CPU caches, treat them
+		 * like non-coherent DMA since we need to flush the new entries
+		 * all the way out to memory. There's no possibility of
+		 * recursion here as the SMMU table walker will not be wired
+		 * through another SMMU.
+		 */
+		dma_map_page(smmu->dev, virt_to_page(addr), offset, size,
+			     DMA_TO_DEVICE);
+	}
+}
+
+static struct iommu_gather_ops arm_smmu_gather_ops = {
+	.tlb_flush_all	= arm_smmu_tlb_inv_context,
+	.tlb_add_flush	= arm_smmu_tlb_inv_range_nosync,
+	.tlb_sync	= arm_smmu_tlb_sync,
+	.flush_pgtable	= arm_smmu_flush_pgtable,
+};
+
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
 	int flags, ret;
@@ -712,29 +697,8 @@ static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
 	return IRQ_HANDLED;
 }
 
-static void arm_smmu_flush_pgtable(struct arm_smmu_device *smmu, void *addr,
-				   size_t size)
-{
-	unsigned long offset = (unsigned long)addr & ~PAGE_MASK;
-
-
-	/* Ensure new page tables are visible to the hardware walker */
-	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) {
-		dsb(ishst);
-	} else {
-		/*
-		 * If the SMMU can't walk tables in the CPU caches, treat them
-		 * like non-coherent DMA since we need to flush the new entries
-		 * all the way out to memory. There's no possibility of
-		 * recursion here as the SMMU table walker will not be wired
-		 * through another SMMU.
-		 */
-		dma_map_page(smmu->dev, virt_to_page(addr), offset, size,
-				DMA_TO_DEVICE);
-	}
-}
-
-static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
+				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	u32 reg;
 	bool stage1;
@@ -771,124 +735,68 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
 #else
 		reg = CBA2R_RW64_32BIT;
 #endif
-		writel_relaxed(reg,
-			       gr1_base + ARM_SMMU_GR1_CBA2R(cfg->cbndx));
-
-		/* TTBCR2 */
-		switch (smmu->s1_input_size) {
-		case 32:
-			reg = (TTBCR2_ADDR_32 << TTBCR2_SEP_SHIFT);
-			break;
-		case 36:
-			reg = (TTBCR2_ADDR_36 << TTBCR2_SEP_SHIFT);
-			break;
-		case 39:
-		case 40:
-			reg = (TTBCR2_ADDR_40 << TTBCR2_SEP_SHIFT);
-			break;
-		case 42:
-			reg = (TTBCR2_ADDR_42 << TTBCR2_SEP_SHIFT);
-			break;
-		case 44:
-			reg = (TTBCR2_ADDR_44 << TTBCR2_SEP_SHIFT);
-			break;
-		case 48:
-			reg = (TTBCR2_ADDR_48 << TTBCR2_SEP_SHIFT);
-			break;
-		}
-
-		switch (smmu->s1_output_size) {
-		case 32:
-			reg |= (TTBCR2_ADDR_32 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 36:
-			reg |= (TTBCR2_ADDR_36 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 39:
-		case 40:
-			reg |= (TTBCR2_ADDR_40 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 42:
-			reg |= (TTBCR2_ADDR_42 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 44:
-			reg |= (TTBCR2_ADDR_44 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 48:
-			reg |= (TTBCR2_ADDR_48 << TTBCR2_PASIZE_SHIFT);
-			break;
-		}
-
-		if (stage1)
-			writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR2);
+		writel_relaxed(reg, gr1_base + ARM_SMMU_GR1_CBA2R(cfg->cbndx));
 	}
 
-	/* TTBR0 */
-	arm_smmu_flush_pgtable(smmu, cfg->pgd,
-			       PTRS_PER_PGD * sizeof(pgd_t));
-	reg = __pa(cfg->pgd);
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
-	reg = (phys_addr_t)__pa(cfg->pgd) >> 32;
-	if (stage1)
+	/* TTBRs */
+	if (stage1) {
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[0];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[0] >> 32;
 		reg |= ARM_SMMU_CB_ASID(cfg) << TTBRn_HI_ASID_SHIFT;
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
-
-	/*
-	 * TTBCR
-	 * We use long descriptor, with inner-shareable WBWA tables in TTBR0.
-	 */
-	if (smmu->version > ARM_SMMU_V1) {
-		if (PAGE_SIZE == SZ_4K)
-			reg = TTBCR_TG0_4K;
-		else
-			reg = TTBCR_TG0_64K;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
 
-		if (!stage1) {
-			reg |= (64 - smmu->s2_input_size) << TTBCR_T0SZ_SHIFT;
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[1];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR1_LO);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[1] >> 32;
+		reg |= ARM_SMMU_CB_ASID(cfg) << TTBRn_HI_ASID_SHIFT;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR1_HI);
+	} else {
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vttbr >> 32;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
+	}
 
-			switch (smmu->s2_output_size) {
+	/* TTBCR */
+	if (stage1) {
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.tcr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
+		if (smmu->version > ARM_SMMU_V1) {
+			reg = pgtbl_cfg->arm_lpae_s1_cfg.tcr >> 32;
+			switch (smmu->va_size) {
 			case 32:
-				reg |= (TTBCR2_ADDR_32 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_32 << TTBCR2_SEP_SHIFT);
 				break;
 			case 36:
-				reg |= (TTBCR2_ADDR_36 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_36 << TTBCR2_SEP_SHIFT);
 				break;
 			case 40:
-				reg |= (TTBCR2_ADDR_40 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_40 << TTBCR2_SEP_SHIFT);
 				break;
 			case 42:
-				reg |= (TTBCR2_ADDR_42 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_42 << TTBCR2_SEP_SHIFT);
 				break;
 			case 44:
-				reg |= (TTBCR2_ADDR_44 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_44 << TTBCR2_SEP_SHIFT);
 				break;
 			case 48:
-				reg |= (TTBCR2_ADDR_48 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_48 << TTBCR2_SEP_SHIFT);
 				break;
 			}
-		} else {
-			reg |= (64 - smmu->s1_input_size) << TTBCR_T0SZ_SHIFT;
+			writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR2);
 		}
 	} else {
-		reg = 0;
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
 	}
 
-	reg |= TTBCR_EAE |
-	      (TTBCR_SH_IS << TTBCR_SH0_SHIFT) |
-	      (TTBCR_RGN_WBWA << TTBCR_ORGN0_SHIFT) |
-	      (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT);
-
-	if (!stage1)
-		reg |= (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
-
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
-
-	/* MAIR0 (stage-1 only) */
+	/* MAIRs (stage-1 only) */
 	if (stage1) {
-		reg = (MAIR_ATTR_NC << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_NC)) |
-		      (MAIR_ATTR_WBRWA << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_CACHE)) |
-		      (MAIR_ATTR_DEVICE << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_DEV));
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.mair[0];
 		writel_relaxed(reg, cb_base + ARM_SMMU_CB_S1_MAIR0);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.mair[1];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_S1_MAIR1);
 	}
 
 	/* SCTLR */
@@ -905,11 +813,14 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 					struct arm_smmu_device *smmu)
 {
 	int irq, start, ret = 0;
-	unsigned long flags;
+	unsigned long ias, oas;
+	struct io_pgtable_ops *pgtbl_ops;
+	struct io_pgtable_cfg pgtbl_cfg;
+	enum io_pgtable_fmt fmt;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
 	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
 
-	spin_lock_irqsave(&smmu_domain->lock, flags);
+	mutex_lock(&smmu_domain->init_mutex);
 	if (smmu_domain->smmu)
 		goto out_unlock;
 
@@ -940,6 +851,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	case ARM_SMMU_DOMAIN_S1:
 		cfg->cbar = CBAR_TYPE_S1_TRANS_S2_BYPASS;
 		start = smmu->num_s2_context_banks;
+		ias = smmu->va_size;
+		oas = smmu->ipa_size;
+		fmt = ARM_LPAE_S1;
 		break;
 	case ARM_SMMU_DOMAIN_NESTED:
 		/*
@@ -949,6 +863,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	case ARM_SMMU_DOMAIN_S2:
 		cfg->cbar = CBAR_TYPE_S2_TRANS;
 		start = 0;
+		ias = smmu->ipa_size;
+		oas = smmu->pa_size;
+		fmt = ARM_LPAE_S2;
 		break;
 	default:
 		ret = -EINVAL;
@@ -968,10 +885,30 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		cfg->irptndx = cfg->cbndx;
 	}
 
-	ACCESS_ONCE(smmu_domain->smmu) = smmu;
-	arm_smmu_init_context_bank(smmu_domain);
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
+	pgtbl_cfg = (struct io_pgtable_cfg) {
+		.pgsize_bitmap	= arm_smmu_ops.pgsize_bitmap,
+		.ias		= ias,
+		.oas		= oas,
+		.tlb		= &arm_smmu_gather_ops,
+	};
 
+	smmu_domain->smmu = smmu;
+	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
+	if (!pgtbl_ops) {
+		ret = -ENOMEM;
+		goto out_clear_smmu;
+	}
+
+	/* Update our support page sizes to reflect the page table format */
+	arm_smmu_ops.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+
+	/* Initialise the context bank with our page table cfg */
+	arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);
+
+	/*
+	 * Request context fault interrupt. Do this last to avoid the
+	 * handler seeing a half-initialised domain state.
+	 */
 	irq = smmu->irqs[smmu->num_global_irqs + cfg->irptndx];
 	ret = request_irq(irq, arm_smmu_context_fault, IRQF_SHARED,
 			  "arm-smmu-context-fault", domain);
@@ -981,10 +918,16 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		cfg->irptndx = INVALID_IRPTNDX;
 	}
 
+	mutex_unlock(&smmu_domain->init_mutex);
+
+	/* Publish page table ops for map/unmap */
+	smmu_domain->pgtbl_ops = pgtbl_ops;
 	return 0;
 
+out_clear_smmu:
+	smmu_domain->smmu = NULL;
 out_unlock:
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
+	mutex_unlock(&smmu_domain->init_mutex);
 	return ret;
 }
 
@@ -999,23 +942,27 @@ static void arm_smmu_destroy_domain_context(struct iommu_domain *domain)
 	if (!smmu)
 		return;
 
-	/* Disable the context bank and nuke the TLB before freeing it. */
+	/*
+	 * Disable the context bank and free the page tables before freeing
+	 * it.
+	 */
 	cb_base = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
 	writel_relaxed(0, cb_base + ARM_SMMU_CB_SCTLR);
-	arm_smmu_tlb_inv_context(smmu_domain);
 
 	if (cfg->irptndx != INVALID_IRPTNDX) {
 		irq = smmu->irqs[smmu->num_global_irqs + cfg->irptndx];
 		free_irq(irq, domain);
 	}
 
+	if (smmu_domain->pgtbl_ops)
+		free_io_pgtable_ops(smmu_domain->pgtbl_ops);
+
 	__arm_smmu_free_bitmap(smmu->context_map, cfg->cbndx);
 }
 
 static int arm_smmu_domain_init(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain;
-	pgd_t *pgd;
 
 	/*
 	 * Allocate the domain and initialise some of its data structures.
@@ -1026,81 +973,10 @@ static int arm_smmu_domain_init(struct iommu_domain *domain)
 	if (!smmu_domain)
 		return -ENOMEM;
 
-	pgd = kcalloc(PTRS_PER_PGD, sizeof(pgd_t), GFP_KERNEL);
-	if (!pgd)
-		goto out_free_domain;
-	smmu_domain->cfg.pgd = pgd;
-
-	spin_lock_init(&smmu_domain->lock);
+	mutex_init(&smmu_domain->init_mutex);
+	spin_lock_init(&smmu_domain->pgtbl_lock);
 	domain->priv = smmu_domain;
 	return 0;
-
-out_free_domain:
-	kfree(smmu_domain);
-	return -ENOMEM;
-}
-
-static void arm_smmu_free_ptes(pmd_t *pmd)
-{
-	pgtable_t table = pmd_pgtable(*pmd);
-
-	__free_page(table);
-}
-
-static void arm_smmu_free_pmds(pud_t *pud)
-{
-	int i;
-	pmd_t *pmd, *pmd_base = pmd_offset(pud, 0);
-
-	pmd = pmd_base;
-	for (i = 0; i < PTRS_PER_PMD; ++i) {
-		if (pmd_none(*pmd))
-			continue;
-
-		arm_smmu_free_ptes(pmd);
-		pmd++;
-	}
-
-	pmd_free(NULL, pmd_base);
-}
-
-static void arm_smmu_free_puds(pgd_t *pgd)
-{
-	int i;
-	pud_t *pud, *pud_base = pud_offset(pgd, 0);
-
-	pud = pud_base;
-	for (i = 0; i < PTRS_PER_PUD; ++i) {
-		if (pud_none(*pud))
-			continue;
-
-		arm_smmu_free_pmds(pud);
-		pud++;
-	}
-
-	pud_free(NULL, pud_base);
-}
-
-static void arm_smmu_free_pgtables(struct arm_smmu_domain *smmu_domain)
-{
-	int i;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
-	pgd_t *pgd, *pgd_base = cfg->pgd;
-
-	/*
-	 * Recursively free the page tables for this domain. We don't
-	 * care about speculative TLB filling because the tables should
-	 * not be active in any context bank at this point (SCTLR.M is 0).
-	 */
-	pgd = pgd_base;
-	for (i = 0; i < PTRS_PER_PGD; ++i) {
-		if (pgd_none(*pgd))
-			continue;
-		arm_smmu_free_puds(pgd);
-		pgd++;
-	}
-
-	kfree(pgd_base);
 }
 
 static void arm_smmu_domain_destroy(struct iommu_domain *domain)
@@ -1112,7 +988,6 @@ static void arm_smmu_domain_destroy(struct iommu_domain *domain)
 	 * already been detached.
 	 */
 	arm_smmu_destroy_domain_context(domain);
-	arm_smmu_free_pgtables(smmu_domain);
 	kfree(smmu_domain);
 }
 
@@ -1244,7 +1119,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
-	struct arm_smmu_device *smmu, *dom_smmu;
+	struct arm_smmu_device *smmu;
 	struct arm_smmu_master_cfg *cfg;
 
 	smmu = find_smmu_for_device(dev);
@@ -1258,21 +1133,16 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		return -EEXIST;
 	}
 
+	/* Ensure that the domain is finalised */
+	ret = arm_smmu_init_domain_context(domain, smmu);
+	if (IS_ERR_VALUE(ret))
+		return ret;
+
 	/*
 	 * Sanity check the domain. We don't support domains across
 	 * different SMMUs.
 	 */
-	dom_smmu = ACCESS_ONCE(smmu_domain->smmu);
-	if (!dom_smmu) {
-		/* Now that we have a master, we can finalise the domain */
-		ret = arm_smmu_init_domain_context(domain, smmu);
-		if (IS_ERR_VALUE(ret))
-			return ret;
-
-		dom_smmu = smmu_domain->smmu;
-	}
-
-	if (dom_smmu != smmu) {
+	if (smmu_domain->smmu != smmu) {
 		dev_err(dev,
 			"cannot attach to SMMU %s whilst already attached to domain on SMMU %s\n",
 			dev_name(smmu_domain->smmu->dev), dev_name(smmu->dev));
@@ -1303,293 +1173,55 @@ static void arm_smmu_detach_dev(struct iommu_domain *domain, struct device *dev)
 	arm_smmu_domain_remove_master(smmu_domain, cfg);
 }
 
-static bool arm_smmu_pte_is_contiguous_range(unsigned long addr,
-					     unsigned long end)
-{
-	return !(addr & ~ARM_SMMU_PTE_CONT_MASK) &&
-		(addr + ARM_SMMU_PTE_CONT_SIZE <= end);
-}
-
-static int arm_smmu_alloc_init_pte(struct arm_smmu_device *smmu, pmd_t *pmd,
-				   unsigned long addr, unsigned long end,
-				   unsigned long pfn, int prot, int stage)
-{
-	pte_t *pte, *start;
-	pteval_t pteval = ARM_SMMU_PTE_PAGE | ARM_SMMU_PTE_AF;
-
-	if (pmd_none(*pmd)) {
-		/* Allocate a new set of tables */
-		pgtable_t table = alloc_page(GFP_ATOMIC|__GFP_ZERO);
-
-		if (!table)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, page_address(table), PAGE_SIZE);
-		pmd_populate(NULL, pmd, table);
-		arm_smmu_flush_pgtable(smmu, pmd, sizeof(*pmd));
-	}
-
-	if (stage == 1) {
-		pteval |= ARM_SMMU_PTE_AP_UNPRIV | ARM_SMMU_PTE_nG;
-		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
-			pteval |= ARM_SMMU_PTE_AP_RDONLY;
-
-		if (prot & IOMMU_CACHE)
-			pteval |= (MAIR_ATTR_IDX_CACHE <<
-				   ARM_SMMU_PTE_ATTRINDX_SHIFT);
-	} else {
-		pteval |= ARM_SMMU_PTE_HAP_FAULT;
-		if (prot & IOMMU_READ)
-			pteval |= ARM_SMMU_PTE_HAP_READ;
-		if (prot & IOMMU_WRITE)
-			pteval |= ARM_SMMU_PTE_HAP_WRITE;
-		if (prot & IOMMU_CACHE)
-			pteval |= ARM_SMMU_PTE_MEMATTR_OIWB;
-		else
-			pteval |= ARM_SMMU_PTE_MEMATTR_NC;
-	}
-
-	if (prot & IOMMU_NOEXEC)
-		pteval |= ARM_SMMU_PTE_XN;
-
-	/* If no access, create a faulting entry to avoid TLB fills */
-	if (!(prot & (IOMMU_READ | IOMMU_WRITE)))
-		pteval &= ~ARM_SMMU_PTE_PAGE;
-
-	pteval |= ARM_SMMU_PTE_SH_IS;
-	start = pmd_page_vaddr(*pmd) + pte_index(addr);
-	pte = start;
-
-	/*
-	 * Install the page table entries. This is fairly complicated
-	 * since we attempt to make use of the contiguous hint in the
-	 * ptes where possible. The contiguous hint indicates a series
-	 * of ARM_SMMU_PTE_CONT_ENTRIES ptes mapping a physically
-	 * contiguous region with the following constraints:
-	 *
-	 *   - The region start is aligned to ARM_SMMU_PTE_CONT_SIZE
-	 *   - Each pte in the region has the contiguous hint bit set
-	 *
-	 * This complicates unmapping (also handled by this code, when
-	 * neither IOMMU_READ or IOMMU_WRITE are set) because it is
-	 * possible, yet highly unlikely, that a client may unmap only
-	 * part of a contiguous range. This requires clearing of the
-	 * contiguous hint bits in the range before installing the new
-	 * faulting entries.
-	 *
-	 * Note that re-mapping an address range without first unmapping
-	 * it is not supported, so TLB invalidation is not required here
-	 * and is instead performed at unmap and domain-init time.
-	 */
-	do {
-		int i = 1;
-
-		pteval &= ~ARM_SMMU_PTE_CONT;
-
-		if (arm_smmu_pte_is_contiguous_range(addr, end)) {
-			i = ARM_SMMU_PTE_CONT_ENTRIES;
-			pteval |= ARM_SMMU_PTE_CONT;
-		} else if (pte_val(*pte) &
-			   (ARM_SMMU_PTE_CONT | ARM_SMMU_PTE_PAGE)) {
-			int j;
-			pte_t *cont_start;
-			unsigned long idx = pte_index(addr);
-
-			idx &= ~(ARM_SMMU_PTE_CONT_ENTRIES - 1);
-			cont_start = pmd_page_vaddr(*pmd) + idx;
-			for (j = 0; j < ARM_SMMU_PTE_CONT_ENTRIES; ++j)
-				pte_val(*(cont_start + j)) &=
-					~ARM_SMMU_PTE_CONT;
-
-			arm_smmu_flush_pgtable(smmu, cont_start,
-					       sizeof(*pte) *
-					       ARM_SMMU_PTE_CONT_ENTRIES);
-		}
-
-		do {
-			*pte = pfn_pte(pfn, __pgprot(pteval));
-		} while (pte++, pfn++, addr += PAGE_SIZE, --i);
-	} while (addr != end);
-
-	arm_smmu_flush_pgtable(smmu, start, sizeof(*pte) * (pte - start));
-	return 0;
-}
-
-static int arm_smmu_alloc_init_pmd(struct arm_smmu_device *smmu, pud_t *pud,
-				   unsigned long addr, unsigned long end,
-				   phys_addr_t phys, int prot, int stage)
-{
-	int ret;
-	pmd_t *pmd;
-	unsigned long next, pfn = __phys_to_pfn(phys);
-
-#ifndef __PAGETABLE_PMD_FOLDED
-	if (pud_none(*pud)) {
-		pmd = (pmd_t *)get_zeroed_page(GFP_ATOMIC);
-		if (!pmd)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, pmd, PAGE_SIZE);
-		pud_populate(NULL, pud, pmd);
-		arm_smmu_flush_pgtable(smmu, pud, sizeof(*pud));
-
-		pmd += pmd_index(addr);
-	} else
-#endif
-		pmd = pmd_offset(pud, addr);
-
-	do {
-		next = pmd_addr_end(addr, end);
-		ret = arm_smmu_alloc_init_pte(smmu, pmd, addr, next, pfn,
-					      prot, stage);
-		phys += next - addr;
-		pfn = __phys_to_pfn(phys);
-	} while (pmd++, addr = next, addr < end);
-
-	return ret;
-}
-
-static int arm_smmu_alloc_init_pud(struct arm_smmu_device *smmu, pgd_t *pgd,
-				   unsigned long addr, unsigned long end,
-				   phys_addr_t phys, int prot, int stage)
-{
-	int ret = 0;
-	pud_t *pud;
-	unsigned long next;
-
-#ifndef __PAGETABLE_PUD_FOLDED
-	if (pgd_none(*pgd)) {
-		pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
-		if (!pud)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, pud, PAGE_SIZE);
-		pgd_populate(NULL, pgd, pud);
-		arm_smmu_flush_pgtable(smmu, pgd, sizeof(*pgd));
-
-		pud += pud_index(addr);
-	} else
-#endif
-		pud = pud_offset(pgd, addr);
-
-	do {
-		next = pud_addr_end(addr, end);
-		ret = arm_smmu_alloc_init_pmd(smmu, pud, addr, next, phys,
-					      prot, stage);
-		phys += next - addr;
-	} while (pud++, addr = next, addr < end);
-
-	return ret;
-}
-
-static int arm_smmu_handle_mapping(struct arm_smmu_domain *smmu_domain,
-				   unsigned long iova, phys_addr_t paddr,
-				   size_t size, int prot)
-{
-	int ret, stage;
-	unsigned long end;
-	phys_addr_t input_mask, output_mask;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
-	pgd_t *pgd = cfg->pgd;
-	unsigned long flags;
-
-	if (cfg->cbar == CBAR_TYPE_S2_TRANS) {
-		stage = 2;
-		input_mask = (1ULL << smmu->s2_input_size) - 1;
-		output_mask = (1ULL << smmu->s2_output_size) - 1;
-	} else {
-		stage = 1;
-		input_mask = (1ULL << smmu->s1_input_size) - 1;
-		output_mask = (1ULL << smmu->s1_output_size) - 1;
-	}
-
-	if (!pgd)
-		return -EINVAL;
-
-	if (size & ~PAGE_MASK)
-		return -EINVAL;
-
-	if ((phys_addr_t)iova & ~input_mask)
-		return -ERANGE;
-
-	if (paddr & ~output_mask)
-		return -ERANGE;
-
-	spin_lock_irqsave(&smmu_domain->lock, flags);
-	pgd += pgd_index(iova);
-	end = iova + size;
-	do {
-		unsigned long next = pgd_addr_end(iova, end);
-
-		ret = arm_smmu_alloc_init_pud(smmu, pgd, iova, next, paddr,
-					      prot, stage);
-		if (ret)
-			goto out_unlock;
-
-		paddr += next - iova;
-		iova = next;
-	} while (pgd++, iova != end);
-
-out_unlock:
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
-
-	return ret;
-}
-
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
+	int ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
-	if (!smmu_domain)
+	if (!ops)
 		return -ENODEV;
 
-	return arm_smmu_handle_mapping(smmu_domain, iova, paddr, size, prot);
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->map(ops, iova, paddr, size, prot);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 			     size_t size)
 {
-	int ret;
+	size_t ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+	if (!ops)
+		return 0;
 
-	ret = arm_smmu_handle_mapping(smmu_domain, iova, 0, size, 0);
-	arm_smmu_tlb_inv_context(smmu_domain);
-	return ret ? 0 : size;
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->unmap(ops, iova, size);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 					 dma_addr_t iova)
 {
-	pgd_t *pgdp, pgd;
-	pud_t pud;
-	pmd_t pmd;
-	pte_t pte;
+	phys_addr_t ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
-	pgdp = cfg->pgd;
-	if (!pgdp)
+	if (!ops)
 		return 0;
 
-	pgd = *(pgdp + pgd_index(iova));
-	if (pgd_none(pgd))
-		return 0;
-
-	pud = *pud_offset(&pgd, iova);
-	if (pud_none(pud))
-		return 0;
-
-	pmd = *pmd_offset(&pud, iova);
-	if (pmd_none(pmd))
-		return 0;
-
-	pte = *(pmd_page_vaddr(pmd) + pte_index(iova));
-	if (pte_none(pte))
-		return 0;
-
-	return __pfn_to_phys(pte_pfn(pte)) | (iova & ~PAGE_MASK);
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->iova_to_phys(ops, iova);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static bool arm_smmu_capable(enum iommu_cap cap)
@@ -1698,24 +1330,34 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 				    enum iommu_attr attr, void *data)
 {
+	int ret = 0;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
 
+	mutex_lock(&smmu_domain->init_mutex);
+
 	switch (attr) {
 	case DOMAIN_ATTR_NESTING:
-		if (smmu_domain->smmu)
-			return -EPERM;
+		if (smmu_domain->smmu) {
+			ret = -EPERM;
+			goto out_unlock;
+		}
+
 		if (*(int *)data)
 			smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
 		else
 			smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
 
-		return 0;
+		break;
 	default:
-		return -ENODEV;
+		ret = -ENODEV;
 	}
+
+out_unlock:
+	mutex_unlock(&smmu_domain->init_mutex);
+	return ret;
 }
 
-static const struct iommu_ops arm_smmu_ops = {
+static struct iommu_ops arm_smmu_ops = {
 	.capable		= arm_smmu_capable,
 	.domain_init		= arm_smmu_domain_init,
 	.domain_destroy		= arm_smmu_domain_destroy,
@@ -1728,9 +1370,7 @@ static const struct iommu_ops arm_smmu_ops = {
 	.remove_device		= arm_smmu_remove_device,
 	.domain_get_attr	= arm_smmu_domain_get_attr,
 	.domain_set_attr	= arm_smmu_domain_set_attr,
-	.pgsize_bitmap		= (SECTION_SIZE |
-				   ARM_SMMU_PTE_CONT_SIZE |
-				   PAGE_SIZE),
+	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
 static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
@@ -1781,7 +1421,7 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
 	reg &= ~(sCR0_BSU_MASK << sCR0_BSU_SHIFT);
 
 	/* Push the button */
-	arm_smmu_tlb_sync(smmu);
+	__arm_smmu_tlb_sync(smmu);
 	writel(reg, ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);
 }
 
@@ -1815,12 +1455,6 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 
 	/* ID0 */
 	id = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID0);
-#ifndef CONFIG_64BIT
-	if (((id >> ID0_PTFS_SHIFT) & ID0_PTFS_MASK) == ID0_PTFS_V8_ONLY) {
-		dev_err(smmu->dev, "\tno v7 descriptor support!\n");
-		return -ENODEV;
-	}
-#endif
 
 	/* Restrict available stages based on module parameter */
 	if (force_stage == 1)
@@ -1893,16 +1527,14 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 	smmu->pgshift = (id & ID1_PAGESIZE) ? 16 : 12;
 
 	/* Check for size mismatch of SMMU address space from mapped region */
-	size = 1 <<
-		(((id >> ID1_NUMPAGENDXB_SHIFT) & ID1_NUMPAGENDXB_MASK) + 1);
+	size = 1 << (((id >> ID1_NUMPAGENDXB_SHIFT) & ID1_NUMPAGENDXB_MASK) + 1);
 	size *= 2 << smmu->pgshift;
 	if (smmu->size != size)
 		dev_warn(smmu->dev,
 			"SMMU address space size (0x%lx) differs from mapped region size (0x%lx)!\n",
 			size, smmu->size);
 
-	smmu->num_s2_context_banks = (id >> ID1_NUMS2CB_SHIFT) &
-				      ID1_NUMS2CB_MASK;
+	smmu->num_s2_context_banks = (id >> ID1_NUMS2CB_SHIFT) & ID1_NUMS2CB_MASK;
 	smmu->num_context_banks = (id >> ID1_NUMCB_SHIFT) & ID1_NUMCB_MASK;
 	if (smmu->num_s2_context_banks > smmu->num_context_banks) {
 		dev_err(smmu->dev, "impossible number of S2 context banks!\n");
@@ -1914,46 +1546,40 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 	/* ID2 */
 	id = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID2);
 	size = arm_smmu_id_size_to_bits((id >> ID2_IAS_SHIFT) & ID2_IAS_MASK);
-	smmu->s1_output_size = min_t(unsigned long, PHYS_MASK_SHIFT, size);
+	smmu->ipa_size = size;
 
-	/* Stage-2 input size limited due to pgd allocation (PTRS_PER_PGD) */
-#ifdef CONFIG_64BIT
-	smmu->s2_input_size = min_t(unsigned long, VA_BITS, size);
-#else
-	smmu->s2_input_size = min(32UL, size);
-#endif
-
-	/* The stage-2 output mask is also applied for bypass */
+	/* The output mask is also applied for bypass */
 	size = arm_smmu_id_size_to_bits((id >> ID2_OAS_SHIFT) & ID2_OAS_MASK);
-	smmu->s2_output_size = min_t(unsigned long, PHYS_MASK_SHIFT, size);
+	smmu->pa_size = size;
 
 	if (smmu->version == ARM_SMMU_V1) {
-		smmu->s1_input_size = 32;
+		smmu->va_size = smmu->ipa_size;
+		size = SZ_4K | SZ_2M | SZ_1G;
 	} else {
-#ifdef CONFIG_64BIT
 		size = (id >> ID2_UBS_SHIFT) & ID2_UBS_MASK;
-		size = min(VA_BITS, arm_smmu_id_size_to_bits(size));
-#else
-		size = 32;
+		smmu->va_size = arm_smmu_id_size_to_bits(size);
+#ifndef CONFIG_64BIT
+		smmu->va_size = min(32UL, smmu->va_size);
 #endif
-		smmu->s1_input_size = size;
-
-		if ((PAGE_SIZE == SZ_4K && !(id & ID2_PTFS_4K)) ||
-		    (PAGE_SIZE == SZ_64K && !(id & ID2_PTFS_64K)) ||
-		    (PAGE_SIZE != SZ_4K && PAGE_SIZE != SZ_64K)) {
-			dev_err(smmu->dev, "CPU page size 0x%lx unsupported\n",
-				PAGE_SIZE);
-			return -ENODEV;
-		}
+		size = 0;
+		if (id & ID2_PTFS_4K)
+			size |= SZ_4K | SZ_2M | SZ_1G;
+		if (id & ID2_PTFS_16K)
+			size |= SZ_16K | SZ_32M;
+		if (id & ID2_PTFS_64K)
+			size |= SZ_64K | SZ_512M;
 	}
 
+	arm_smmu_ops.pgsize_bitmap &= size;
+	dev_notice(smmu->dev, "\tSupported page sizes: 0x%08lx\n", size);
+
 	if (smmu->features & ARM_SMMU_FEAT_TRANS_S1)
 		dev_notice(smmu->dev, "\tStage-1: %lu-bit VA -> %lu-bit IPA\n",
-			   smmu->s1_input_size, smmu->s1_output_size);
+			   smmu->va_size, smmu->ipa_size);
 
 	if (smmu->features & ARM_SMMU_FEAT_TRANS_S2)
 		dev_notice(smmu->dev, "\tStage-2: %lu-bit IPA -> %lu-bit PA\n",
-			   smmu->s2_input_size, smmu->s2_output_size);
+			   smmu->ipa_size, smmu->pa_size);
 
 	return 0;
 }
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/4] iommu/arm-smmu: make use of generic LPAE allocator
@ 2014-11-27 11:51     ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-11-27 11:51 UTC (permalink / raw)
  To: linux-arm-kernel

The ARM SMMU can walk LPAE page tables, so make use of the generic
allocator.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig       |   1 -
 drivers/iommu/Kconfig    |   6 +-
 drivers/iommu/arm-smmu.c | 872 ++++++++++++++---------------------------------
 3 files changed, 252 insertions(+), 627 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9532f8d5857e..d2adb09c8f04 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -233,7 +233,6 @@ config ARM64_VA_BITS_42
 
 config ARM64_VA_BITS_48
 	bool "48-bit"
-	depends on !ARM_SMMU
 
 endchoice
 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dde72d0990b0..7e1bc3262663 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -317,13 +317,13 @@ config SPAPR_TCE_IOMMU
 
 config ARM_SMMU
 	bool "ARM Ltd. System MMU (SMMU) Support"
-	depends on ARM64 || (ARM_LPAE && OF)
+	depends on ARM64 || ARM
 	select IOMMU_API
+	select IOMMU_IO_PGTABLE_LPAE
 	select ARM_DMA_USE_IOMMU if ARM
 	help
 	  Support for implementations of the ARM System MMU architecture
-	  versions 1 and 2. The driver supports both v7l and v8l table
-	  formats with 4k and 64k page sizes.
+	  versions 1 and 2.
 
 	  Say Y here if your SoC includes an IOMMU device implementing
 	  the ARM SMMU architecture.
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7a80f710ba2d..f9cfac2d8ae9 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -23,8 +23,6 @@
  *	- Stream-matching and stream-indexing
  *	- v7/v8 long-descriptor format
  *	- Non-secure access to the SMMU
- *	- 4k and 64k pages, with contiguous pte hints.
- *	- Up to 48-bit addressing (dependent on VA_BITS)
  *	- Context fault reporting
  */
 
@@ -36,7 +34,6 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/iommu.h>
-#include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/of.h>
 #include <linux/pci.h>
@@ -46,7 +43,7 @@
 
 #include <linux/amba/bus.h>
 
-#include <asm/pgalloc.h>
+#include "io-pgtable.h"
 
 /* Maximum number of stream IDs assigned to a single device */
 #define MAX_MASTER_STREAMIDS		MAX_PHANDLE_ARGS
@@ -71,40 +68,6 @@
 		((smmu->options & ARM_SMMU_OPT_SECURE_CFG_ACCESS)	\
 			? 0x400 : 0))
 
-/* Page table bits */
-#define ARM_SMMU_PTE_XN			(((pteval_t)3) << 53)
-#define ARM_SMMU_PTE_CONT		(((pteval_t)1) << 52)
-#define ARM_SMMU_PTE_AF			(((pteval_t)1) << 10)
-#define ARM_SMMU_PTE_SH_NS		(((pteval_t)0) << 8)
-#define ARM_SMMU_PTE_SH_OS		(((pteval_t)2) << 8)
-#define ARM_SMMU_PTE_SH_IS		(((pteval_t)3) << 8)
-#define ARM_SMMU_PTE_PAGE		(((pteval_t)3) << 0)
-
-#if PAGE_SIZE == SZ_4K
-#define ARM_SMMU_PTE_CONT_ENTRIES	16
-#elif PAGE_SIZE == SZ_64K
-#define ARM_SMMU_PTE_CONT_ENTRIES	32
-#else
-#define ARM_SMMU_PTE_CONT_ENTRIES	1
-#endif
-
-#define ARM_SMMU_PTE_CONT_SIZE		(PAGE_SIZE * ARM_SMMU_PTE_CONT_ENTRIES)
-#define ARM_SMMU_PTE_CONT_MASK		(~(ARM_SMMU_PTE_CONT_SIZE - 1))
-
-/* Stage-1 PTE */
-#define ARM_SMMU_PTE_AP_UNPRIV		(((pteval_t)1) << 6)
-#define ARM_SMMU_PTE_AP_RDONLY		(((pteval_t)2) << 6)
-#define ARM_SMMU_PTE_ATTRINDX_SHIFT	2
-#define ARM_SMMU_PTE_nG			(((pteval_t)1) << 11)
-
-/* Stage-2 PTE */
-#define ARM_SMMU_PTE_HAP_FAULT		(((pteval_t)0) << 6)
-#define ARM_SMMU_PTE_HAP_READ		(((pteval_t)1) << 6)
-#define ARM_SMMU_PTE_HAP_WRITE		(((pteval_t)2) << 6)
-#define ARM_SMMU_PTE_MEMATTR_OIWB	(((pteval_t)0xf) << 2)
-#define ARM_SMMU_PTE_MEMATTR_NC		(((pteval_t)0x5) << 2)
-#define ARM_SMMU_PTE_MEMATTR_DEV	(((pteval_t)0x1) << 2)
-
 /* Configuration registers */
 #define ARM_SMMU_GR0_sCR0		0x0
 #define sCR0_CLIENTPD			(1 << 0)
@@ -132,17 +95,11 @@
 #define ARM_SMMU_GR0_sGFSYNR0		0x50
 #define ARM_SMMU_GR0_sGFSYNR1		0x54
 #define ARM_SMMU_GR0_sGFSYNR2		0x58
-#define ARM_SMMU_GR0_PIDR0		0xfe0
-#define ARM_SMMU_GR0_PIDR1		0xfe4
-#define ARM_SMMU_GR0_PIDR2		0xfe8
 
 #define ID0_S1TS			(1 << 30)
 #define ID0_S2TS			(1 << 29)
 #define ID0_NTS				(1 << 28)
 #define ID0_SMS				(1 << 27)
-#define ID0_PTFS_SHIFT			24
-#define ID0_PTFS_MASK			0x2
-#define ID0_PTFS_V8_ONLY		0x2
 #define ID0_CTTW			(1 << 14)
 #define ID0_NUMIRPT_SHIFT		16
 #define ID0_NUMIRPT_MASK		0xff
@@ -169,9 +126,6 @@
 #define ID2_PTFS_16K			(1 << 13)
 #define ID2_PTFS_64K			(1 << 14)
 
-#define PIDR2_ARCH_SHIFT		4
-#define PIDR2_ARCH_MASK			0xf
-
 /* Global TLB invalidation */
 #define ARM_SMMU_GR0_STLBIALL		0x60
 #define ARM_SMMU_GR0_TLBIVMID		0x64
@@ -231,13 +185,18 @@
 #define ARM_SMMU_CB_TTBCR2		0x10
 #define ARM_SMMU_CB_TTBR0_LO		0x20
 #define ARM_SMMU_CB_TTBR0_HI		0x24
+#define ARM_SMMU_CB_TTBR1_LO		0x28
+#define ARM_SMMU_CB_TTBR1_HI		0x2c
 #define ARM_SMMU_CB_TTBCR		0x30
 #define ARM_SMMU_CB_S1_MAIR0		0x38
+#define ARM_SMMU_CB_S1_MAIR1		0x3c
 #define ARM_SMMU_CB_FSR			0x58
 #define ARM_SMMU_CB_FAR_LO		0x60
 #define ARM_SMMU_CB_FAR_HI		0x64
 #define ARM_SMMU_CB_FSYNR0		0x68
+#define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
+#define ARM_SMMU_CB_S1_TLBIVAL		0x620
 
 #define SCTLR_S1_ASIDPNE		(1 << 12)
 #define SCTLR_CFCFG			(1 << 7)
@@ -252,44 +211,9 @@
 #define RESUME_RETRY			(0 << 0)
 #define RESUME_TERMINATE		(1 << 0)
 
-#define TTBCR_EAE			(1 << 31)
-
-#define TTBCR_PASIZE_SHIFT		16
-#define TTBCR_PASIZE_MASK		0x7
-
-#define TTBCR_TG0_4K			(0 << 14)
-#define TTBCR_TG0_64K			(1 << 14)
-
-#define TTBCR_SH0_SHIFT			12
-#define TTBCR_SH0_MASK			0x3
-#define TTBCR_SH_NS			0
-#define TTBCR_SH_OS			2
-#define TTBCR_SH_IS			3
-
-#define TTBCR_ORGN0_SHIFT		10
-#define TTBCR_IRGN0_SHIFT		8
-#define TTBCR_RGN_MASK			0x3
-#define TTBCR_RGN_NC			0
-#define TTBCR_RGN_WBWA			1
-#define TTBCR_RGN_WT			2
-#define TTBCR_RGN_WB			3
-
-#define TTBCR_SL0_SHIFT			6
-#define TTBCR_SL0_MASK			0x3
-#define TTBCR_SL0_LVL_2			0
-#define TTBCR_SL0_LVL_1			1
-
-#define TTBCR_T1SZ_SHIFT		16
-#define TTBCR_T0SZ_SHIFT		0
-#define TTBCR_SZ_MASK			0xf
-
 #define TTBCR2_SEP_SHIFT		15
 #define TTBCR2_SEP_MASK			0x7
 
-#define TTBCR2_PASIZE_SHIFT		0
-#define TTBCR2_PASIZE_MASK		0x7
-
-/* Common definitions for PASize and SEP fields */
 #define TTBCR2_ADDR_32			0
 #define TTBCR2_ADDR_36			1
 #define TTBCR2_ADDR_40			2
@@ -297,16 +221,7 @@
 #define TTBCR2_ADDR_44			4
 #define TTBCR2_ADDR_48			5
 
-#define TTBRn_HI_ASID_SHIFT		16
-
-#define MAIR_ATTR_SHIFT(n)		((n) << 3)
-#define MAIR_ATTR_MASK			0xff
-#define MAIR_ATTR_DEVICE		0x04
-#define MAIR_ATTR_NC			0x44
-#define MAIR_ATTR_WBRWA			0xff
-#define MAIR_ATTR_IDX_NC		0
-#define MAIR_ATTR_IDX_CACHE		1
-#define MAIR_ATTR_IDX_DEV		2
+#define TTBRn_HI_ASID_SHIFT		16
 
 #define FSR_MULTI			(1 << 31)
 #define FSR_SS				(1 << 30)
@@ -380,10 +295,9 @@ struct arm_smmu_device {
 	u32				num_mapping_groups;
 	DECLARE_BITMAP(smr_map, ARM_SMMU_MAX_SMRS);
 
-	unsigned long			s1_input_size;
-	unsigned long			s1_output_size;
-	unsigned long			s2_input_size;
-	unsigned long			s2_output_size;
+	unsigned long			va_size;
+	unsigned long			ipa_size;
+	unsigned long			pa_size;
 
 	u32				num_global_irqs;
 	u32				num_context_irqs;
@@ -397,7 +311,6 @@ struct arm_smmu_cfg {
 	u8				cbndx;
 	u8				irptndx;
 	u32				cbar;
-	pgd_t				*pgd;
 };
 #define INVALID_IRPTNDX			0xff
 
@@ -412,11 +325,15 @@ enum arm_smmu_domain_stage {
 
 struct arm_smmu_domain {
 	struct arm_smmu_device		*smmu;
+	struct io_pgtable_ops		*pgtbl_ops;
+	spinlock_t			pgtbl_lock;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
-	spinlock_t			lock;
+	struct mutex			init_mutex; /* Protects smmu pointer */
 };
 
+static struct iommu_ops arm_smmu_ops;
+
 static DEFINE_SPINLOCK(arm_smmu_devices_lock);
 static LIST_HEAD(arm_smmu_devices);
 
@@ -597,7 +514,7 @@ static void __arm_smmu_free_bitmap(unsigned long *map, int idx)
 }
 
 /* Wait for any pending TLB invalidations to complete */
-static void arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
+static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
 {
 	int count = 0;
 	void __iomem *gr0_base = ARM_SMMU_GR0(smmu);
@@ -615,12 +532,19 @@ static void arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
 	}
 }
 
-static void arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_tlb_sync(void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	__arm_smmu_tlb_sync(smmu_domain->smmu);
+}
+
+static void arm_smmu_tlb_inv_context(void *cookie)
 {
+	struct arm_smmu_domain *smmu_domain = cookie;
 	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
-	void __iomem *base = ARM_SMMU_GR0(smmu);
 	bool stage1 = cfg->cbar != CBAR_TYPE_S2_TRANS;
+	void __iomem *base;
 
 	if (stage1) {
 		base = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
@@ -632,9 +556,70 @@ static void arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain)
 			       base + ARM_SMMU_GR0_TLBIVMID);
 	}
 
-	arm_smmu_tlb_sync(smmu);
+	__arm_smmu_tlb_sync(smmu);
+}
+
+static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+					  bool leaf, void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	bool stage1 = cfg->cbar != CBAR_TYPE_S2_TRANS;
+	void __iomem *reg;
+
+	if (stage1) {
+		reg = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
+		reg += leaf ? ARM_SMMU_CB_S1_TLBIVAL : ARM_SMMU_CB_S1_TLBIVA;
+
+		if (!IS_ENABLED(CONFIG_64BIT) || smmu->version == ARM_SMMU_V1) {
+			iova &= ~12UL;
+			iova |= ARM_SMMU_CB_ASID(cfg);
+			writel_relaxed(iova, reg);
+#ifdef CONFIG_64BIT
+		} else {
+			iova >>= 12;
+			iova |= (u64)ARM_SMMU_CB_ASID(cfg) << 48;
+			writeq_relaxed(iova, reg);
+#endif
+		}
+	} else {
+		/* Invalidate by IPA is optional */
+		reg = ARM_SMMU_GR0(smmu) + ARM_SMMU_GR0_TLBIVMID;
+		writel_relaxed(ARM_SMMU_CB_VMID(cfg), reg);
+	}
 }
 
+static void arm_smmu_flush_pgtable(void *addr, size_t size, void *cookie)
+{
+	struct arm_smmu_domain *smmu_domain = cookie;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	unsigned long offset = (unsigned long)addr & ~PAGE_MASK;
+
+
+	/* Ensure new page tables are visible to the hardware walker */
+	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) {
+		dsb(ishst);
+	} else {
+		/*
+		 * If the SMMU can't walk tables in the CPU caches, treat them
+		 * like non-coherent DMA since we need to flush the new entries
+		 * all the way out to memory. There's no possibility of
+		 * recursion here as the SMMU table walker will not be wired
+		 * through another SMMU.
+		 */
+		dma_map_page(smmu->dev, virt_to_page(addr), offset, size,
+			     DMA_TO_DEVICE);
+	}
+}
+
+static struct iommu_gather_ops arm_smmu_gather_ops = {
+	.tlb_flush_all	= arm_smmu_tlb_inv_context,
+	.tlb_add_flush	= arm_smmu_tlb_inv_range_nosync,
+	.tlb_sync	= arm_smmu_tlb_sync,
+	.flush_pgtable	= arm_smmu_flush_pgtable,
+};
+
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
 	int flags, ret;
@@ -712,29 +697,8 @@ static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
 	return IRQ_HANDLED;
 }
 
-static void arm_smmu_flush_pgtable(struct arm_smmu_device *smmu, void *addr,
-				   size_t size)
-{
-	unsigned long offset = (unsigned long)addr & ~PAGE_MASK;
-
-
-	/* Ensure new page tables are visible to the hardware walker */
-	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK) {
-		dsb(ishst);
-	} else {
-		/*
-		 * If the SMMU can't walk tables in the CPU caches, treat them
-		 * like non-coherent DMA since we need to flush the new entries
-		 * all the way out to memory. There's no possibility of
-		 * recursion here as the SMMU table walker will not be wired
-		 * through another SMMU.
-		 */
-		dma_map_page(smmu->dev, virt_to_page(addr), offset, size,
-				DMA_TO_DEVICE);
-	}
-}
-
-static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
+				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	u32 reg;
 	bool stage1;
@@ -771,124 +735,68 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
 #else
 		reg = CBA2R_RW64_32BIT;
 #endif
-		writel_relaxed(reg,
-			       gr1_base + ARM_SMMU_GR1_CBA2R(cfg->cbndx));
-
-		/* TTBCR2 */
-		switch (smmu->s1_input_size) {
-		case 32:
-			reg = (TTBCR2_ADDR_32 << TTBCR2_SEP_SHIFT);
-			break;
-		case 36:
-			reg = (TTBCR2_ADDR_36 << TTBCR2_SEP_SHIFT);
-			break;
-		case 39:
-		case 40:
-			reg = (TTBCR2_ADDR_40 << TTBCR2_SEP_SHIFT);
-			break;
-		case 42:
-			reg = (TTBCR2_ADDR_42 << TTBCR2_SEP_SHIFT);
-			break;
-		case 44:
-			reg = (TTBCR2_ADDR_44 << TTBCR2_SEP_SHIFT);
-			break;
-		case 48:
-			reg = (TTBCR2_ADDR_48 << TTBCR2_SEP_SHIFT);
-			break;
-		}
-
-		switch (smmu->s1_output_size) {
-		case 32:
-			reg |= (TTBCR2_ADDR_32 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 36:
-			reg |= (TTBCR2_ADDR_36 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 39:
-		case 40:
-			reg |= (TTBCR2_ADDR_40 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 42:
-			reg |= (TTBCR2_ADDR_42 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 44:
-			reg |= (TTBCR2_ADDR_44 << TTBCR2_PASIZE_SHIFT);
-			break;
-		case 48:
-			reg |= (TTBCR2_ADDR_48 << TTBCR2_PASIZE_SHIFT);
-			break;
-		}
-
-		if (stage1)
-			writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR2);
+		writel_relaxed(reg, gr1_base + ARM_SMMU_GR1_CBA2R(cfg->cbndx));
 	}
 
-	/* TTBR0 */
-	arm_smmu_flush_pgtable(smmu, cfg->pgd,
-			       PTRS_PER_PGD * sizeof(pgd_t));
-	reg = __pa(cfg->pgd);
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
-	reg = (phys_addr_t)__pa(cfg->pgd) >> 32;
-	if (stage1)
+	/* TTBRs */
+	if (stage1) {
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[0];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[0] >> 32;
 		reg |= ARM_SMMU_CB_ASID(cfg) << TTBRn_HI_ASID_SHIFT;
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
-
-	/*
-	 * TTBCR
-	 * We use long descriptor, with inner-shareable WBWA tables in TTBR0.
-	 */
-	if (smmu->version > ARM_SMMU_V1) {
-		if (PAGE_SIZE == SZ_4K)
-			reg = TTBCR_TG0_4K;
-		else
-			reg = TTBCR_TG0_64K;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
 
-		if (!stage1) {
-			reg |= (64 - smmu->s2_input_size) << TTBCR_T0SZ_SHIFT;
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[1];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR1_LO);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.ttbr[1] >> 32;
+		reg |= ARM_SMMU_CB_ASID(cfg) << TTBRn_HI_ASID_SHIFT;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR1_HI);
+	} else {
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_LO);
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vttbr >> 32;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBR0_HI);
+	}
 
-			switch (smmu->s2_output_size) {
+	/* TTBCR */
+	if (stage1) {
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.tcr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
+		if (smmu->version > ARM_SMMU_V1) {
+			reg = pgtbl_cfg->arm_lpae_s1_cfg.tcr >> 32;
+			switch (smmu->va_size) {
 			case 32:
-				reg |= (TTBCR2_ADDR_32 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_32 << TTBCR2_SEP_SHIFT);
 				break;
 			case 36:
-				reg |= (TTBCR2_ADDR_36 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_36 << TTBCR2_SEP_SHIFT);
 				break;
 			case 40:
-				reg |= (TTBCR2_ADDR_40 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_40 << TTBCR2_SEP_SHIFT);
 				break;
 			case 42:
-				reg |= (TTBCR2_ADDR_42 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_42 << TTBCR2_SEP_SHIFT);
 				break;
 			case 44:
-				reg |= (TTBCR2_ADDR_44 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_44 << TTBCR2_SEP_SHIFT);
 				break;
 			case 48:
-				reg |= (TTBCR2_ADDR_48 << TTBCR_PASIZE_SHIFT);
+				reg |= (TTBCR2_ADDR_48 << TTBCR2_SEP_SHIFT);
 				break;
 			}
-		} else {
-			reg |= (64 - smmu->s1_input_size) << TTBCR_T0SZ_SHIFT;
+			writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR2);
 		}
 	} else {
-		reg = 0;
+		reg = pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
 	}
 
-	reg |= TTBCR_EAE |
-	      (TTBCR_SH_IS << TTBCR_SH0_SHIFT) |
-	      (TTBCR_RGN_WBWA << TTBCR_ORGN0_SHIFT) |
-	      (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT);
-
-	if (!stage1)
-		reg |= (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
-
-	writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
-
-	/* MAIR0 (stage-1 only) */
+	/* MAIRs (stage-1 only) */
 	if (stage1) {
-		reg = (MAIR_ATTR_NC << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_NC)) |
-		      (MAIR_ATTR_WBRWA << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_CACHE)) |
-		      (MAIR_ATTR_DEVICE << MAIR_ATTR_SHIFT(MAIR_ATTR_IDX_DEV));
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.mair[0];
 		writel_relaxed(reg, cb_base + ARM_SMMU_CB_S1_MAIR0);
+		reg = pgtbl_cfg->arm_lpae_s1_cfg.mair[1];
+		writel_relaxed(reg, cb_base + ARM_SMMU_CB_S1_MAIR1);
 	}
 
 	/* SCTLR */
@@ -905,11 +813,14 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 					struct arm_smmu_device *smmu)
 {
 	int irq, start, ret = 0;
-	unsigned long flags;
+	unsigned long ias, oas;
+	struct io_pgtable_ops *pgtbl_ops;
+	struct io_pgtable_cfg pgtbl_cfg;
+	enum io_pgtable_fmt fmt;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
 	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
 
-	spin_lock_irqsave(&smmu_domain->lock, flags);
+	mutex_lock(&smmu_domain->init_mutex);
 	if (smmu_domain->smmu)
 		goto out_unlock;
 
@@ -940,6 +851,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	case ARM_SMMU_DOMAIN_S1:
 		cfg->cbar = CBAR_TYPE_S1_TRANS_S2_BYPASS;
 		start = smmu->num_s2_context_banks;
+		ias = smmu->va_size;
+		oas = smmu->ipa_size;
+		fmt = ARM_LPAE_S1;
 		break;
 	case ARM_SMMU_DOMAIN_NESTED:
 		/*
@@ -949,6 +863,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	case ARM_SMMU_DOMAIN_S2:
 		cfg->cbar = CBAR_TYPE_S2_TRANS;
 		start = 0;
+		ias = smmu->ipa_size;
+		oas = smmu->pa_size;
+		fmt = ARM_LPAE_S2;
 		break;
 	default:
 		ret = -EINVAL;
@@ -968,10 +885,30 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		cfg->irptndx = cfg->cbndx;
 	}
 
-	ACCESS_ONCE(smmu_domain->smmu) = smmu;
-	arm_smmu_init_context_bank(smmu_domain);
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
+	pgtbl_cfg = (struct io_pgtable_cfg) {
+		.pgsize_bitmap	= arm_smmu_ops.pgsize_bitmap,
+		.ias		= ias,
+		.oas		= oas,
+		.tlb		= &arm_smmu_gather_ops,
+	};
 
+	smmu_domain->smmu = smmu;
+	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
+	if (!pgtbl_ops) {
+		ret = -ENOMEM;
+		goto out_clear_smmu;
+	}
+
+	/* Update our support page sizes to reflect the page table format */
+	arm_smmu_ops.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+
+	/* Initialise the context bank with our page table cfg */
+	arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);
+
+	/*
+	 * Request context fault interrupt. Do this last to avoid the
+	 * handler seeing a half-initialised domain state.
+	 */
 	irq = smmu->irqs[smmu->num_global_irqs + cfg->irptndx];
 	ret = request_irq(irq, arm_smmu_context_fault, IRQF_SHARED,
 			  "arm-smmu-context-fault", domain);
@@ -981,10 +918,16 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		cfg->irptndx = INVALID_IRPTNDX;
 	}
 
+	mutex_unlock(&smmu_domain->init_mutex);
+
+	/* Publish page table ops for map/unmap */
+	smmu_domain->pgtbl_ops = pgtbl_ops;
 	return 0;
 
+out_clear_smmu:
+	smmu_domain->smmu = NULL;
 out_unlock:
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
+	mutex_unlock(&smmu_domain->init_mutex);
 	return ret;
 }
 
@@ -999,23 +942,27 @@ static void arm_smmu_destroy_domain_context(struct iommu_domain *domain)
 	if (!smmu)
 		return;
 
-	/* Disable the context bank and nuke the TLB before freeing it. */
+	/*
+	 * Disable the context bank and free the page tables before freeing
+	 * it.
+	 */
 	cb_base = ARM_SMMU_CB_BASE(smmu) + ARM_SMMU_CB(smmu, cfg->cbndx);
 	writel_relaxed(0, cb_base + ARM_SMMU_CB_SCTLR);
-	arm_smmu_tlb_inv_context(smmu_domain);
 
 	if (cfg->irptndx != INVALID_IRPTNDX) {
 		irq = smmu->irqs[smmu->num_global_irqs + cfg->irptndx];
 		free_irq(irq, domain);
 	}
 
+	if (smmu_domain->pgtbl_ops)
+		free_io_pgtable_ops(smmu_domain->pgtbl_ops);
+
 	__arm_smmu_free_bitmap(smmu->context_map, cfg->cbndx);
 }
 
 static int arm_smmu_domain_init(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain;
-	pgd_t *pgd;
 
 	/*
 	 * Allocate the domain and initialise some of its data structures.
@@ -1026,81 +973,10 @@ static int arm_smmu_domain_init(struct iommu_domain *domain)
 	if (!smmu_domain)
 		return -ENOMEM;
 
-	pgd = kcalloc(PTRS_PER_PGD, sizeof(pgd_t), GFP_KERNEL);
-	if (!pgd)
-		goto out_free_domain;
-	smmu_domain->cfg.pgd = pgd;
-
-	spin_lock_init(&smmu_domain->lock);
+	mutex_init(&smmu_domain->init_mutex);
+	spin_lock_init(&smmu_domain->pgtbl_lock);
 	domain->priv = smmu_domain;
 	return 0;
-
-out_free_domain:
-	kfree(smmu_domain);
-	return -ENOMEM;
-}
-
-static void arm_smmu_free_ptes(pmd_t *pmd)
-{
-	pgtable_t table = pmd_pgtable(*pmd);
-
-	__free_page(table);
-}
-
-static void arm_smmu_free_pmds(pud_t *pud)
-{
-	int i;
-	pmd_t *pmd, *pmd_base = pmd_offset(pud, 0);
-
-	pmd = pmd_base;
-	for (i = 0; i < PTRS_PER_PMD; ++i) {
-		if (pmd_none(*pmd))
-			continue;
-
-		arm_smmu_free_ptes(pmd);
-		pmd++;
-	}
-
-	pmd_free(NULL, pmd_base);
-}
-
-static void arm_smmu_free_puds(pgd_t *pgd)
-{
-	int i;
-	pud_t *pud, *pud_base = pud_offset(pgd, 0);
-
-	pud = pud_base;
-	for (i = 0; i < PTRS_PER_PUD; ++i) {
-		if (pud_none(*pud))
-			continue;
-
-		arm_smmu_free_pmds(pud);
-		pud++;
-	}
-
-	pud_free(NULL, pud_base);
-}
-
-static void arm_smmu_free_pgtables(struct arm_smmu_domain *smmu_domain)
-{
-	int i;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
-	pgd_t *pgd, *pgd_base = cfg->pgd;
-
-	/*
-	 * Recursively free the page tables for this domain. We don't
-	 * care about speculative TLB filling because the tables should
-	 * not be active in any context bank at this point (SCTLR.M is 0).
-	 */
-	pgd = pgd_base;
-	for (i = 0; i < PTRS_PER_PGD; ++i) {
-		if (pgd_none(*pgd))
-			continue;
-		arm_smmu_free_puds(pgd);
-		pgd++;
-	}
-
-	kfree(pgd_base);
 }
 
 static void arm_smmu_domain_destroy(struct iommu_domain *domain)
@@ -1112,7 +988,6 @@ static void arm_smmu_domain_destroy(struct iommu_domain *domain)
 	 * already been detached.
 	 */
 	arm_smmu_destroy_domain_context(domain);
-	arm_smmu_free_pgtables(smmu_domain);
 	kfree(smmu_domain);
 }
 
@@ -1244,7 +1119,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
-	struct arm_smmu_device *smmu, *dom_smmu;
+	struct arm_smmu_device *smmu;
 	struct arm_smmu_master_cfg *cfg;
 
 	smmu = find_smmu_for_device(dev);
@@ -1258,21 +1133,16 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		return -EEXIST;
 	}
 
+	/* Ensure that the domain is finalised */
+	ret = arm_smmu_init_domain_context(domain, smmu);
+	if (IS_ERR_VALUE(ret))
+		return ret;
+
 	/*
 	 * Sanity check the domain. We don't support domains across
 	 * different SMMUs.
 	 */
-	dom_smmu = ACCESS_ONCE(smmu_domain->smmu);
-	if (!dom_smmu) {
-		/* Now that we have a master, we can finalise the domain */
-		ret = arm_smmu_init_domain_context(domain, smmu);
-		if (IS_ERR_VALUE(ret))
-			return ret;
-
-		dom_smmu = smmu_domain->smmu;
-	}
-
-	if (dom_smmu != smmu) {
+	if (smmu_domain->smmu != smmu) {
 		dev_err(dev,
 			"cannot attach to SMMU %s whilst already attached to domain on SMMU %s\n",
 			dev_name(smmu_domain->smmu->dev), dev_name(smmu->dev));
@@ -1303,293 +1173,55 @@ static void arm_smmu_detach_dev(struct iommu_domain *domain, struct device *dev)
 	arm_smmu_domain_remove_master(smmu_domain, cfg);
 }
 
-static bool arm_smmu_pte_is_contiguous_range(unsigned long addr,
-					     unsigned long end)
-{
-	return !(addr & ~ARM_SMMU_PTE_CONT_MASK) &&
-		(addr + ARM_SMMU_PTE_CONT_SIZE <= end);
-}
-
-static int arm_smmu_alloc_init_pte(struct arm_smmu_device *smmu, pmd_t *pmd,
-				   unsigned long addr, unsigned long end,
-				   unsigned long pfn, int prot, int stage)
-{
-	pte_t *pte, *start;
-	pteval_t pteval = ARM_SMMU_PTE_PAGE | ARM_SMMU_PTE_AF;
-
-	if (pmd_none(*pmd)) {
-		/* Allocate a new set of tables */
-		pgtable_t table = alloc_page(GFP_ATOMIC|__GFP_ZERO);
-
-		if (!table)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, page_address(table), PAGE_SIZE);
-		pmd_populate(NULL, pmd, table);
-		arm_smmu_flush_pgtable(smmu, pmd, sizeof(*pmd));
-	}
-
-	if (stage == 1) {
-		pteval |= ARM_SMMU_PTE_AP_UNPRIV | ARM_SMMU_PTE_nG;
-		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
-			pteval |= ARM_SMMU_PTE_AP_RDONLY;
-
-		if (prot & IOMMU_CACHE)
-			pteval |= (MAIR_ATTR_IDX_CACHE <<
-				   ARM_SMMU_PTE_ATTRINDX_SHIFT);
-	} else {
-		pteval |= ARM_SMMU_PTE_HAP_FAULT;
-		if (prot & IOMMU_READ)
-			pteval |= ARM_SMMU_PTE_HAP_READ;
-		if (prot & IOMMU_WRITE)
-			pteval |= ARM_SMMU_PTE_HAP_WRITE;
-		if (prot & IOMMU_CACHE)
-			pteval |= ARM_SMMU_PTE_MEMATTR_OIWB;
-		else
-			pteval |= ARM_SMMU_PTE_MEMATTR_NC;
-	}
-
-	if (prot & IOMMU_NOEXEC)
-		pteval |= ARM_SMMU_PTE_XN;
-
-	/* If no access, create a faulting entry to avoid TLB fills */
-	if (!(prot & (IOMMU_READ | IOMMU_WRITE)))
-		pteval &= ~ARM_SMMU_PTE_PAGE;
-
-	pteval |= ARM_SMMU_PTE_SH_IS;
-	start = pmd_page_vaddr(*pmd) + pte_index(addr);
-	pte = start;
-
-	/*
-	 * Install the page table entries. This is fairly complicated
-	 * since we attempt to make use of the contiguous hint in the
-	 * ptes where possible. The contiguous hint indicates a series
-	 * of ARM_SMMU_PTE_CONT_ENTRIES ptes mapping a physically
-	 * contiguous region with the following constraints:
-	 *
-	 *   - The region start is aligned to ARM_SMMU_PTE_CONT_SIZE
-	 *   - Each pte in the region has the contiguous hint bit set
-	 *
-	 * This complicates unmapping (also handled by this code, when
-	 * neither IOMMU_READ or IOMMU_WRITE are set) because it is
-	 * possible, yet highly unlikely, that a client may unmap only
-	 * part of a contiguous range. This requires clearing of the
-	 * contiguous hint bits in the range before installing the new
-	 * faulting entries.
-	 *
-	 * Note that re-mapping an address range without first unmapping
-	 * it is not supported, so TLB invalidation is not required here
-	 * and is instead performed at unmap and domain-init time.
-	 */
-	do {
-		int i = 1;
-
-		pteval &= ~ARM_SMMU_PTE_CONT;
-
-		if (arm_smmu_pte_is_contiguous_range(addr, end)) {
-			i = ARM_SMMU_PTE_CONT_ENTRIES;
-			pteval |= ARM_SMMU_PTE_CONT;
-		} else if (pte_val(*pte) &
-			   (ARM_SMMU_PTE_CONT | ARM_SMMU_PTE_PAGE)) {
-			int j;
-			pte_t *cont_start;
-			unsigned long idx = pte_index(addr);
-
-			idx &= ~(ARM_SMMU_PTE_CONT_ENTRIES - 1);
-			cont_start = pmd_page_vaddr(*pmd) + idx;
-			for (j = 0; j < ARM_SMMU_PTE_CONT_ENTRIES; ++j)
-				pte_val(*(cont_start + j)) &=
-					~ARM_SMMU_PTE_CONT;
-
-			arm_smmu_flush_pgtable(smmu, cont_start,
-					       sizeof(*pte) *
-					       ARM_SMMU_PTE_CONT_ENTRIES);
-		}
-
-		do {
-			*pte = pfn_pte(pfn, __pgprot(pteval));
-		} while (pte++, pfn++, addr += PAGE_SIZE, --i);
-	} while (addr != end);
-
-	arm_smmu_flush_pgtable(smmu, start, sizeof(*pte) * (pte - start));
-	return 0;
-}
-
-static int arm_smmu_alloc_init_pmd(struct arm_smmu_device *smmu, pud_t *pud,
-				   unsigned long addr, unsigned long end,
-				   phys_addr_t phys, int prot, int stage)
-{
-	int ret;
-	pmd_t *pmd;
-	unsigned long next, pfn = __phys_to_pfn(phys);
-
-#ifndef __PAGETABLE_PMD_FOLDED
-	if (pud_none(*pud)) {
-		pmd = (pmd_t *)get_zeroed_page(GFP_ATOMIC);
-		if (!pmd)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, pmd, PAGE_SIZE);
-		pud_populate(NULL, pud, pmd);
-		arm_smmu_flush_pgtable(smmu, pud, sizeof(*pud));
-
-		pmd += pmd_index(addr);
-	} else
-#endif
-		pmd = pmd_offset(pud, addr);
-
-	do {
-		next = pmd_addr_end(addr, end);
-		ret = arm_smmu_alloc_init_pte(smmu, pmd, addr, next, pfn,
-					      prot, stage);
-		phys += next - addr;
-		pfn = __phys_to_pfn(phys);
-	} while (pmd++, addr = next, addr < end);
-
-	return ret;
-}
-
-static int arm_smmu_alloc_init_pud(struct arm_smmu_device *smmu, pgd_t *pgd,
-				   unsigned long addr, unsigned long end,
-				   phys_addr_t phys, int prot, int stage)
-{
-	int ret = 0;
-	pud_t *pud;
-	unsigned long next;
-
-#ifndef __PAGETABLE_PUD_FOLDED
-	if (pgd_none(*pgd)) {
-		pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
-		if (!pud)
-			return -ENOMEM;
-
-		arm_smmu_flush_pgtable(smmu, pud, PAGE_SIZE);
-		pgd_populate(NULL, pgd, pud);
-		arm_smmu_flush_pgtable(smmu, pgd, sizeof(*pgd));
-
-		pud += pud_index(addr);
-	} else
-#endif
-		pud = pud_offset(pgd, addr);
-
-	do {
-		next = pud_addr_end(addr, end);
-		ret = arm_smmu_alloc_init_pmd(smmu, pud, addr, next, phys,
-					      prot, stage);
-		phys += next - addr;
-	} while (pud++, addr = next, addr < end);
-
-	return ret;
-}
-
-static int arm_smmu_handle_mapping(struct arm_smmu_domain *smmu_domain,
-				   unsigned long iova, phys_addr_t paddr,
-				   size_t size, int prot)
-{
-	int ret, stage;
-	unsigned long end;
-	phys_addr_t input_mask, output_mask;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
-	pgd_t *pgd = cfg->pgd;
-	unsigned long flags;
-
-	if (cfg->cbar == CBAR_TYPE_S2_TRANS) {
-		stage = 2;
-		input_mask = (1ULL << smmu->s2_input_size) - 1;
-		output_mask = (1ULL << smmu->s2_output_size) - 1;
-	} else {
-		stage = 1;
-		input_mask = (1ULL << smmu->s1_input_size) - 1;
-		output_mask = (1ULL << smmu->s1_output_size) - 1;
-	}
-
-	if (!pgd)
-		return -EINVAL;
-
-	if (size & ~PAGE_MASK)
-		return -EINVAL;
-
-	if ((phys_addr_t)iova & ~input_mask)
-		return -ERANGE;
-
-	if (paddr & ~output_mask)
-		return -ERANGE;
-
-	spin_lock_irqsave(&smmu_domain->lock, flags);
-	pgd += pgd_index(iova);
-	end = iova + size;
-	do {
-		unsigned long next = pgd_addr_end(iova, end);
-
-		ret = arm_smmu_alloc_init_pud(smmu, pgd, iova, next, paddr,
-					      prot, stage);
-		if (ret)
-			goto out_unlock;
-
-		paddr += next - iova;
-		iova = next;
-	} while (pgd++, iova != end);
-
-out_unlock:
-	spin_unlock_irqrestore(&smmu_domain->lock, flags);
-
-	return ret;
-}
-
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
+	int ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
-	if (!smmu_domain)
+	if (!ops)
 		return -ENODEV;
 
-	return arm_smmu_handle_mapping(smmu_domain, iova, paddr, size, prot);
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->map(ops, iova, paddr, size, prot);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 			     size_t size)
 {
-	int ret;
+	size_t ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+	if (!ops)
+		return 0;
 
-	ret = arm_smmu_handle_mapping(smmu_domain, iova, 0, size, 0);
-	arm_smmu_tlb_inv_context(smmu_domain);
-	return ret ? 0 : size;
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->unmap(ops, iova, size);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 					 dma_addr_t iova)
 {
-	pgd_t *pgdp, pgd;
-	pud_t pud;
-	pmd_t pmd;
-	pte_t pte;
+	phys_addr_t ret;
+	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
-	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
-	pgdp = cfg->pgd;
-	if (!pgdp)
+	if (!ops)
 		return 0;
 
-	pgd = *(pgdp + pgd_index(iova));
-	if (pgd_none(pgd))
-		return 0;
-
-	pud = *pud_offset(&pgd, iova);
-	if (pud_none(pud))
-		return 0;
-
-	pmd = *pmd_offset(&pud, iova);
-	if (pmd_none(pmd))
-		return 0;
-
-	pte = *(pmd_page_vaddr(pmd) + pte_index(iova));
-	if (pte_none(pte))
-		return 0;
-
-	return __pfn_to_phys(pte_pfn(pte)) | (iova & ~PAGE_MASK);
+	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
+	ret = ops->iova_to_phys(ops, iova);
+	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
+	return ret;
 }
 
 static bool arm_smmu_capable(enum iommu_cap cap)
@@ -1698,24 +1330,34 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 				    enum iommu_attr attr, void *data)
 {
+	int ret = 0;
 	struct arm_smmu_domain *smmu_domain = domain->priv;
 
+	mutex_lock(&smmu_domain->init_mutex);
+
 	switch (attr) {
 	case DOMAIN_ATTR_NESTING:
-		if (smmu_domain->smmu)
-			return -EPERM;
+		if (smmu_domain->smmu) {
+			ret = -EPERM;
+			goto out_unlock;
+		}
+
 		if (*(int *)data)
 			smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
 		else
 			smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
 
-		return 0;
+		break;
 	default:
-		return -ENODEV;
+		ret = -ENODEV;
 	}
+
+out_unlock:
+	mutex_unlock(&smmu_domain->init_mutex);
+	return ret;
 }
 
-static const struct iommu_ops arm_smmu_ops = {
+static struct iommu_ops arm_smmu_ops = {
 	.capable		= arm_smmu_capable,
 	.domain_init		= arm_smmu_domain_init,
 	.domain_destroy		= arm_smmu_domain_destroy,
@@ -1728,9 +1370,7 @@ static const struct iommu_ops arm_smmu_ops = {
 	.remove_device		= arm_smmu_remove_device,
 	.domain_get_attr	= arm_smmu_domain_get_attr,
 	.domain_set_attr	= arm_smmu_domain_set_attr,
-	.pgsize_bitmap		= (SECTION_SIZE |
-				   ARM_SMMU_PTE_CONT_SIZE |
-				   PAGE_SIZE),
+	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
 static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
@@ -1781,7 +1421,7 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
 	reg &= ~(sCR0_BSU_MASK << sCR0_BSU_SHIFT);
 
 	/* Push the button */
-	arm_smmu_tlb_sync(smmu);
+	__arm_smmu_tlb_sync(smmu);
 	writel(reg, ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);
 }
 
@@ -1815,12 +1455,6 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 
 	/* ID0 */
 	id = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID0);
-#ifndef CONFIG_64BIT
-	if (((id >> ID0_PTFS_SHIFT) & ID0_PTFS_MASK) == ID0_PTFS_V8_ONLY) {
-		dev_err(smmu->dev, "\tno v7 descriptor support!\n");
-		return -ENODEV;
-	}
-#endif
 
 	/* Restrict available stages based on module parameter */
 	if (force_stage == 1)
@@ -1893,16 +1527,14 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 	smmu->pgshift = (id & ID1_PAGESIZE) ? 16 : 12;
 
 	/* Check for size mismatch of SMMU address space from mapped region */
-	size = 1 <<
-		(((id >> ID1_NUMPAGENDXB_SHIFT) & ID1_NUMPAGENDXB_MASK) + 1);
+	size = 1 << (((id >> ID1_NUMPAGENDXB_SHIFT) & ID1_NUMPAGENDXB_MASK) + 1);
 	size *= 2 << smmu->pgshift;
 	if (smmu->size != size)
 		dev_warn(smmu->dev,
 			"SMMU address space size (0x%lx) differs from mapped region size (0x%lx)!\n",
 			size, smmu->size);
 
-	smmu->num_s2_context_banks = (id >> ID1_NUMS2CB_SHIFT) &
-				      ID1_NUMS2CB_MASK;
+	smmu->num_s2_context_banks = (id >> ID1_NUMS2CB_SHIFT) & ID1_NUMS2CB_MASK;
 	smmu->num_context_banks = (id >> ID1_NUMCB_SHIFT) & ID1_NUMCB_MASK;
 	if (smmu->num_s2_context_banks > smmu->num_context_banks) {
 		dev_err(smmu->dev, "impossible number of S2 context banks!\n");
@@ -1914,46 +1546,40 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 	/* ID2 */
 	id = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID2);
 	size = arm_smmu_id_size_to_bits((id >> ID2_IAS_SHIFT) & ID2_IAS_MASK);
-	smmu->s1_output_size = min_t(unsigned long, PHYS_MASK_SHIFT, size);
+	smmu->ipa_size = size;
 
-	/* Stage-2 input size limited due to pgd allocation (PTRS_PER_PGD) */
-#ifdef CONFIG_64BIT
-	smmu->s2_input_size = min_t(unsigned long, VA_BITS, size);
-#else
-	smmu->s2_input_size = min(32UL, size);
-#endif
-
-	/* The stage-2 output mask is also applied for bypass */
+	/* The output mask is also applied for bypass */
 	size = arm_smmu_id_size_to_bits((id >> ID2_OAS_SHIFT) & ID2_OAS_MASK);
-	smmu->s2_output_size = min_t(unsigned long, PHYS_MASK_SHIFT, size);
+	smmu->pa_size = size;
 
 	if (smmu->version == ARM_SMMU_V1) {
-		smmu->s1_input_size = 32;
+		smmu->va_size = smmu->ipa_size;
+		size = SZ_4K | SZ_2M | SZ_1G;
 	} else {
-#ifdef CONFIG_64BIT
 		size = (id >> ID2_UBS_SHIFT) & ID2_UBS_MASK;
-		size = min(VA_BITS, arm_smmu_id_size_to_bits(size));
-#else
-		size = 32;
+		smmu->va_size = arm_smmu_id_size_to_bits(size);
+#ifndef CONFIG_64BIT
+		smmu->va_size = min(32UL, smmu->va_size);
 #endif
-		smmu->s1_input_size = size;
-
-		if ((PAGE_SIZE == SZ_4K && !(id & ID2_PTFS_4K)) ||
-		    (PAGE_SIZE == SZ_64K && !(id & ID2_PTFS_64K)) ||
-		    (PAGE_SIZE != SZ_4K && PAGE_SIZE != SZ_64K)) {
-			dev_err(smmu->dev, "CPU page size 0x%lx unsupported\n",
-				PAGE_SIZE);
-			return -ENODEV;
-		}
+		size = 0;
+		if (id & ID2_PTFS_4K)
+			size |= SZ_4K | SZ_2M | SZ_1G;
+		if (id & ID2_PTFS_16K)
+			size |= SZ_16K | SZ_32M;
+		if (id & ID2_PTFS_64K)
+			size |= SZ_64K | SZ_512M;
 	}
 
+	arm_smmu_ops.pgsize_bitmap &= size;
+	dev_notice(smmu->dev, "\tSupported page sizes: 0x%08lx\n", size);
+
 	if (smmu->features & ARM_SMMU_FEAT_TRANS_S1)
 		dev_notice(smmu->dev, "\tStage-1: %lu-bit VA -> %lu-bit IPA\n",
-			   smmu->s1_input_size, smmu->s1_output_size);
+			   smmu->va_size, smmu->ipa_size);
 
 	if (smmu->features & ARM_SMMU_FEAT_TRANS_S2)
 		dev_notice(smmu->dev, "\tStage-2: %lu-bit IPA -> %lu-bit PA\n",
-			   smmu->s2_input_size, smmu->s2_output_size);
+			   smmu->ipa_size, smmu->pa_size);
 
 	return 0;
 }
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-11-27 11:51     ` Will Deacon
@ 2014-11-30 22:00         ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-11-30 22:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

Thank you for the patch.

On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> This patch introduces a generic framework for allocating page tables for
> an IOMMU. There are a number of reasons we want to do this:
> 
>   - It avoids duplication of complex table management code in IOMMU
>     drivers that use the same page table format
> 
>   - It removes any coupling with the CPU table format (and even the
>     architecture!)
> 
>   - It defines an API for IOMMU TLB maintenance
> 
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---
>  drivers/iommu/Kconfig      |  8 ++++++
>  drivers/iommu/Makefile     |  1 +
>  drivers/iommu/io-pgtable.c | 71 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 145 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index dd5112265cc9..0f10554e7114 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -13,6 +13,14 @@ menuconfig IOMMU_SUPPORT
> 
>  if IOMMU_SUPPORT
> 
> +menu "Generic IOMMU Pagetable Support"
> +
> +# Selected by the actual pagetable implementations
> +config IOMMU_IO_PGTABLE
> +	bool
> +
> +endmenu
> +
>  config OF_IOMMU
>         def_bool y
>         depends on OF
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 16edef74b8ee..aff244c78181 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,6 +1,7 @@
>  obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> +obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
>  obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> new file mode 100644
> index 000000000000..82e39a0db94b
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.c
> @@ -0,0 +1,71 @@
> +/*
> + * Generic page table allocator for IOMMUs.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
> USA. + *
> + * Copyright (C) 2014 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> + */
> +
> +#include <linux/bug.h>
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +
> +#include "io-pgtable.h"
> +
> +static struct io_pgtable_init_fns

Any reason not to make the table const?

> *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
> +{
> +};
> +
> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie)
> +{
> +	struct io_pgtable *iop;
> +	struct io_pgtable_init_fns *fns;
> +
> +	if (fmt >= IO_PGTABLE_NUM_FMTS)
> +		return NULL;
> +
> +	fns = io_pgtable_init_table[fmt];
> +	if (!fns)
> +		return NULL;
> +
> +	iop = fns->alloc(cfg, cookie);
> +	if (!iop)
> +		return NULL;
> +
> +	iop->fmt	= fmt;
> +	iop->cookie	= cookie;
> +	iop->cfg	= *cfg;
> +
> +	return &iop->ops;
> +}
> +
> +/*
> + * It is the IOMMU driver's responsibility to ensure that the page table
> + * is no longer accessible to the walker by this point.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops)
> +{
> +	struct io_pgtable *iop;
> +
> +	if (!ops)
> +		return;
> +
> +	iop = container_of(ops, struct io_pgtable, ops);
> +	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> +	io_pgtable_init_table[iop->fmt]->free(iop);
> +}
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> new file mode 100644
> index 000000000000..5ae75d9cae50
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.h
> @@ -0,0 +1,65 @@
> +#ifndef __IO_PGTABLE_H
> +#define __IO_PGTABLE_H
> +
> +struct io_pgtable_ops {
> +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,

How about passing a struct io_pgtable * instead of the ops pointer? This
would require returning a struct io_pgtable from the alloc function, which I 
suppose you didn't want to do to ensure the caller will not touch the struct 
io_pgtable fields directly. Do we really need to go that far, or can we simply 
document struct io_pgtable as being private to the pg alloc framework core and 
allocators? Someone who really wants to get hold of the io_pgtable instance
could use container_of on the ops anyway, like the allocators do.
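
For instance, something along these lines would do; the helper name is made
up here, and it relies on the struct io_pgtable definition quoted above:

	static struct io_pgtable *
	io_pgtable_ops_to_pgtable(struct io_pgtable_ops *ops)
	{
		return container_of(ops, struct io_pgtable, ops);
	}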

> +		   phys_addr_t paddr, size_t size, int prot);
> +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> +		     size_t size);
> +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> +				    unsigned long iova);
> +};
> +
> +struct iommu_gather_ops {
> +	/* Synchronously invalidate the entire TLB context */
> +	void (*tlb_flush_all)(void *cookie);
> +
> +	/* Queue up a TLB invalidation for a virtual address range */
> +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> +			      void *cookie);

Is there a limit to the number of entries that can be queued, or any other 
kind of restriction? Implementing a completely generic TLB flush queue can
become complex for IOMMU drivers.

I would also document in which context(s) this callback will be called, as 
IOMMU drivers might be tempted to allocate memory in order to implement a TLB 
flush queue.
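
For what it's worth, a driver doesn't have to build a real queue at all. A
minimal scheme, roughly what the arm-smmu conversion later in this series
does, is to post the invalidation to the hardware straight away from
tlb_add_flush and make tlb_sync the only point that waits for completion.
Sketch only; foo_domain, foo_hw_inv_range() and foo_hw_wait() are made-up
driver internals:

	static void foo_tlb_add_flush(unsigned long iova, size_t size,
				      bool leaf, void *cookie)
	{
		struct foo_domain *dom = cookie;

		/* Post the invalidation to the hardware, don't wait for it */
		foo_hw_inv_range(dom, iova, size, leaf);
	}

	static void foo_tlb_sync(void *cookie)
	{
		struct foo_domain *dom = cookie;

		/* Single wait point for everything posted via tlb_add_flush */
		foo_hw_wait(dom);
	}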

> +	/* Ensure any queued TLB invalidation has taken effect */
> +	void (*tlb_sync)(void *cookie);
> +
> +	/* Ensure page tables updates are visible to the IOMMU */
> +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> +};

I suppose kerneldoc will come in the next version ;-)
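
Something like the following, reusing the wording of the inline comments
above, would probably be enough (sketch only):

	/**
	 * struct iommu_gather_ops - IOMMU callbacks for TLB and page table
	 *                           management.
	 *
	 * @tlb_flush_all: Synchronously invalidate the entire TLB context.
	 * @tlb_add_flush: Queue up a TLB invalidation for a virtual address
	 *                 range.
	 * @tlb_sync:      Ensure any queued TLB invalidation has taken effect.
	 * @flush_pgtable: Ensure page table updates are visible to the IOMMU.
	 */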

> +struct io_pgtable_cfg {
> +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> +	unsigned long		pgsize_bitmap;
> +	unsigned int		ias;
> +	unsigned int		oas;
> +	struct iommu_gather_ops	*tlb;
> +
> +	/* Low-level data specific to the table format */
> +	union {
> +	};
> +};
> +
> +enum io_pgtable_fmt {
> +	IO_PGTABLE_NUM_FMTS,
> +};
> +
> +struct io_pgtable {
> +	enum io_pgtable_fmt	fmt;
> +	void			*cookie;
> +	struct io_pgtable_cfg	cfg;
> +	struct io_pgtable_ops	ops;

This could be turned into a const pointer if we pass struct io_pgtable around 
instead of the ops.

> +};
> +
> +struct io_pgtable_init_fns {
> +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> +	void (*free)(struct io_pgtable *iop);
> +};

I would reorder the structures into two groups: one clearly marked as private,
which shouldn't be touched by IOMMU drivers, and one with the io_pgtable_fmt
enum and the io_pgtable_cfg struct grouped with the two functions below.

> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie);
> +
> +/*
> + * Free an io_pgtable_ops structure. The caller *must* ensure that the
> + * page table is no longer live, but the TLB can be dirty.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops);
> +
> +#endif /* __IO_PGTABLE_H */

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/4] iommu: introduce generic page table allocation framework
@ 2014-11-30 22:00         ` Laurent Pinchart
  0 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-11-30 22:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Thank you for the patch.

On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> This patch introduces a generic framework for allocating page tables for
> an IOMMU. There are a number of reasons we want to do this:
> 
>   - It avoids duplication of complex table management code in IOMMU
>     drivers that use the same page table format
> 
>   - It removes any coupling with the CPU table format (and even the
>     architecture!)
> 
>   - It defines an API for IOMMU TLB maintenance
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  drivers/iommu/Kconfig      |  8 ++++++
>  drivers/iommu/Makefile     |  1 +
>  drivers/iommu/io-pgtable.c | 71 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 145 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index dd5112265cc9..0f10554e7114 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -13,6 +13,14 @@ menuconfig IOMMU_SUPPORT
> 
>  if IOMMU_SUPPORT
> 
> +menu "Generic IOMMU Pagetable Support"
> +
> +# Selected by the actual pagetable implementations
> +config IOMMU_IO_PGTABLE
> +	bool
> +
> +endmenu
> +
>  config OF_IOMMU
>         def_bool y
>         depends on OF
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 16edef74b8ee..aff244c78181 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,6 +1,7 @@
>  obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> +obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
>  obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> new file mode 100644
> index 000000000000..82e39a0db94b
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.c
> @@ -0,0 +1,71 @@
> +/*
> + * Generic page table allocator for IOMMUs.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
> USA. + *
> + * Copyright (C) 2014 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon@arm.com>
> + */
> +
> +#include <linux/bug.h>
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +
> +#include "io-pgtable.h"
> +
> +static struct io_pgtable_init_fns

Any reason not to make the table const ?

> *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
> +{
> +};
> +
> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie)
> +{
> +	struct io_pgtable *iop;
> +	struct io_pgtable_init_fns *fns;
> +
> +	if (fmt >= IO_PGTABLE_NUM_FMTS)
> +		return NULL;
> +
> +	fns = io_pgtable_init_table[fmt];
> +	if (!fns)
> +		return NULL;
> +
> +	iop = fns->alloc(cfg, cookie);
> +	if (!iop)
> +		return NULL;
> +
> +	iop->fmt	= fmt;
> +	iop->cookie	= cookie;
> +	iop->cfg	= *cfg;
> +
> +	return &iop->ops;
> +}
> +
> +/*
> + * It is the IOMMU driver's responsibility to ensure that the page table
> + * is no longer accessible to the walker by this point.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops)
> +{
> +	struct io_pgtable *iop;
> +
> +	if (!ops)
> +		return;
> +
> +	iop = container_of(ops, struct io_pgtable, ops);
> +	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> +	io_pgtable_init_table[iop->fmt]->free(iop);
> +}
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> new file mode 100644
> index 000000000000..5ae75d9cae50
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.h
> @@ -0,0 +1,65 @@
> +#ifndef __IO_PGTABLE_H
> +#define __IO_PGTABLE_H
> +
> +struct io_pgtable_ops {
> +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,

How about passing a struct io_pgtable * instead of the ops pointer ? This 
would require returning a struct io_pgtable from the alloc function, which I 
suppose you didn't want to do to ensure the caller will not touch the struct 
io_pgtable fields directly. Do we really need to go that far, or can we simply 
document struct io_pgtable as being private to the pg alloc framework core and 
allocators ? Someone who really wants to get hold of the io_pgtable instance 
could use container_of on the ops anyway, like the allocators do.

> +		   phys_addr_t paddr, size_t size, int prot);
> +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> +		     size_t size);
> +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> +				    unsigned long iova);
> +};
> +
> +struct iommu_gather_ops {
> +	/* Synchronously invalidate the entire TLB context */
> +	void (*tlb_flush_all)(void *cookie);
> +
> +	/* Queue up a TLB invalidation for a virtual address range */
> +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> +			      void *cookie);

Is there a limit to the number of entries that can be queued, or any other 
kind of restriction? Implementing a completely generic TLB flush queue can
become complex for IOMMU drivers.

I would also document in which context(s) this callback will be called, as 
IOMMU drivers might be tempted to allocate memory in order to implement a TLB 
flush queue.
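
To illustrate the concern, a driver that wants to avoid allocating memory in
this callback is more or less limited to coalescing ranges; a rough sketch
with hypothetical driver structures (not part of the series):

struct my_smmu_domain {
	unsigned long	flush_start;
	unsigned long	flush_end;
	bool		flush_pending;
};

static void my_smmu_tlb_add_flush(unsigned long iova, size_t size, bool leaf,
				  void *cookie)
{
	struct my_smmu_domain *dom = cookie;

	/* Widen a single pending range instead of queueing every request. */
	if (!dom->flush_pending) {
		dom->flush_start = iova;
		dom->flush_end = iova + size;
		dom->flush_pending = true;
	} else {
		dom->flush_start = min(dom->flush_start, iova);
		dom->flush_end = max(dom->flush_end, iova + size);
	}
	/* The invalidation itself would then be issued from tlb_sync(). */
}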

> +	/* Ensure any queued TLB invalidation has taken effect */
> +	void (*tlb_sync)(void *cookie);
> +
> +	/* Ensure page tables updates are visible to the IOMMU */
> +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> +};

I suppose kerneldoc will come in the next version ;-)
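
For example, kerneldoc along these lines for the quoted callbacks (wording is
only a guess, taken from the inline comments above):

/**
 * struct iommu_gather_ops - TLB maintenance and flushing callbacks provided
 *                           by the IOMMU driver to the page table code.
 *
 * @tlb_flush_all: Synchronously invalidate the entire TLB context.
 * @tlb_add_flush: Queue up a TLB invalidation for a virtual address range.
 * @tlb_sync:      Ensure any queued TLB invalidation has taken effect.
 * @flush_pgtable: Ensure page table updates are visible to the IOMMU.
 */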

> +struct io_pgtable_cfg {
> +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> +	unsigned long		pgsize_bitmap;
> +	unsigned int		ias;
> +	unsigned int		oas;
> +	struct iommu_gather_ops	*tlb;
> +
> +	/* Low-level data specific to the table format */
> +	union {
> +	};
> +};
> +
> +enum io_pgtable_fmt {
> +	IO_PGTABLE_NUM_FMTS,
> +};
> +
> +struct io_pgtable {
> +	enum io_pgtable_fmt	fmt;
> +	void			*cookie;
> +	struct io_pgtable_cfg	cfg;
> +	struct io_pgtable_ops	ops;

This could be turned into a const pointer if we pass struct io_pgtable around 
instead of the ops.

> +};
> +
> +struct io_pgtable_init_fns {
> +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> +	void (*free)(struct io_pgtable *iop);
> +};

I would reorder the structures into two groups: one clearly marked as private,
which shouldn't be touched by IOMMU drivers, and then the io_pgtable_fmt enum
and the io_pgtable_cfg struct grouped with the two functions below.
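
A possible layout, as a sketch only:

/*
 * io-pgtable.h, roughly:
 *
 * 1. Driver-facing API:
 *      enum io_pgtable_fmt, struct iommu_gather_ops, struct io_pgtable_cfg,
 *      struct io_pgtable_ops, alloc_io_pgtable_ops(), free_io_pgtable_ops()
 *
 * 2. Private to the core and the per-format allocators (not for IOMMU
 *    drivers):
 *      struct io_pgtable, struct io_pgtable_init_fns
 */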

> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie);
> +
> +/*
> + * Free an io_pgtable_ops structure. The caller *must* ensure that the
> + * page table is no longer live, but the TLB can be dirty.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops);
> +
> +#endif /* __IO_PGTABLE_H */

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-11-27 11:51 ` Will Deacon
@ 2014-11-30 22:03     ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-11-30 22:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

Thank you for the patches.

On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> Hi all,
> 
> This series introduces a generic IOMMU page table allocation framework,
> implements support for ARM long-descriptors and then ports the arm-smmu
> driver over to the new code.
> 
> There are a few reasons for doing this:
> 
>   - Page table code is hard, and I don't enjoy shopping
> 
>   - A number of IOMMUs actually use the same table format, but currently
>     duplicate the code
> 
>   - It provides a CPU (and architecture) independent allocator, which
>     may be useful for some systems where the CPU is using a different
>     table format for its own mappings
> 
> As illustrated in the final patch, an IOMMU driver interacts with the
> allocator by passing in a configuration structure describing the
> input and output address ranges, the supported pages sizes and a set of
> ops for performing various TLB invalidation and PTE flushing routines.
> 
> The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> mappings, but I decided not to implement the contiguous bit in the
> interest of trying to keep the code semi-readable. This could always be
> added later, if needed.

Do you have any idea how much the contiguous bit can improve performance in
real use cases?

> I also included some self-tests for the LPAE implementation. Ideally
> we'd merge these, but I'm also happy to drop them if there are
> objections.
> 
> Tested with the self-tests, but also VFIO + MMU-500 at stage-1 and
> stage-2. Patches taken against my iommu/devel branch (queued by Joerg
> for 3.19).
> 
> All feedback welcome.
> 
> Will
> 
> --->8
> 
> Will Deacon (4):
>   iommu: introduce generic page table allocation framework
>   iommu: add ARM LPAE page table allocator
>   iommu: add self-consistency tests to ARM LPAE IO page table allocator
>   iommu/arm-smmu: make use of generic LPAE allocator
> 
>  MAINTAINERS                    |   1 +
>  arch/arm64/Kconfig             |   1 -
>  drivers/iommu/Kconfig          |  32 +-
>  drivers/iommu/Makefile         |   2 +
>  drivers/iommu/arm-smmu.c       | 872 +++++++++++---------------------------
>  drivers/iommu/io-pgtable-arm.c | 925 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |  78 ++++
>  drivers/iommu/io-pgtable.h     |  77 ++++
>  8 files changed, 1361 insertions(+), 627 deletions(-)
>  create mode 100644 drivers/iommu/io-pgtable-arm.c
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-11-27 11:51     ` Will Deacon
@ 2014-11-30 23:29         ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-11-30 23:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

Thank you for the patch.

On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> A number of IOMMUs found in ARM SoCs can walk architecture-compatible
> page tables.
> 
> This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
> long-descriptor page tables. 4k, 16k and 64k pages are supported, with
> up to 4-levels of walk to cover a 48-bit address space.
> 
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---
>  MAINTAINERS                    |   1 +
>  drivers/iommu/Kconfig          |   9 +
>  drivers/iommu/Makefile         |   1 +
>  drivers/iommu/io-pgtable-arm.c | 735 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |   7 +
>  drivers/iommu/io-pgtable.h     |  12 +
>  6 files changed, 765 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable-arm.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0ff630de8a6d..d3ca31b7c960 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1562,6 +1562,7 @@ M:	Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
>  L:	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
>  S:	Maintained
>  F:	drivers/iommu/arm-smmu.c
> +F:	drivers/iommu/io-pgtable-arm.c
> 
>  ARM64 PORT (AARCH64 ARCHITECTURE)
>  M:	Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 0f10554e7114..e1742a0146f8 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -19,6 +19,15 @@ menu "Generic IOMMU Pagetable Support"
>  config IOMMU_IO_PGTABLE
>  	bool
> 
> +config IOMMU_IO_PGTABLE_LPAE
> +	bool "ARMv7/v8 Long Descriptor Format"
> +	select IOMMU_IO_PGTABLE
> +	help
> +	  Enable support for the ARM long descriptor pagetable format.
> +	  This allocator supports 4K/2M/1G, 16K/32M and 64K/512M page
> +	  sizes at both stage-1 and stage-2, as well as address spaces
> +	  up to 48-bits in size.
> +
>  endmenu
> 
>  config OF_IOMMU
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index aff244c78181..269cdd82b672 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -2,6 +2,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
> +obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
>  obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> new file mode 100644
> index 000000000000..9dbaa2e48424
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -0,0 +1,735 @@
> +/*
> + * CPU-agnostic ARM page table allocator.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + *
> + * Copyright (C) 2014 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> + */
> +
> +#define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> +
> +#include <linux/iommu.h>
> +#include <linux/kernel.h>
> +#include <linux/sizes.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +
> +#include "io-pgtable.h"
> +
> +#define ARM_LPAE_MAX_ADDR_BITS		48
> +#define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
> +#define ARM_LPAE_MAX_LEVELS		4
> +
> +/* Struct accessors */
> +#define io_pgtable_to_data(x)						\
> +	container_of((x), struct arm_lpae_io_pgtable, iop)
> +
> +#define io_pgtable_ops_to_pgtable(x)					\
> +	container_of((x), struct io_pgtable, ops)
> +
> +#define io_pgtable_ops_to_data(x)					\
> +	io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> +
> +/*
> + * For consistency with the architecture, we always consider
> + * ARM_LPAE_MAX_LEVELS levels, with the walk starting at level n >=0
> + */
> +#define ARM_LPAE_START_LVL(d)	(ARM_LPAE_MAX_LEVELS - (d)->levels)
> +
> +/*
> + * Calculate the right shift amount to get to the portion describing level l
> + * in a virtual address mapped by the pagetable in d.
> + */
> +#define ARM_LPAE_LVL_SHIFT(l,d)						\
> +	((((d)->levels - ((l) - ARM_LPAE_START_LVL(d) + 1))		\
> +	  * (d)->bits_per_level) + (d)->pg_shift)
> +
> +/*
> + * Calculate the index at level l used to map virtual address a using the
> + * pagetable in d.
> + */
> +#define ARM_LPAE_PGD_IDX(l,d)						\
> +	((l) == ARM_LPAE_START_LVL(d) ? ilog2((d)->pages_per_pgd) : 0)
> +
> +#define ARM_LPAE_LVL_IDX(a,l,d)						\
> +	(((a) >> ARM_LPAE_LVL_SHIFT(l,d)) &				\
> +	 ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
> +
> +/* Calculate the block/page mapping size at level l for pagetable in d. */
> +#define ARM_LPAE_BLOCK_SIZE(l,d)					\
> +	(1 << (ilog2(sizeof(arm_lpae_iopte)) +				\
> +		((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level)))
> +
> +/* Page table bits */
> +#define ARM_LPAE_PTE_TYPE_SHIFT		0
> +#define ARM_LPAE_PTE_TYPE_MASK		0x3
> +
> +#define ARM_LPAE_PTE_TYPE_BLOCK		1
> +#define ARM_LPAE_PTE_TYPE_TABLE		3
> +#define ARM_LPAE_PTE_TYPE_PAGE		3
> +
> +#define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> +#define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
> +#define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
> +#define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> +#define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
> +#define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
> +
> +#define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> +/* Ignore the contiguous bit for block splitting */
> +#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> +#define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> +					 ARM_LPAE_PTE_ATTR_HI_MASK)
> +
> +/* Stage-1 PTE */
> +#define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> +#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
> +#define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
> +
> +/* Stage-2 PTE */
> +#define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
> +#define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
> +#define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
> +#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
> +#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
> +
> +/* Register bits */
> +#define ARM_LPAE_TCR_EAE		(1 << 31)
> +
> +#define ARM_LPAE_TCR_TG0_4K		(0 << 14)
> +#define ARM_LPAE_TCR_TG0_64K		(1 << 14)
> +#define ARM_LPAE_TCR_TG0_16K		(2 << 14)
> +
> +#define ARM_LPAE_TCR_SH0_SHIFT		12
> +#define ARM_LPAE_TCR_SH0_MASK		0x3
> +#define ARM_LPAE_TCR_SH_NS		0
> +#define ARM_LPAE_TCR_SH_OS		2
> +#define ARM_LPAE_TCR_SH_IS		3
> +
> +#define ARM_LPAE_TCR_ORGN0_SHIFT	10
> +#define ARM_LPAE_TCR_IRGN0_SHIFT	8
> +#define ARM_LPAE_TCR_RGN_MASK		0x3
> +#define ARM_LPAE_TCR_RGN_NC		0
> +#define ARM_LPAE_TCR_RGN_WBWA		1
> +#define ARM_LPAE_TCR_RGN_WT		2
> +#define ARM_LPAE_TCR_RGN_WB		3
> +
> +#define ARM_LPAE_TCR_SL0_SHIFT		6
> +#define ARM_LPAE_TCR_SL0_MASK		0x3
> +
> +#define ARM_LPAE_TCR_T0SZ_SHIFT		0
> +#define ARM_LPAE_TCR_SZ_MASK		0xf
> +
> +#define ARM_LPAE_TCR_PS_SHIFT		16
> +#define ARM_LPAE_TCR_PS_MASK		0x7
> +
> +#define ARM_LPAE_TCR_IPS_SHIFT		32
> +#define ARM_LPAE_TCR_IPS_MASK		0x7
> +
> +#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
> +#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
> +#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
> +#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
> +#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
> +#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
> +
> +#define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
> +#define ARM_LPAE_MAIR_ATTR_MASK		0xff
> +#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
> +#define ARM_LPAE_MAIR_ATTR_NC		0x44
> +#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
> +#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
> +#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
> +#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
> +
> +/* IOPTE accessors */
> +#define iopte_deref(pte,d)					\
> +	(__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)	\
> +	& ~((1ULL << (d)->pg_shift) - 1)))
> +
> +#define iopte_type(pte,l)					\
> +	(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
> +
> +#define iopte_prot(pte)	((pte) & ARM_LPAE_PTE_ATTR_MASK)
> +
> +#define iopte_leaf(pte,l)					\
> +	(l == (ARM_LPAE_MAX_LEVELS - 1) ?			\
> +		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) :	\
> +		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK))
> +
> +#define iopte_to_pfn(pte,d)					\
> +	(((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift)
> +
> +#define pfn_to_iopte(pfn,d)					\
> +	(((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1))
> +
> +struct arm_lpae_io_pgtable {
> +	struct io_pgtable	iop;
> +
> +	int			levels;
> +	int			pages_per_pgd;
> +	unsigned long		pg_shift;
> +	unsigned long		bits_per_level;
> +
> +	void			*pgd;
> +};
> +
> +typedef u64 arm_lpae_iopte;
> +
> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +			     unsigned long iova, phys_addr_t paddr,
> +			     arm_lpae_iopte prot, int lvl,
> +			     arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte pte = prot;
> +
> +	/* We require an unmap first */
> +	if (iopte_leaf(*ptep, lvl))
> +		return -EEXIST;
> +
> +	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> +		pte |= ARM_LPAE_PTE_TYPE_PAGE;
> +	else
> +		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
> +
> +	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
> +	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
> +
> +	*ptep = pte;
> +	data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), data->iop.cookie);
> +	return 0;
> +}
> +
> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +			  int lvl, arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte *cptep, pte;
> +	void *cookie = data->iop.cookie;
> +	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +	/* Find our entry at the current level */
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +	/* If we can install a leaf entry at this level, then do so */
> +	if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +		return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
> +
> +	/* We can't allocate tables at the final level */
> +	if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +		return -EINVAL;
> +
> +	/* Grab a pointer to the next level */
> +	pte = *ptep;
> +	if (!pte) {
> +		cptep = alloc_pages_exact(1UL << data->pg_shift,
> +					 GFP_ATOMIC | __GFP_ZERO);
> +		if (!cptep)
> +			return -ENOMEM;
> +
> +		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
> +						 cookie);
> +		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
> +		*ptep = pte;
> +		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +	} else {
> +		cptep = iopte_deref(pte, data);
> +	}
> +
> +	/* Rinse, repeat */
> +	return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep);
> +}
> +
> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> +					   int prot)
> +{
> +	arm_lpae_iopte pte;
> +
> +	if (data->iop.fmt == ARM_LPAE_S1) {
> +		pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +			pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +		if (prot & IOMMU_CACHE)
> +			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);

In my case I'll need to manage the NS bit (here and when allocating tables in
__arm_lpae_map). The exact requirements are not entirely clear at the moment,
I'm afraid: the datasheet doesn't clearly document secure behaviour, but tests
showed that setting the NS bit was necessary.

Given that arm_lpae_init_pte() will unconditionally set the AF and SH_IS bits,
you could set them here too, but that shouldn't make a big difference.
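
For what it's worth, one way this could be wired up is a quirk flag in
io_pgtable_cfg plus an extra PTE bit; a rough sketch with hypothetical names
(stage-1 block/page NS is bit 5 in the long-descriptor format), showing a
helper that arm_lpae_prot_to_pte() could call for the stage-1 case:

#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)	/* hypothetical quirk bit */
#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)

static arm_lpae_iopte arm_lpae_apply_ns(struct arm_lpae_io_pgtable *data,
					arm_lpae_iopte pte)
{
	/* Mark the mapping non-secure when the IOMMU needs NS set. */
	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
		pte |= ARM_LPAE_PTE_NS;

	return pte;
}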

> +	} else {
> +		pte = ARM_LPAE_PTE_HAP_FAULT;
> +		if (prot & IOMMU_READ)
> +			pte |= ARM_LPAE_PTE_HAP_READ;
> +		if (prot & IOMMU_WRITE)
> +			pte |= ARM_LPAE_PTE_HAP_WRITE;
> +		if (prot & IOMMU_CACHE)
> +			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +		else
> +			pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +	}
> +
> +	if (prot & IOMMU_NOEXEC)
> +		pte |= ARM_LPAE_PTE_XN;
> +
> +	return pte;
> +}
> +
> +static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int iommu_prot)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +	arm_lpae_iopte prot;
> +
> +	/* If no access, then nothing to do */
> +	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> +		return 0;

Shouldn't this create a faulting entry instead?

> +	prot = arm_lpae_prot_to_pte(data, iommu_prot);
> +	return __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep);
> +}
> +
> +static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
> +				    arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte *start, *end;
> +	unsigned long table_size;
> +
> +	/* Only leaf entries at the last level */
> +	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> +		return;
> +
> +	table_size = 1UL << data->pg_shift;
> +	if (lvl == ARM_LPAE_START_LVL(data))
> +		table_size *= data->pages_per_pgd;
> +
> +	start = ptep;
> +	end = (void *)ptep + table_size;
> +
> +	while (ptep != end) {
> +		arm_lpae_iopte pte = *ptep++;
> +
> +		if (!pte || iopte_leaf(pte, lvl))
> +			continue;
> +
> +		__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
> +	}
> +
> +	free_pages_exact(start, table_size);
> +}
> +
> +static void arm_lpae_free_pgtable(struct io_pgtable *iop)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
> +
> +	__arm_lpae_free_pgtable(data, ARM_LPAE_START_LVL(data), data->pgd);
> +	kfree(data);
> +}
> +
> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +				    unsigned long iova, size_t size,
> +				    arm_lpae_iopte prot, int lvl,
> +				    arm_lpae_iopte *ptep, size_t blk_size)
> +{
> +	unsigned long blk_start, blk_end;
> +	phys_addr_t blk_paddr;
> +	arm_lpae_iopte table = 0;
> +	void *cookie = data->iop.cookie;
> +	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +	blk_start = iova & ~(blk_size - 1);
> +	blk_end = blk_start + blk_size;
> +	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +		arm_lpae_iopte *tablep;
> +
> +		/* Unmap! */
> +		if (blk_start == iova)
> +			continue;
> +
> +		/* __arm_lpae_map expects a pointer to the start of the table */
> +		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
> +		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +				   tablep) < 0) {
> +			if (table) {
> +				/* Free the table we allocated */
> +				tablep = iopte_deref(table, data);
> +				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
> +			}
> +			return 0; /* Bytes unmapped */
> +		}
> +	}
> +
> +	*ptep = table;
> +	tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +	iova &= ~(blk_size - 1);
> +	tlb->tlb_add_flush(iova, blk_size, true, cookie);
> +	return size;
> +}
> +
> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +			    unsigned long iova, size_t size, int lvl,
> +			    arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte pte;
> +	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +	void *cookie = data->iop.cookie;
> +	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = *ptep;
> +
> +	/* Something went horribly wrong and we ran out of page table */
> +	if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +		return 0;
> +
> +	/* If the size matches this level, we're in the right place */
> +	if (size == blk_size) {
> +		*ptep = 0;
> +		tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +		if (!iopte_leaf(pte, lvl)) {
> +			/* Also flush any partial walks */
> +			tlb->tlb_add_flush(iova, size, false, cookie);
> +			tlb->tlb_sync(data->iop.cookie);
> +			ptep = iopte_deref(pte, data);
> +			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +		} else {
> +			tlb->tlb_add_flush(iova, size, true, cookie);
> +		}
> +
> +		return size;
> +	} else if (iopte_leaf(pte, lvl)) {
> +		/*
> +		 * Insert a table at the next level to map the old region,
> +		 * minus the part we want to unmap
> +		 */
> +		return arm_lpae_split_blk_unmap(data, iova, size,
> +						iopte_prot(pte), lvl, ptep,
> +						blk_size);
> +	}
> +
> +	/* Keep on walkin' */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
> +}
> +
> +static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
> +			  size_t size)
> +{
> +	size_t unmapped;
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable *iop = &data->iop;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +
> +	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
> +	if (unmapped)
> +		iop->cfg.tlb->tlb_sync(iop->cookie);
> +
> +	return unmapped;
> +}
> +
> +static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
> +					 unsigned long iova)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte pte, *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +
> +	do {
> +		/* Valid IOPTE pointer? */
> +		if (!ptep)
> +			return 0;
> +
> +		/* Grab the IOPTE we're interested in */
> +		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
> +
> +		/* Valid entry? */
> +		if (!pte)
> +			return 0;
> +
> +		/* Leaf entry? */
> +		if (iopte_leaf(pte,lvl))
> +			goto found_translation;
> +
> +		/* Take it to the next level */
> +		ptep = iopte_deref(pte, data);
> +	} while (++lvl < ARM_LPAE_MAX_LEVELS);
> +
> +	/* Ran out of page tables to walk */
> +	return 0;
> +
> +found_translation:
> +	iova &= ((1 << data->pg_shift) - 1);
> +	return ((phys_addr_t)iopte_to_pfn(pte,data) << data->pg_shift) | iova;
> +}
> +
> +static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
> +{
> +	unsigned long granule;
> +
> +	/*
> +	 * We need to restrict the supported page sizes to match the
> +	 * translation regime for a particular granule. Aim to match
> +	 * the CPU page size if possible, otherwise prefer smaller sizes.
> +	 * While we're at it, restrict the block sizes to match the
> +	 * chosen granule.
> +	 */
> +	if (cfg->pgsize_bitmap & PAGE_SIZE)
> +		granule = PAGE_SIZE;
> +	else if (cfg->pgsize_bitmap & ~PAGE_MASK)
> +		granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
> +	else if (cfg->pgsize_bitmap & PAGE_MASK)
> +		granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
> +	else
> +		granule = 0;
> +
> +	switch (granule) {
> +	case SZ_4K:
> +		cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
> +		break;
> +	case SZ_16K:
> +		cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
> +		break;
> +	case SZ_64K:
> +		cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
> +		break;
> +	default:
> +		cfg->pgsize_bitmap = 0;
> +	}
> +}
> +
> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
> +{
> +	unsigned long va_bits;
> +	struct arm_lpae_io_pgtable *data;
> +
> +	arm_lpae_restrict_pgsizes(cfg);
> +
> +	if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +		return NULL;
> +
> +	if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +		return NULL;
> +
> +	if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +		return NULL;
> +
> +	data = kmalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return NULL;
> +
> +	data->pages_per_pgd = 1;
> +	data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +	data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
> +
> +	va_bits = cfg->ias - data->pg_shift;
> +	data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> +
> +	data->iop.ops = (struct io_pgtable_ops) {
> +		.map		= arm_lpae_map,
> +		.unmap		= arm_lpae_unmap,
> +		.iova_to_phys	= arm_lpae_iova_to_phys,
> +	};
> +
> +	return data;
> +}
> +
> +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
> +						    void *cookie)
> +{
> +	u64 reg;
> +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +	if (!data)
> +		return NULL;
> +
> +	/* TCR */
> +	reg = ARM_LPAE_TCR_EAE |
> +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> +
> +	switch (1 << data->pg_shift) {
> +	case SZ_4K:
> +		reg |= ARM_LPAE_TCR_TG0_4K;
> +		break;
> +	case SZ_16K:
> +		reg |= ARM_LPAE_TCR_TG0_16K;
> +		break;
> +	case SZ_64K:
> +		reg |= ARM_LPAE_TCR_TG0_64K;
> +		break;
> +	}
> +
> +	switch (cfg->oas) {
> +	case 32:
> +		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 36:
> +		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 40:
> +		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 42:
> +		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 44:
> +		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 48:
> +		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	default:
> +		goto out_free_data;
> +	}
> +
> +	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> +	cfg->arm_lpae_s1_cfg.tcr = reg;
> +
> +	/* MAIRs */
> +	reg = (ARM_LPAE_MAIR_ATTR_NC
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
> +	      (ARM_LPAE_MAIR_ATTR_WBRWA
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
> +	      (ARM_LPAE_MAIR_ATTR_DEVICE
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
> +
> +	cfg->arm_lpae_s1_cfg.mair[0] = reg;
> +	cfg->arm_lpae_s1_cfg.mair[1] = 0;
> +
> +	/* Looking good; allocate a pgd */
> +	data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> +				      GFP_KERNEL | __GFP_ZERO);

data->pg_shift is computed as __ffs(cfg->pgsize_bitmap), so 1UL << data->pg_shift
is the smallest page size supported by the IOMMU, and this allocates 4kB, 16kB
or 64kB depending on the IOMMU configuration. However, if I'm not mistaken the
top-level directory only needs to store one entry per largest supported block
size, which works out to 4, 128 or 8 entries depending on the configuration.
You're thus over-allocating.
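
A quick back-of-the-envelope check, assuming a 32-bit input address space and
the fields computed in arm_lpae_alloc_pgtable() (hypothetical helper, just to
show the arithmetic):

/*
 *  4K granule:  pg_shift = 12, bits_per_level = 9,  levels = 3
 *               top level resolves 32 - 12 - 2 * 9  = 2 bits ->   4 entries
 *  16K granule: pg_shift = 14, bits_per_level = 11, levels = 2
 *               top level resolves 32 - 14 - 1 * 11 = 7 bits -> 128 entries
 *  64K granule: pg_shift = 16, bits_per_level = 13, levels = 2
 *               top level resolves 32 - 16 - 1 * 13 = 3 bits ->   8 entries
 */
static size_t arm_lpae_pgd_size_needed(struct arm_lpae_io_pgtable *data,
				       unsigned int ias)
{
	unsigned long va_bits = ias - data->pg_shift;
	unsigned long pgd_bits = va_bits -
				 (data->levels - 1) * data->bits_per_level;

	/* e.g. 4 entries * 8 bytes = 32 bytes, instead of a full 4kB page */
	return sizeof(arm_lpae_iopte) << pgd_bits;
}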

> +	if (!data->pgd)
> +		goto out_free_data;
> +
> +	cfg->tlb->flush_pgtable(data->pgd, (1UL << data->pg_shift), cookie);
> +
> +	/* TTBRs */
> +	cfg->arm_lpae_s1_cfg.ttbr[0] = virt_to_phys(data->pgd);
> +	cfg->arm_lpae_s1_cfg.ttbr[1] = 0;
> +	return &data->iop;
> +
> +out_free_data:
> +	kfree(data);
> +	return NULL;
> +}
> +
> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +						    void *cookie)
> +{
> +	u64 reg, sl;
> +	size_t pgd_size;
> +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +	if (!data)
> +		return NULL;
> +
> +	/*
> +	 * Concatenate PGDs at level 1 if possible in order to reduce
> +	 * the depth of the stage-2 walk.
> +	 */
> +	if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +		unsigned long pgd_bits, pgd_pages;
> +		unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +		pgd_bits = data->bits_per_level * (data->levels - 1);
> +		pgd_pages = 1 << (va_bits - pgd_bits);
> +		if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +			data->pages_per_pgd = pgd_pages;
> +			data->levels--;
> +		}
> +	}
> +
> +	/* VTCR */
> +	reg = ARM_LPAE_TCR_EAE |
> +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> +
> +	sl = ARM_LPAE_START_LVL(data);
> +
> +	switch (1 << data->pg_shift) {
> +	case SZ_4K:
> +		reg |= ARM_LPAE_TCR_TG0_4K;
> +		sl++; /* SL0 format is different for 4K granule size */
> +		break;
> +	case SZ_16K:
> +		reg |= ARM_LPAE_TCR_TG0_16K;
> +		break;
> +	case SZ_64K:
> +		reg |= ARM_LPAE_TCR_TG0_64K;
> +		break;
> +	}
> +
> +	switch (cfg->oas) {
> +	case 32:
> +		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 36:
> +		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 40:
> +		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 42:
> +		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 44:
> +		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 48:
> +		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	default:
> +		goto out_free_data;
> +	}
> +
> +	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> +	reg |= (~sl & ARM_LPAE_TCR_SL0_MASK) << ARM_LPAE_TCR_SL0_SHIFT;
> +	cfg->arm_lpae_s2_cfg.vtcr = reg;
> +
> +	/* Allocate pgd pages */
> +	pgd_size = data->pages_per_pgd * (1UL << data->pg_shift);
> +	data->pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
> +	if (!data->pgd)
> +		goto out_free_data;
> +
> +	cfg->tlb->flush_pgtable(data->pgd, pgd_size, cookie);
> +
> +	/* VTTBR */
> +	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
> +	return &data->iop;
> +
> +out_free_data:
> +	kfree(data);
> +	return NULL;
> +}
> +
> +struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns = {
> +	.alloc	= arm_lpae_alloc_pgtable_s1,
> +	.free	= arm_lpae_free_pgtable,
> +};
> +
> +struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
> +	.alloc	= arm_lpae_alloc_pgtable_s2,
> +	.free	= arm_lpae_free_pgtable,
> +};
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> index 82e39a0db94b..d0a2016efcb4 100644
> --- a/drivers/iommu/io-pgtable.c
> +++ b/drivers/iommu/io-pgtable.c
> @@ -25,8 +25,15 @@
> 
>  #include "io-pgtable.h"
> 
> +extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns;
> +extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns;
> +
>  static struct io_pgtable_init_fns *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =
>  {
> +#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE
> +	[ARM_LPAE_S1] = &io_pgtable_arm_lpae_s1_init_fns,
> +	[ARM_LPAE_S2] = &io_pgtable_arm_lpae_s2_init_fns,
> +#endif
>  };
> 
>  struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 5ae75d9cae50..c1cff3d045db 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -33,10 +33,22 @@ struct io_pgtable_cfg {
> 
>  	/* Low-level data specific to the table format */
>  	union {
> +		struct {
> +			u64	ttbr[2];
> +			u64	tcr;
> +			u64	mair[2];
> +		} arm_lpae_s1_cfg;
> +
> +		struct {
> +			u64	vttbr;
> +			u64	vtcr;
> +		} arm_lpae_s2_cfg;
>  	};
>  };
> 
>  enum io_pgtable_fmt {
> +	ARM_LPAE_S1,
> +	ARM_LPAE_S2,
>  	IO_PGTABLE_NUM_FMTS,
>  };

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-11-30 23:29         ` Laurent Pinchart
  0 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-11-30 23:29 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Thank you for the patch.

On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> A number of IOMMUs found in ARM SoCs can walk architecture-compatible
> page tables.
> 
> This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
> long-descriptor page tables. 4k, 16k and 64k pages are supported, with
> up to 4-levels of walk to cover a 48-bit address space.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  MAINTAINERS                    |   1 +
>  drivers/iommu/Kconfig          |   9 +
>  drivers/iommu/Makefile         |   1 +
>  drivers/iommu/io-pgtable-arm.c | 735 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |   7 +
>  drivers/iommu/io-pgtable.h     |  12 +
>  6 files changed, 765 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable-arm.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0ff630de8a6d..d3ca31b7c960 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1562,6 +1562,7 @@ M:	Will Deacon <will.deacon@arm.com>
>  L:	linux-arm-kernel at lists.infradead.org (moderated for non-subscribers)
>  S:	Maintained
>  F:	drivers/iommu/arm-smmu.c
> +F:	drivers/iommu/io-pgtable-arm.c
> 
>  ARM64 PORT (AARCH64 ARCHITECTURE)
>  M:	Catalin Marinas <catalin.marinas@arm.com>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 0f10554e7114..e1742a0146f8 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -19,6 +19,15 @@ menu "Generic IOMMU Pagetable Support"
>  config IOMMU_IO_PGTABLE
>  	bool
> 
> +config IOMMU_IO_PGTABLE_LPAE
> +	bool "ARMv7/v8 Long Descriptor Format"
> +	select IOMMU_IO_PGTABLE
> +	help
> +	  Enable support for the ARM long descriptor pagetable format.
> +	  This allocator supports 4K/2M/1G, 16K/32M and 64K/512M page
> +	  sizes at both stage-1 and stage-2, as well as address spaces
> +	  up to 48-bits in size.
> +
>  endmenu
> 
>  config OF_IOMMU
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index aff244c78181..269cdd82b672 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -2,6 +2,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
> +obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>  obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
>  obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
>  obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> new file mode 100644
> index 000000000000..9dbaa2e48424
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -0,0 +1,735 @@
> +/*
> + * CPU-agnostic ARM page table allocator.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + *
> + * Copyright (C) 2014 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon@arm.com>
> + */
> +
> +#define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> +
> +#include <linux/iommu.h>
> +#include <linux/kernel.h>
> +#include <linux/sizes.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +
> +#include "io-pgtable.h"
> +
> +#define ARM_LPAE_MAX_ADDR_BITS		48
> +#define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
> +#define ARM_LPAE_MAX_LEVELS		4
> +
> +/* Struct accessors */
> +#define io_pgtable_to_data(x)						\
> +	container_of((x), struct arm_lpae_io_pgtable, iop)
> +
> +#define io_pgtable_ops_to_pgtable(x)					\
> +	container_of((x), struct io_pgtable, ops)
> +
> +#define io_pgtable_ops_to_data(x)					\
> +	io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> +
> +/*
> + * For consistency with the architecture, we always consider
> + * ARM_LPAE_MAX_LEVELS levels, with the walk starting at level n >=0
> + */
> +#define ARM_LPAE_START_LVL(d)	(ARM_LPAE_MAX_LEVELS - (d)->levels)
> +
> +/*
> + * Calculate the right shift amount to get to the portion describing level
> l + * in a virtual address mapped by the pagetable in d.
> + */
> +#define ARM_LPAE_LVL_SHIFT(l,d)						\
> +	((((d)->levels - ((l) - ARM_LPAE_START_LVL(d) + 1))		\
> +	  * (d)->bits_per_level) + (d)->pg_shift)
> +
> +/*
> + * Calculate the index at level l used to map virtual address a using the
> + * pagetable in d.
> + */
> +#define ARM_LPAE_PGD_IDX(l,d)						\
> +	((l) == ARM_LPAE_START_LVL(d) ? ilog2((d)->pages_per_pgd) : 0)
> +
> +#define ARM_LPAE_LVL_IDX(a,l,d)						\
> +	(((a) >> ARM_LPAE_LVL_SHIFT(l,d)) &				\
> +	 ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
> +
> +/* Calculate the block/page mapping size at level l for pagetable in d. */
> +#define ARM_LPAE_BLOCK_SIZE(l,d)					\
> +	(1 << (ilog2(sizeof(arm_lpae_iopte)) +				\
> +		((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level)))
> +
> +/* Page table bits */
> +#define ARM_LPAE_PTE_TYPE_SHIFT		0
> +#define ARM_LPAE_PTE_TYPE_MASK		0x3
> +
> +#define ARM_LPAE_PTE_TYPE_BLOCK		1
> +#define ARM_LPAE_PTE_TYPE_TABLE		3
> +#define ARM_LPAE_PTE_TYPE_PAGE		3
> +
> +#define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> +#define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
> +#define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
> +#define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> +#define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
> +#define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
> +
> +#define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> +/* Ignore the contiguous bit for block splitting */
> +#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> +#define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> +					 ARM_LPAE_PTE_ATTR_HI_MASK)
> +
> +/* Stage-1 PTE */
> +#define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> +#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
> +#define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
> +
> +/* Stage-2 PTE */
> +#define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
> +#define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
> +#define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
> +#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
> +#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
> +
> +/* Register bits */
> +#define ARM_LPAE_TCR_EAE		(1 << 31)
> +
> +#define ARM_LPAE_TCR_TG0_4K		(0 << 14)
> +#define ARM_LPAE_TCR_TG0_64K		(1 << 14)
> +#define ARM_LPAE_TCR_TG0_16K		(2 << 14)
> +
> +#define ARM_LPAE_TCR_SH0_SHIFT		12
> +#define ARM_LPAE_TCR_SH0_MASK		0x3
> +#define ARM_LPAE_TCR_SH_NS		0
> +#define ARM_LPAE_TCR_SH_OS		2
> +#define ARM_LPAE_TCR_SH_IS		3
> +
> +#define ARM_LPAE_TCR_ORGN0_SHIFT	10
> +#define ARM_LPAE_TCR_IRGN0_SHIFT	8
> +#define ARM_LPAE_TCR_RGN_MASK		0x3
> +#define ARM_LPAE_TCR_RGN_NC		0
> +#define ARM_LPAE_TCR_RGN_WBWA		1
> +#define ARM_LPAE_TCR_RGN_WT		2
> +#define ARM_LPAE_TCR_RGN_WB		3
> +
> +#define ARM_LPAE_TCR_SL0_SHIFT		6
> +#define ARM_LPAE_TCR_SL0_MASK		0x3
> +
> +#define ARM_LPAE_TCR_T0SZ_SHIFT		0
> +#define ARM_LPAE_TCR_SZ_MASK		0xf
> +
> +#define ARM_LPAE_TCR_PS_SHIFT		16
> +#define ARM_LPAE_TCR_PS_MASK		0x7
> +
> +#define ARM_LPAE_TCR_IPS_SHIFT		32
> +#define ARM_LPAE_TCR_IPS_MASK		0x7
> +
> +#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
> +#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
> +#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
> +#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
> +#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
> +#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
> +
> +#define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
> +#define ARM_LPAE_MAIR_ATTR_MASK		0xff
> +#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
> +#define ARM_LPAE_MAIR_ATTR_NC		0x44
> +#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
> +#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
> +#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
> +#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
> +
> +/* IOPTE accessors */
> +#define iopte_deref(pte,d)					\
> +	(__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)	\
> +	& ~((1ULL << (d)->pg_shift) - 1)))
> +
> +#define iopte_type(pte,l)					\
> +	(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
> +
> +#define iopte_prot(pte)	((pte) & ARM_LPAE_PTE_ATTR_MASK)
> +
> +#define iopte_leaf(pte,l)					\
> +	(l == (ARM_LPAE_MAX_LEVELS - 1) ?			\
> +		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) :	\
> +		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK))
> +
> +#define iopte_to_pfn(pte,d)					\
> +	(((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift)
> +
> +#define pfn_to_iopte(pfn,d)					\
> +	(((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1))
> +
> +struct arm_lpae_io_pgtable {
> +	struct io_pgtable	iop;
> +
> +	int			levels;
> +	int			pages_per_pgd;
> +	unsigned long		pg_shift;
> +	unsigned long		bits_per_level;
> +
> +	void			*pgd;
> +};
> +
> +typedef u64 arm_lpae_iopte;
> +
> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +			     unsigned long iova, phys_addr_t paddr,
> +			     arm_lpae_iopte prot, int lvl,
> +			     arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte pte = prot;
> +
> +	/* We require an unmap first */
> +	if (iopte_leaf(*ptep, lvl))
> +		return -EEXIST;
> +
> +	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> +		pte |= ARM_LPAE_PTE_TYPE_PAGE;
> +	else
> +		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
> +
> +	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
> +	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
> +
> +	*ptep = pte;
> +	data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), data->iop.cookie);
> +	return 0;
> +}
> +
> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long
> iova, +			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +			  int lvl, arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte *cptep, pte;
> +	void *cookie = data->iop.cookie;
> +	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +	/* Find our entry at the current level */
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +	/* If we can install a leaf entry at this level, then do so */
> +	if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +		return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
> +
> +	/* We can't allocate tables at the final level */
> +	if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +		return -EINVAL;
> +
> +	/* Grab a pointer to the next level */
> +	pte = *ptep;
> +	if (!pte) {
> +		cptep = alloc_pages_exact(1UL << data->pg_shift,
> +					 GFP_ATOMIC | __GFP_ZERO);
> +		if (!cptep)
> +			return -ENOMEM;
> +
> +		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
> +						 cookie);
> +		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
> +		*ptep = pte;
> +		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +	} else {
> +		cptep = iopte_deref(pte, data);
> +	}
> +
> +	/* Rinse, repeat */
> +	return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep);
> +}
> +
> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable
> *data, +					   int prot)
> +{
> +	arm_lpae_iopte pte;
> +
> +	if (data->iop.fmt == ARM_LPAE_S1) {
> +		pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +			pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +		if (prot & IOMMU_CACHE)
> +			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);

In my case I'll need to manage the NS bit (here and when allocating tables in 
__arm_lpae_map). The exact requirements are not exactly clear@the moment 
I'm afraid, the datasheet doesn't clearly document secure behaviour, but tests 
showed that setting the NS was necessary.

Given that arm_lpae_init_pte() will unconditionally set the AF and SH_IS bits 
you could set them here too, but that shouldn't make a big difference.

> +	} else {
> +		pte = ARM_LPAE_PTE_HAP_FAULT;
> +		if (prot & IOMMU_READ)
> +			pte |= ARM_LPAE_PTE_HAP_READ;
> +		if (prot & IOMMU_WRITE)
> +			pte |= ARM_LPAE_PTE_HAP_WRITE;
> +		if (prot & IOMMU_CACHE)
> +			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +		else
> +			pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +	}
> +
> +	if (prot & IOMMU_NOEXEC)
> +		pte |= ARM_LPAE_PTE_XN;
> +
> +	return pte;
> +}
> +
> +static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int iommu_prot)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +	arm_lpae_iopte prot;
> +
> +	/* If no access, then nothing to do */
> +	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> +		return 0;

Shouldn't this create a faulting entry instead ?

> +	prot = arm_lpae_prot_to_pte(data, iommu_prot);
> +	return __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep);
> +}
> +
> +static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int
> lvl, +				    arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte *start, *end;
> +	unsigned long table_size;
> +
> +	/* Only leaf entries at the last level */
> +	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> +		return;
> +
> +	table_size = 1UL << data->pg_shift;
> +	if (lvl == ARM_LPAE_START_LVL(data))
> +		table_size *= data->pages_per_pgd;
> +
> +	start = ptep;
> +	end = (void *)ptep + table_size;
> +
> +	while (ptep != end) {
> +		arm_lpae_iopte pte = *ptep++;
> +
> +		if (!pte || iopte_leaf(pte, lvl))
> +			continue;
> +
> +		__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
> +	}
> +
> +	free_pages_exact(start, table_size);
> +}
> +
> +static void arm_lpae_free_pgtable(struct io_pgtable *iop)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
> +
> +	__arm_lpae_free_pgtable(data, ARM_LPAE_START_LVL(data), data->pgd);
> +	kfree(data);
> +}
> +
> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +				    unsigned long iova, size_t size,
> +				    arm_lpae_iopte prot, int lvl,
> +				    arm_lpae_iopte *ptep, size_t blk_size)
> +{
> +	unsigned long blk_start, blk_end;
> +	phys_addr_t blk_paddr;
> +	arm_lpae_iopte table = 0;
> +	void *cookie = data->iop.cookie;
> +	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +	blk_start = iova & ~(blk_size - 1);
> +	blk_end = blk_start + blk_size;
> +	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +		arm_lpae_iopte *tablep;
> +
> +		/* Unmap! */
> +		if (blk_start == iova)
> +			continue;
> +
> +		/* __arm_lpae_map expects a pointer to the start of the table */
> +		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
> +		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +				   tablep) < 0) {
> +			if (table) {
> +				/* Free the table we allocated */
> +				tablep = iopte_deref(table, data);
> +				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
> +			}
> +			return 0; /* Bytes unmapped */
> +		}
> +	}
> +
> +	*ptep = table;
> +	tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +	iova &= ~(blk_size - 1);
> +	tlb->tlb_add_flush(iova, blk_size, true, cookie);
> +	return size;
> +}
> +
> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +			    unsigned long iova, size_t size, int lvl,
> +			    arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte pte;
> +	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +	void *cookie = data->iop.cookie;
> +	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = *ptep;
> +
> +	/* Something went horribly wrong and we ran out of page table */
> +	if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +		return 0;
> +
> +	/* If the size matches this level, we're in the right place */
> +	if (size == blk_size) {
> +		*ptep = 0;
> +		tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +		if (!iopte_leaf(pte, lvl)) {
> +			/* Also flush any partial walks */
> +			tlb->tlb_add_flush(iova, size, false, cookie);
> +			tlb->tlb_sync(data->iop.cookie);
> +			ptep = iopte_deref(pte, data);
> +			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +		} else {
> +			tlb->tlb_add_flush(iova, size, true, cookie);
> +		}
> +
> +		return size;
> +	} else if (iopte_leaf(pte, lvl)) {
> +		/*
> +		 * Insert a table at the next level to map the old region,
> +		 * minus the part we want to unmap
> +		 */
> +		return arm_lpae_split_blk_unmap(data, iova, size,
> +						iopte_prot(pte), lvl, ptep,
> +						blk_size);
> +	}
> +
> +	/* Keep on walkin' */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
> +}
> +
> +static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
> +			  size_t size)
> +{
> +	size_t unmapped;
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable *iop = &data->iop;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +
> +	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
> +	if (unmapped)
> +		iop->cfg.tlb->tlb_sync(iop->cookie);
> +
> +	return unmapped;
> +}
> +
> +static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
> +					 unsigned long iova)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte pte, *ptep = data->pgd;
> +	int lvl = ARM_LPAE_START_LVL(data);
> +
> +	do {
> +		/* Valid IOPTE pointer? */
> +		if (!ptep)
> +			return 0;
> +
> +		/* Grab the IOPTE we're interested in */
> +		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
> +
> +		/* Valid entry? */
> +		if (!pte)
> +			return 0;
> +
> +		/* Leaf entry? */
> +		if (iopte_leaf(pte,lvl))
> +			goto found_translation;
> +
> +		/* Take it to the next level */
> +		ptep = iopte_deref(pte, data);
> +	} while (++lvl < ARM_LPAE_MAX_LEVELS);
> +
> +	/* Ran out of page tables to walk */
> +	return 0;
> +
> +found_translation:
> +	iova &= ((1 << data->pg_shift) - 1);
> +	return ((phys_addr_t)iopte_to_pfn(pte,data) << data->pg_shift) | iova;
> +}
> +
> +static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
> +{
> +	unsigned long granule;
> +
> +	/*
> +	 * We need to restrict the supported page sizes to match the
> +	 * translation regime for a particular granule. Aim to match
> +	 * the CPU page size if possible, otherwise prefer smaller sizes.
> +	 * While we're at it, restrict the block sizes to match the
> +	 * chosen granule.
> +	 */
> +	if (cfg->pgsize_bitmap & PAGE_SIZE)
> +		granule = PAGE_SIZE;
> +	else if (cfg->pgsize_bitmap & ~PAGE_MASK)
> +		granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
> +	else if (cfg->pgsize_bitmap & PAGE_MASK)
> +		granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
> +	else
> +		granule = 0;
> +
> +	switch (granule) {
> +	case SZ_4K:
> +		cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
> +		break;
> +	case SZ_16K:
> +		cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
> +		break;
> +	case SZ_64K:
> +		cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
> +		break;
> +	default:
> +		cfg->pgsize_bitmap = 0;
> +	}
> +}
> +
> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
> +{
> +	unsigned long va_bits;
> +	struct arm_lpae_io_pgtable *data;
> +
> +	arm_lpae_restrict_pgsizes(cfg);
> +
> +	if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +		return NULL;
> +
> +	if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +		return NULL;
> +
> +	if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +		return NULL;
> +
> +	data = kmalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return NULL;
> +
> +	data->pages_per_pgd = 1;
> +	data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +	data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
> +
> +	va_bits = cfg->ias - data->pg_shift;
> +	data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> +
> +	data->iop.ops = (struct io_pgtable_ops) {
> +		.map		= arm_lpae_map,
> +		.unmap		= arm_lpae_unmap,
> +		.iova_to_phys	= arm_lpae_iova_to_phys,
> +	};
> +
> +	return data;
> +}
> +
> +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg
> *cfg, +						    void *cookie)
> +{
> +	u64 reg;
> +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +	if (!data)
> +		return NULL;
> +
> +	/* TCR */
> +	reg = ARM_LPAE_TCR_EAE |
> +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> +
> +	switch (1 << data->pg_shift) {
> +	case SZ_4K:
> +		reg |= ARM_LPAE_TCR_TG0_4K;
> +		break;
> +	case SZ_16K:
> +		reg |= ARM_LPAE_TCR_TG0_16K;
> +		break;
> +	case SZ_64K:
> +		reg |= ARM_LPAE_TCR_TG0_64K;
> +		break;
> +	}
> +
> +	switch (cfg->oas) {
> +	case 32:
> +		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 36:
> +		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 40:
> +		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 42:
> +		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 44:
> +		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	case 48:
> +		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> +		break;
> +	default:
> +		goto out_free_data;
> +	}
> +
> +	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> +	cfg->arm_lpae_s1_cfg.tcr = reg;
> +
> +	/* MAIRs */
> +	reg = (ARM_LPAE_MAIR_ATTR_NC
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
> +	      (ARM_LPAE_MAIR_ATTR_WBRWA
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
> +	      (ARM_LPAE_MAIR_ATTR_DEVICE
> +	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
> +
> +	cfg->arm_lpae_s1_cfg.mair[0] = reg;
> +	cfg->arm_lpae_s1_cfg.mair[1] = 0;
> +
> +	/* Looking good; allocate a pgd */
> +	data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> +				      GFP_KERNEL | __GFP_ZERO);

data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL << data->pg_shift 
will thus be equal to the smallest page size supported by the IOMMU. This will 
thus allocate 4kB, 16kB or 64kB depending on the IOMMU configuration. However, 
if I'm not mistaken the top-level directory needs to store one entry per 
largest supported page size. That's 4, 128 or 8 entries depending on the 
configuration. You're thus over-allocating.
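
As a rough illustration (the helper below is invented for this mail, not taken
from your patch), the space actually needed for the top-level table follows
from the ias and the granule, using the same level arithmetic as the allocator:

/* Sketch only: bytes needed for the pgd, assuming 8-byte PTEs. */
static size_t example_pgd_size(unsigned int ias, unsigned int pg_shift)
{
	unsigned int bits_per_level = pg_shift - ilog2(sizeof(arm_lpae_iopte));
	unsigned int va_bits = ias - pg_shift;
	unsigned int levels = DIV_ROUND_UP(va_bits, bits_per_level);
	unsigned int pgd_bits = va_bits - (levels - 1) * bits_per_level;

	return (1UL << pgd_bits) * sizeof(arm_lpae_iopte);
}

For a 32-bit ias that gives 32, 1024 or 64 bytes rather than a full
4kB/16kB/64kB granule.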

> +	if (!data->pgd)
> +		goto out_free_data;
> +
> +	cfg->tlb->flush_pgtable(data->pgd, (1UL << data->pg_shift), cookie);
> +
> +	/* TTBRs */
> +	cfg->arm_lpae_s1_cfg.ttbr[0] = virt_to_phys(data->pgd);
> +	cfg->arm_lpae_s1_cfg.ttbr[1] = 0;
> +	return &data->iop;
> +
> +out_free_data:
> +	kfree(data);
> +	return NULL;
> +}
> +
> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +						    void *cookie)
> +{
> +	u64 reg, sl;
> +	size_t pgd_size;
> +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +	if (!data)
> +		return NULL;
> +
> +	/*
> +	 * Concatenate PGDs at level 1 if possible in order to reduce
> +	 * the depth of the stage-2 walk.
> +	 */
> +	if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +		unsigned long pgd_bits, pgd_pages;
> +		unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +		pgd_bits = data->bits_per_level * (data->levels - 1);
> +		pgd_pages = 1 << (va_bits - pgd_bits);
> +		if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +			data->pages_per_pgd = pgd_pages;
> +			data->levels--;
> +		}
> +	}
> +
> +	/* VTCR */
> +	reg = ARM_LPAE_TCR_EAE |
> +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> +
> +	sl = ARM_LPAE_START_LVL(data);
> +
> +	switch (1 << data->pg_shift) {
> +	case SZ_4K:
> +		reg |= ARM_LPAE_TCR_TG0_4K;
> +		sl++; /* SL0 format is different for 4K granule size */
> +		break;
> +	case SZ_16K:
> +		reg |= ARM_LPAE_TCR_TG0_16K;
> +		break;
> +	case SZ_64K:
> +		reg |= ARM_LPAE_TCR_TG0_64K;
> +		break;
> +	}
> +
> +	switch (cfg->oas) {
> +	case 32:
> +		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 36:
> +		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 40:
> +		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 42:
> +		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 44:
> +		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	case 48:
> +		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_PS_SHIFT);
> +		break;
> +	default:
> +		goto out_free_data;
> +	}
> +
> +	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> +	reg |= (~sl & ARM_LPAE_TCR_SL0_MASK) << ARM_LPAE_TCR_SL0_SHIFT;
> +	cfg->arm_lpae_s2_cfg.vtcr = reg;
> +
> +	/* Allocate pgd pages */
> +	pgd_size = data->pages_per_pgd * (1UL << data->pg_shift);
> +	data->pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
> +	if (!data->pgd)
> +		goto out_free_data;
> +
> +	cfg->tlb->flush_pgtable(data->pgd, pgd_size, cookie);
> +
> +	/* VTTBR */
> +	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
> +	return &data->iop;
> +
> +out_free_data:
> +	kfree(data);
> +	return NULL;
> +}
> +
> +struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns = {
> +	.alloc	= arm_lpae_alloc_pgtable_s1,
> +	.free	= arm_lpae_free_pgtable,
> +};
> +
> +struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
> +	.alloc	= arm_lpae_alloc_pgtable_s2,
> +	.free	= arm_lpae_free_pgtable,
> +};
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> index 82e39a0db94b..d0a2016efcb4 100644
> --- a/drivers/iommu/io-pgtable.c
> +++ b/drivers/iommu/io-pgtable.c
> @@ -25,8 +25,15 @@
> 
>  #include "io-pgtable.h"
> 
> +extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns;
> +extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns;
> +
>  static struct io_pgtable_init_fns
> *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] = {
> +#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE
> +	[ARM_LPAE_S1] = &io_pgtable_arm_lpae_s1_init_fns,
> +	[ARM_LPAE_S2] = &io_pgtable_arm_lpae_s2_init_fns,
> +#endif
>  };
> 
>  struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 5ae75d9cae50..c1cff3d045db 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -33,10 +33,22 @@ struct io_pgtable_cfg {
> 
>  	/* Low-level data specific to the table format */
>  	union {
> +		struct {
> +			u64	ttbr[2];
> +			u64	tcr;
> +			u64	mair[2];
> +		} arm_lpae_s1_cfg;
> +
> +		struct {
> +			u64	vttbr;
> +			u64	vtcr;
> +		} arm_lpae_s2_cfg;
>  	};
>  };
> 
>  enum io_pgtable_fmt {
> +	ARM_LPAE_S1,
> +	ARM_LPAE_S2,
>  	IO_PGTABLE_NUM_FMTS,
>  };

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-11-30 22:03     ` Laurent Pinchart
@ 2014-12-01 12:05       ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-01 12:05 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Nov 30, 2014 at 10:03:08PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hi Laurent,

> On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > Hi all,
> > 
> > This series introduces a generic IOMMU page table allocation framework,
> > implements support for ARM long-descriptors and then ports the arm-smmu
> > driver over to the new code.
> > 
> > There are a few reasons for doing this:
> > 
> >   - Page table code is hard, and I don't enjoy shopping
> > 
> >   - A number of IOMMUs actually use the same table format, but currently
> >     duplicate the code
> > 
> >   - It provides a CPU (and architecture) independent allocator, which
> >     may be useful for some systems where the CPU is using a different
> >     table format for its own mappings
> > 
> > As illustrated in the final patch, an IOMMU driver interacts with the
> > allocator by passing in a configuration structure describing the
> > input and output address ranges, the supported pages sizes and a set of
> > ops for performing various TLB invalidation and PTE flushing routines.
> > 
> > The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> > mappings, but I decided not to implement the contiguous bit in the
> > interest of trying to keep the code semi-readable. This could always be
> > added later, if needed.
> 
> Do you have any idea how much the contiguous bit can improve performance in
> real use cases?

It depends on the TLB, really. Given that the contiguous sizes map directly
onto block sizes using different granules, I didn't see that the complexity
was worth it.

For example:

   4k granule : 16 contiguous entries => {64k, 32M, 16G}
  16k granule : 128 contiguous lvl3 entries => 2M
                32 contiguous lvl2 entries => 1G
  64k granule : 32 contiguous entries => {2M, 16G}

If we use block mappings, then we get:

   4k granule : 2M @ lvl2, 1G @ lvl1
  16k granule : 32M @ lvl2
  64k granule : 512M @ lvl2

so really, we only miss the ability to create 16G mappings. I doubt
that hardware even implements that size in the TLB (the contiguous bit
is only a hint).

On top of that, the contiguous bit leads to additional expense on unmap,
since you have extra TLB invalidation splitting the thing into
non-contiguous pages before you can do anything.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-11-30 22:00         ` Laurent Pinchart
@ 2014-12-01 12:13           ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-01 12:13 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Nov 30, 2014 at 10:00:21PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hi Laurent,

> Thank you for the patch.

Cheers for the review!

> On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> > diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> > new file mode 100644
> > index 000000000000..82e39a0db94b
> > --- /dev/null
> > +++ b/drivers/iommu/io-pgtable.c
> > @@ -0,0 +1,71 @@
> > +/*
> > + * Generic page table allocator for IOMMUs.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + *
> > + * Copyright (C) 2014 ARM Limited
> > + *
> > + * Author: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> > + */
> > +
> > +#include <linux/bug.h>
> > +#include <linux/kernel.h>
> > +#include <linux/types.h>
> > +
> > +#include "io-pgtable.h"
> > +
> > +static struct io_pgtable_init_fns
> 
> Any reason not to make the table const ?

No reason, I'll give it a go.

> > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > new file mode 100644
> > index 000000000000..5ae75d9cae50
> > --- /dev/null
> > +++ b/drivers/iommu/io-pgtable.h
> > @@ -0,0 +1,65 @@
> > +#ifndef __IO_PGTABLE_H
> > +#define __IO_PGTABLE_H
> > +
> > +struct io_pgtable_ops {
> > +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> 
> How about passing a struct io_pgtable * instead of the ops pointer ? This 
> would require returning a struct io_pgtable from the alloc function, which I 
> suppose you didn't want to do to ensure the caller will not touch the struct 
> io_pgtable fields directly. Do we really need to go that far, or can we simply 
> document struct io_pgtable as being private to the pg alloc framework core and 
> allocators ? Someone who really wants to get hold of the io_pgtable instance 
> could use container_of on the ops anyway, like the allocators do.

Hmm, currently the struct io_pgtable is private to the page table allocator,
so I don't like the IOMMU driver having an explicit handle to that. I also
like having the value returned from alloc_io_pgtable_ops being used as the
handle to pass around -- it keeps things simple for the caller because
there's one structure that you get back and that's the thing you use as a
reference.

What do we gain by returning the struct io_pgtable pointer instead?

> > +		   phys_addr_t paddr, size_t size, int prot);
> > +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> > +		     size_t size);
> > +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> > +				    unsigned long iova);
> > +};
> > +
> > +struct iommu_gather_ops {
> > +	/* Synchronously invalidate the entire TLB context */
> > +	void (*tlb_flush_all)(void *cookie);
> > +
> > +	/* Queue up a TLB invalidation for a virtual address range */
> > +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> > +			      void *cookie);
> 
> Is there a limit to the number of entries that can be queued, or any other 
> kind of restriction ? Implementing a completely generic TLB flush queue can 
> become complex for IOMMU drivers.

I think it's only as complicated as you decide to make it. For example, 
you could just issue the TLBI directly in the add_flush callback (like I
do for the arm-smmu driver), but then don't bother polling the hardware
for completion until the sync callback.
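
As a sketch (the foo_* driver hooks are made up; only the callback signatures
come from io-pgtable.h), the simplest scheme is something like:

/* Hypothetical driver: invalidate eagerly, defer only the completion wait. */
static void foo_tlb_add_flush(unsigned long iova, size_t size, bool leaf,
			      void *cookie)
{
	struct foo_domain *dom = cookie;

	foo_hw_issue_tlbi_range(dom, iova, size, leaf);
}

static void foo_tlb_sync(void *cookie)
{
	struct foo_domain *dom = cookie;

	/* Only now poll the hardware for completion of the invalidations. */
	foo_hw_tlbi_wait(dom);
}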

> I would also document in which context(s) this callback will be called, as 
> IOMMU drivers might be tempted to allocate memory in order to implement a TLB 
> flush queue.

Good idea.

> > +	/* Ensure any queued TLB invalidation has taken effect */
> > +	void (*tlb_sync)(void *cookie);
> > +
> > +	/* Ensure page tables updates are visible to the IOMMU */
> > +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> > +};
> 
> I suppose kerneldoc will come in the next version ;-)

Bah, ok then, if you insist!

> > +struct io_pgtable_cfg {
> > +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> > +	unsigned long		pgsize_bitmap;
> > +	unsigned int		ias;
> > +	unsigned int		oas;
> > +	struct iommu_gather_ops	*tlb;
> > +
> > +	/* Low-level data specific to the table format */
> > +	union {
> > +	};
> > +};
> > +
> > +enum io_pgtable_fmt {
> > +	IO_PGTABLE_NUM_FMTS,
> > +};
> > +
> > +struct io_pgtable {
> > +	enum io_pgtable_fmt	fmt;
> > +	void			*cookie;
> > +	struct io_pgtable_cfg	cfg;
> > +	struct io_pgtable_ops	ops;
> 
> This could be turned into a const pointer if we pass struct io_pgtable around 
> instead of the ops.
> 
> > +};
> > +
> > +struct io_pgtable_init_fns {
> > +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> > +	void (*free)(struct io_pgtable *iop);
> > +};
> 
> I would reorder structures into two groups, one clearly marked as private that 
> shouldn't be touched by IOMMU drivers, and then the io_pgtable_fmt enum and 
> the io_pgtable_cfg struct grouped with the two functions below.

Sure.

Thanks again for the review.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-12-01 12:13           ` Will Deacon
@ 2014-12-01 13:33               ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-01 13:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Monday 01 December 2014 12:13:38 Will Deacon wrote:
> On Sun, Nov 30, 2014 at 10:00:21PM +0000, Laurent Pinchart wrote:
> > On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> > > diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> > > new file mode 100644
> > > index 000000000000..82e39a0db94b
> > > --- /dev/null
> > > +++ b/drivers/iommu/io-pgtable.c
> > > @@ -0,0 +1,71 @@
> > > +/*
> > > + * Generic page table allocator for IOMMUs.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program; if not, write to the Free Software
> > > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

By the way you can remove this paragraph, we don't want to update all source 
files the day the FSF decides to move to a new address.

> > > + *
> > > + * Copyright (C) 2014 ARM Limited
> > > + *
> > > + * Author: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> > > + */
> > > +
> > > +#include <linux/bug.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/types.h>
> > > +
> > > +#include "io-pgtable.h"
> > > +
> > > +static struct io_pgtable_init_fns
> > 
> > Any reason not to make the table const ?
> 
> No reason, I'll give it a go.
> 
> > > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > > new file mode 100644
> > > index 000000000000..5ae75d9cae50
> > > --- /dev/null
> > > +++ b/drivers/iommu/io-pgtable.h
> > > @@ -0,0 +1,65 @@
> > > +#ifndef __IO_PGTABLE_H
> > > +#define __IO_PGTABLE_H
> > > +
> > > +struct io_pgtable_ops {
> > > +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> > 
> > How about passing a struct io_pgtable * instead of the ops pointer ? This
> > would require returning a struct io_pgtable from the alloc function, which
> > I suppose you didn't want to do to ensure the caller will not touch the
> > struct io_pgtable fields directly. Do we really need to go that far, or
> > can we simply document struct io_pgtable as being private to the pg alloc
> > framework core and allocators ? Someone who really wants to get hold of
> > the io_pgtable instance could use container_of on the ops anyway, like
> > the allocators do.
> 
> Hmm, currently the struct io_pgtable is private to the page table allocator,
> so I don't like the IOMMU driver having an explicit handle to that.

I agree with this, but given that struct io_pgtable is defined in a header 
used by the IOMMU driver, and given that it directly embeds struct 
io_pgtable_ops, there's no big difference between the two structures.

> I also like having the value returned from alloc_io_pgtable_ops being used
> as the handle to pass around -- it keeps things simple for the caller
> because there's one structure that you get back and that's the thing you use
> as a reference.

I agree with that as well; my proposal was to return a struct io_pgtable from 
alloc_io_pgtable_ops.

> What do we gain by returning the struct io_pgtable pointer instead?

The ops structure could be made a const pointer. That's a pretty small 
optimization, granted.
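
Concretely, the alternative I'm suggesting would look something like this
(sketch only, based on the struct quoted below):

struct io_pgtable {
	enum io_pgtable_fmt		fmt;
	void				*cookie;
	struct io_pgtable_cfg		cfg;
	const struct io_pgtable_ops	*ops;	/* shared, read-only ops table */
};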

> > > +		   phys_addr_t paddr, size_t size, int prot);
> > > +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> > > +		     size_t size);
> > > +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> > > +				    unsigned long iova);
> > > +};
> > > +
> > > +struct iommu_gather_ops {
> > > +	/* Synchronously invalidate the entire TLB context */
> > > +	void (*tlb_flush_all)(void *cookie);
> > > +
> > > +	/* Queue up a TLB invalidation for a virtual address range */
> > > +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> > > +			      void *cookie);
> > 
> > Is there a limit to the number of entries that can be queued, or any other
> > kind of restriction ? Implementing a completely generic TLB flush queue
> > can become complex for IOMMU drivers.
> 
> I think it's only as complicated as you decide to make it. For example,
> you could just issue the TLBI directly in the add_flush callback (like I
> do for the arm-smmu driver), but then don't bother polling the hardware
> for completion until the sync callback.
> 
> > I would also document in which context(s) this callback will be called, as
> > IOMMU drivers might be tempted to allocate memory in order to implement a
> > TLB flush queue.
> 
> Good idea.
> 
> > > +	/* Ensure any queued TLB invalidation has taken effect */
> > > +	void (*tlb_sync)(void *cookie);
> > > +
> > > +	/* Ensure page tables updates are visible to the IOMMU */
> > > +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> > > +};
> > 
> > I suppose kerneldoc will come in the next version ;-)
> 
> Bah, ok then, if you insist!

I'm afraid I do :-)

> > > +struct io_pgtable_cfg {
> > > +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> > > +	unsigned long		pgsize_bitmap;
> > > +	unsigned int		ias;
> > > +	unsigned int		oas;
> > > +	struct iommu_gather_ops	*tlb;
> > > +
> > > +	/* Low-level data specific to the table format */
> > > +	union {
> > > +	};
> > > +};
> > > +
> > > +enum io_pgtable_fmt {
> > > +	IO_PGTABLE_NUM_FMTS,
> > > +};
> > > +
> > > +struct io_pgtable {
> > > +	enum io_pgtable_fmt	fmt;
> > > +	void			*cookie;
> > > +	struct io_pgtable_cfg	cfg;
> > > +	struct io_pgtable_ops	ops;
> > 
> > This could be turned into a const pointer if we pass struct io_pgtable
> > around instead of the ops.
> > 
> > > +};
> > > +
> > > +struct io_pgtable_init_fns {
> > > +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> > > +	void (*free)(struct io_pgtable *iop);
> > > +};
> > 
> > I would reorder structures into two groups, one clearly marked as private
> > that shouldn't be touched by IOMMU drivers, and then the io_pgtable_fmt
> > enum and the io_pgtable_cfg struct grouped with the two functions below.
> 
> Sure.
> 
> Thanks again for the review.

You're welcome.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-12-01 13:33               ` Laurent Pinchart
@ 2014-12-01 13:53                 ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-01 13:53 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 01, 2014 at 01:33:09PM +0000, Laurent Pinchart wrote:
> On Monday 01 December 2014 12:13:38 Will Deacon wrote:
> > On Sun, Nov 30, 2014 at 10:00:21PM +0000, Laurent Pinchart wrote:
> > > On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> > > > diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> > > > new file mode 100644
> > > > index 000000000000..82e39a0db94b
> > > > --- /dev/null
> > > > +++ b/drivers/iommu/io-pgtable.c
> > > > @@ -0,0 +1,71 @@
> > > > +/*
> > > > + * Generic page table allocator for IOMMUs.
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License version 2 as
> > > > + * published by the Free Software Foundation.
> > > > + *
> > > > + * This program is distributed in the hope that it will be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program; if not, write to the Free Software
> > > > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> 
> By the way you can remove this paragraph, we don't want to update all source 
> files the day the FSF decides to move to a new address.

Yeah, I missed that one (I fixed the lpae file already).

> > > > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > > > new file mode 100644
> > > > index 000000000000..5ae75d9cae50
> > > > --- /dev/null
> > > > +++ b/drivers/iommu/io-pgtable.h
> > > > @@ -0,0 +1,65 @@
> > > > +#ifndef __IO_PGTABLE_H
> > > > +#define __IO_PGTABLE_H
> > > > +
> > > > +struct io_pgtable_ops {
> > > > +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> > > 
> > > How about passing a struct io_pgtable * instead of the ops pointer ? This
> > > would require returning a struct io_pgtable from the alloc function, which
> > > I suppose you didn't want to do to ensure the caller will not touch the
> > > struct io_pgtable fields directly. Do we really need to go that far, or
> > > can we simply document struct io_pgtable as being private to the pg alloc
> > > framework core and allocators ? Someone who really wants to get hold of
> > > the io_pgtable instance could use container_of on the ops anyway, like
> > > the allocators do.
> > 
> > Hmm, currently the struct io_pgtable is private to the page table allocator,
> > so I don't like the IOMMU driver having an explicit handle to that.
> 
> I agree with this, but given that struct io_pgtable is defined in a header 
> used by the IOMMU driver, and given that it directly embeds struct 
> io_pgtable_ops, there's no big difference between the two structures.

Right, but you have to do an explicit container_of and, with the kerneldoc
added, it should be clear that it's not a good idea to mess with things
like the cookie or the cfg after you've allocated the page tables.
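
For reference, this is the kind of thing a driver author would have to write
deliberately to get at the private structure (illustrative only; the macro
mirrors what the allocators do internally with container_of):

#define io_pgtable_ops_to_pgtable(x)	container_of((x), struct io_pgtable, ops)

static void example_peek(struct io_pgtable_ops *ops)
{
	struct io_pgtable *iop = io_pgtable_ops_to_pgtable(ops);

	/* Nothing in here is intended to be touched by the IOMMU driver. */
	pr_debug("fmt %d, cookie %p\n", iop->fmt, iop->cookie);
}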

> > I also like having the value returned from alloc_io_pgtable_ops being used
> > as the handle to pass around -- it keeps things simple for the caller
> > because there's one structure that you get back and that's the thing you use
> > as a reference.
> 
> I agree with that as well, my proposal was to return a struct io_pgtable from 
> alloc_io_pgtable_ops.
> 
> > What do we gain by returning the struct io_pgtable pointer instead?
> 
> The ops structure could be made a const pointer. That's a pretty small 
> optimization, granted.

I still think I'd rather keep things like they are. Let's see how it looks
in v2, when I've reordered the structures and documented them.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-11-30 23:29         ` Laurent Pinchart
@ 2014-12-01 17:23           ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-01 17:23 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hello again,

> On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> > +                                           int prot)
> > +{
> > +     arm_lpae_iopte pte;
> > +
> > +     if (data->iop.fmt == ARM_LPAE_S1) {
> > +             pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> > +
> > +             if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> > +                     pte |= ARM_LPAE_PTE_AP_RDONLY;
> > +
> > +             if (prot & IOMMU_CACHE)
> > +                     pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > +                             << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> 
> In my case I'll need to manage the NS bit (here and when allocating tables in
> __arm_lpae_map). The exact requirements are not exactly clear at the moment
> I'm afraid, the datasheet doesn't clearly document secure behaviour, but tests
> showed that setting the NS was necessary.

Hurrah! You can add a quirk to the currently unused quirks field that I have
in io_pgtable_cfg :)
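
Something along these lines, for example (flag and bit names invented here
just to show the shape):

/* Sketch only: a quirk to set the NS bit in leaf entries. */
#define IO_PGTABLE_QUIRK_ARM_NS		(1 << 0)

static arm_lpae_iopte example_apply_ns_quirk(struct arm_lpae_io_pgtable *data,
					     arm_lpae_iopte pte)
{
	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_NS)
		pte |= ARM_LPAE_PTE_NS;	/* NS bit, assumed to be defined */

	return pte;
}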

> Given that arm_lpae_init_pte() will unconditionally set the AF and SH_IS bits
> you could set them here too, but that shouldn't make a big difference.

I prefer to keep only the protection bits handled here (i.e. those bits that
map directly to the IOMMU_* prot flags).

> > +     } else {
> > +             pte = ARM_LPAE_PTE_HAP_FAULT;
> > +             if (prot & IOMMU_READ)
> > +                     pte |= ARM_LPAE_PTE_HAP_READ;
> > +             if (prot & IOMMU_WRITE)
> > +                     pte |= ARM_LPAE_PTE_HAP_WRITE;
> > +             if (prot & IOMMU_CACHE)
> > +                     pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> > +             else
> > +                     pte |= ARM_LPAE_PTE_MEMATTR_NC;
> > +     }
> > +
> > +     if (prot & IOMMU_NOEXEC)
> > +             pte |= ARM_LPAE_PTE_XN;
> > +
> > +     return pte;
> > +}
> > +
> > +static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
> > +                     phys_addr_t paddr, size_t size, int iommu_prot)
> > +{
> > +     struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > +     arm_lpae_iopte *ptep = data->pgd;
> > +     int lvl = ARM_LPAE_START_LVL(data);
> > +     arm_lpae_iopte prot;
> > +
> > +     /* If no access, then nothing to do */
> > +     if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> > +             return 0;
> 
> Shouldn't this create a faulting entry instead ?

That's effectively what it does. Calling iommu_map on something that's
already mapped is a programming error, so we know that the entry
is already faulting by virtue of it being unmapped.

> > +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
> > +                                                    void *cookie)
> > +{
> > +     u64 reg;
> > +     struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> > +
> > +     if (!data)
> > +             return NULL;
> > +
> > +     /* TCR */
> > +     reg = ARM_LPAE_TCR_EAE |
> > +          (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> > +
> > +     switch (1 << data->pg_shift) {
> > +     case SZ_4K:
> > +             reg |= ARM_LPAE_TCR_TG0_4K;
> > +             break;
> > +     case SZ_16K:
> > +             reg |= ARM_LPAE_TCR_TG0_16K;
> > +             break;
> > +     case SZ_64K:
> > +             reg |= ARM_LPAE_TCR_TG0_64K;
> > +             break;
> > +     }
> > +
> > +     switch (cfg->oas) {
> > +     case 32:
> > +             reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 36:
> > +             reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 40:
> > +             reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 42:
> > +             reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 44:
> > +             reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 48:
> > +             reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     default:
> > +             goto out_free_data;
> > +     }
> > +
> > +     reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> > +     cfg->arm_lpae_s1_cfg.tcr = reg;
> > +
> > +     /* MAIRs */
> > +     reg = (ARM_LPAE_MAIR_ATTR_NC
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
> > +           (ARM_LPAE_MAIR_ATTR_WBRWA
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
> > +           (ARM_LPAE_MAIR_ATTR_DEVICE
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
> > +
> > +     cfg->arm_lpae_s1_cfg.mair[0] = reg;
> > +     cfg->arm_lpae_s1_cfg.mair[1] = 0;
> > +
> > +     /* Looking good; allocate a pgd */
> > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > +                                   GFP_KERNEL | __GFP_ZERO);
> 
> data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL << data->pg_shift
> will thus be equal to the smallest page size supported by the IOMMU. This will
> thus allocate 4kB, 16kB or 64kB depending on the IOMMU configuration. However,
> if I'm not mistaken the top-level directory needs to store one entry per
> largest supported page size. That's 4, 128 or 8 entries depending on the
> configuration. You're thus over-allocating.

Yeah, I'll take a closer look at this. The size of the pgd really depends
on the TxSZ configuration, which in turn depends on the ias and the page
size. There are also alignment constraints to bear in mind, but I'll see
what I can do (as it stands, over-allocating will always work).

Thanks,

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-01 17:23           ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-01 17:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hello again,

> On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable
> > *data, +                                         int prot)
> > +{
> > +     arm_lpae_iopte pte;
> > +
> > +     if (data->iop.fmt == ARM_LPAE_S1) {
> > +             pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> > +
> > +             if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> > +                     pte |= ARM_LPAE_PTE_AP_RDONLY;
> > +
> > +             if (prot & IOMMU_CACHE)
> > +                     pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > +                             << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> 
> In my case I'll need to manage the NS bit (here and when allocating tables in
> __arm_lpae_map). The exact requirements are not exactly clear at the moment
> I'm afraid, the datasheet doesn't clearly document secure behaviour, but tests
> showed that setting the NS was necessary.

Hurrah! You can add a quick to the currently unused quirks field that I have
in io_pgtable_cfg :)

> Given that arm_lpae_init_pte() will unconditionally set the AF and SH_IS bits
> you could set them here too, but that shouldn't make a big difference.

I prefer to keep only the protection bits handled here (i.e. those bits that
map directly to the IOMMU_* prot flags).

> > +     } else {
> > +             pte = ARM_LPAE_PTE_HAP_FAULT;
> > +             if (prot & IOMMU_READ)
> > +                     pte |= ARM_LPAE_PTE_HAP_READ;
> > +             if (prot & IOMMU_WRITE)
> > +                     pte |= ARM_LPAE_PTE_HAP_WRITE;
> > +             if (prot & IOMMU_CACHE)
> > +                     pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> > +             else
> > +                     pte |= ARM_LPAE_PTE_MEMATTR_NC;
> > +     }
> > +
> > +     if (prot & IOMMU_NOEXEC)
> > +             pte |= ARM_LPAE_PTE_XN;
> > +
> > +     return pte;
> > +}
> > +
> > +static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
> > +                     phys_addr_t paddr, size_t size, int iommu_prot)
> > +{
> > +     struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > +     arm_lpae_iopte *ptep = data->pgd;
> > +     int lvl = ARM_LPAE_START_LVL(data);
> > +     arm_lpae_iopte prot;
> > +
> > +     /* If no access, then nothing to do */
> > +     if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> > +             return 0;
> 
> Shouldn't this create a faulting entry instead ?

That's effectively what it does. Calling iommu_map on something that's
already mapped is a programming error, so therefore we know that the entry
is already faulting by virtue of it being unmapped.

> > +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg
> > *cfg, +                                                   void *cookie)
> > +{
> > +     u64 reg;
> > +     struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> > +
> > +     if (!data)
> > +             return NULL;
> > +
> > +     /* TCR */
> > +     reg = ARM_LPAE_TCR_EAE |
> > +          (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> > +
> > +     switch (1 << data->pg_shift) {
> > +     case SZ_4K:
> > +             reg |= ARM_LPAE_TCR_TG0_4K;
> > +             break;
> > +     case SZ_16K:
> > +             reg |= ARM_LPAE_TCR_TG0_16K;
> > +             break;
> > +     case SZ_64K:
> > +             reg |= ARM_LPAE_TCR_TG0_64K;
> > +             break;
> > +     }
> > +
> > +     switch (cfg->oas) {
> > +     case 32:
> > +             reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 36:
> > +             reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 40:
> > +             reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 42:
> > +             reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 44:
> > +             reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     case 48:
> > +             reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > +             break;
> > +     default:
> > +             goto out_free_data;
> > +     }
> > +
> > +     reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> > +     cfg->arm_lpae_s1_cfg.tcr = reg;
> > +
> > +     /* MAIRs */
> > +     reg = (ARM_LPAE_MAIR_ATTR_NC
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
> > +           (ARM_LPAE_MAIR_ATTR_WBRWA
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
> > +           (ARM_LPAE_MAIR_ATTR_DEVICE
> > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
> > +
> > +     cfg->arm_lpae_s1_cfg.mair[0] = reg;
> > +     cfg->arm_lpae_s1_cfg.mair[1] = 0;
> > +
> > +     /* Looking good; allocate a pgd */
> > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > +                                   GFP_KERNEL | __GFP_ZERO);
> 
> data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL << data->pg_shift
> will thus be equal to the smallest page size supported by the IOMMU. This will
> thus allocate 4kB, 16kB or 64kB depending on the IOMMU configuration. However,
> if I'm not mistaken the top-level directory needs to store one entry per
> largest supported page size. That's 4, 128 or 8 entries depending on the
> configuration. You're thus over-allocating.

Yeah, I'll take a closer look at this. The size of the pgd really depends
on the TxSZ configuration, which in turn depends on the ias and the page
size. There are also alignment constraints to bear in mind, but I'll see
what I can do (as it stands, over-allocating will always work).
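
For reference, one way the pgd size could be derived from the ias, the number
of levels and the page shift; this is only a sketch using the structure field
names quoted in this thread, not code from the series:

	/* Bits resolved by the pgd = input-address bits left over once the
	 * page offset and all of the full lower levels are accounted for.
	 * sizeof(arm_lpae_iopte) is 8, so each level resolves
	 * (pg_shift - 3) bits. */
	int va_bits = cfg->ias - data->pg_shift;
	int bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
	int pgd_bits = va_bits - (data->levels - 1) * bits_per_level;
	size_t pgd_size = sizeof(arm_lpae_iopte) << pgd_bits;

	/* e.g. for a 32-bit ias: 4k granule -> 4 entries (32 bytes),
	 * 16k -> 128 entries (1KB), 64k -> 8 entries (64 bytes). */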

Thanks,

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-01 17:23           ` Will Deacon
@ 2014-12-01 20:21               ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-01 20:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Monday 01 December 2014 17:23:15 Will Deacon wrote:
> On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> > Hi Will,
> 
> Hello again,
> 
> > On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > > +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable
> > > *data, +                                         int prot)
> > > +{
> > > +     arm_lpae_iopte pte;
> > > +
> > > +     if (data->iop.fmt == ARM_LPAE_S1) {
> > > +             pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> > > +
> > > +             if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> > > +                     pte |= ARM_LPAE_PTE_AP_RDONLY;
> > > +
> > > +             if (prot & IOMMU_CACHE)
> > > +                     pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > > +                             << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > 
> > In my case I'll need to manage the NS bit (here and when allocating tables
> > in __arm_lpae_map). The exact requirements are not entirely clear at the
> > moment, I'm afraid: the datasheet doesn't clearly document secure
> > behaviour, but tests showed that setting the NS bit was necessary.
> 
> Hurrah! You can add a quirk to the currently unused quirks field that I have
> in io_pgtable_cfg :)

:-)

> > Given that arm_lpae_init_pte() will unconditionally set the AF and SH_IS
> > bits you could set them here too, but that shouldn't make a big
> > difference.
>
> I prefer to keep only the protection bits handled here (i.e. those bits that
> map directly to the IOMMU_* prot flags).

I thought so. That's fine with me.

> > > +     } else {
> > > +             pte = ARM_LPAE_PTE_HAP_FAULT;
> > > +             if (prot & IOMMU_READ)
> > > +                     pte |= ARM_LPAE_PTE_HAP_READ;
> > > +             if (prot & IOMMU_WRITE)
> > > +                     pte |= ARM_LPAE_PTE_HAP_WRITE;
> > > +             if (prot & IOMMU_CACHE)
> > > +                     pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> > > +             else
> > > +                     pte |= ARM_LPAE_PTE_MEMATTR_NC;
> > > +     }
> > > +
> > > +     if (prot & IOMMU_NOEXEC)
> > > +             pte |= ARM_LPAE_PTE_XN;
> > > +
> > > +     return pte;
> > > +}
> > > +
> > > +static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
> > > +                     phys_addr_t paddr, size_t size, int iommu_prot)
> > > +{
> > > +     struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > > +     arm_lpae_iopte *ptep = data->pgd;
> > > +     int lvl = ARM_LPAE_START_LVL(data);
> > > +     arm_lpae_iopte prot;
> > > +
> > > +     /* If no access, then nothing to do */
> > > +     if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> > > +             return 0;
> > 
> > Shouldn't this create a faulting entry instead?
> 
> That's effectively what it does. Calling iommu_map on something that's
> already mapped is a programming error, so we know that the entry is
> already faulting by virtue of it being unmapped.

Indeed.

> > > +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct
> > > io_pgtable_cfg
> > > *cfg, +                                                   void *cookie)
> > > +{
> > > +     u64 reg;
> > > +     struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> > > +
> > > +     if (!data)
> > > +             return NULL;
> > > +
> > > +     /* TCR */
> > > +     reg = ARM_LPAE_TCR_EAE |
> > > +          (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> > > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> > > +          (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> > > +
> > > +     switch (1 << data->pg_shift) {
> > > +     case SZ_4K:
> > > +             reg |= ARM_LPAE_TCR_TG0_4K;
> > > +             break;
> > > +     case SZ_16K:
> > > +             reg |= ARM_LPAE_TCR_TG0_16K;
> > > +             break;
> > > +     case SZ_64K:
> > > +             reg |= ARM_LPAE_TCR_TG0_64K;
> > > +             break;
> > > +     }
> > > +
> > > +     switch (cfg->oas) {
> > > +     case 32:
> > > +             reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     case 36:
> > > +             reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     case 40:
> > > +             reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     case 42:
> > > +             reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     case 44:
> > > +             reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     case 48:
> > > +             reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
> > > +             break;
> > > +     default:
> > > +             goto out_free_data;
> > > +     }
> > > +
> > > +     reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
> > > +     cfg->arm_lpae_s1_cfg.tcr = reg;
> > > +
> > > +     /* MAIRs */
> > > +     reg = (ARM_LPAE_MAIR_ATTR_NC
> > > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
> > > +           (ARM_LPAE_MAIR_ATTR_WBRWA
> > > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE))
> > > |
> > > +           (ARM_LPAE_MAIR_ATTR_DEVICE
> > > +            << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
> > > +
> > > +     cfg->arm_lpae_s1_cfg.mair[0] = reg;
> > > +     cfg->arm_lpae_s1_cfg.mair[1] = 0;
> > > +
> > > +     /* Looking good; allocate a pgd */
> > > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > > +                                   GFP_KERNEL | __GFP_ZERO);
> > 
> > data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL <<
> > data->pg_shift will thus be equal to the smallest page size supported by
> > the IOMMU. This will thus allocate 4kB, 16kB or 64kB depending on the
> > IOMMU configuration. However, if I'm not mistaken the top-level directory
> > needs to store one entry per largest supported page size. That's 4, 128
> > or 8 entries depending on the configuration. You're thus over-allocating.
> 
> Yeah, I'll take a closer look at this. The size of the pgd really depends
> on the TxSZ configuration, which in turn depends on the ias and the page
> size. There are also alignment constraints to bear in mind, but I'll see
> what I can do (as it stands, over-allocating will always work).

Besides wasting memory, the code also doesn't reflect the requirements. It
works by chance, meaning it could break later. That's why I'd like to see this
fixed. Can't the size be computed with something like

	size = (1 << (ias - data->levels * data->pg_shift))
	     * sizeof(arm_lpae_iopte);

(please add a proper detailed comment to explain the computation, as the 
meaning is not straightforward)

There might be some corner cases I'm not aware of.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-01 20:21               ` Laurent Pinchart
@ 2014-12-02  9:41                 ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-02  9:41 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 01, 2014 at 08:21:58PM +0000, Laurent Pinchart wrote:
> On Monday 01 December 2014 17:23:15 Will Deacon wrote:
> > On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> > > On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > > > +     /* Looking good; allocate a pgd */
> > > > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > > > +                                   GFP_KERNEL | __GFP_ZERO);
> > > 
> > > data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL <<
> > > data->pg_shift will thus be equal to the smallest page size supported by
> > > the IOMMU. This will thus allocate 4kB, 16kB or 64kB depending on the
> > > IOMMU configuration. However, if I'm not mistaken the top-level directory
> > > needs to store one entry per largest supported page size. That's 4, 128
> > > or 8 entries depending on the configuration. You're thus over-allocating.
> > 
> > Yeah, I'll take a closer look at this. The size of the pgd really depends
> > on the TxSZ configuration, which in turn depends on the ias and the page
> > size. There are also alignment constraints to bear in mind, but I'll see
> > what I can do (as it stands, over-allocating will always work).
> 
> Besides wasting memory, the code also doesn't reflect the requirements. It
> works by chance, meaning it could break later.

It won't break, as the maximum size *is* bounded by a page for stage-1
and we already handle stage-2 concatenation correctly.

> That's why I'd like to see this 
> being fixed. Can't the size be computed with something like
> 
> 	size = (1 << (ias - data->levels * data->pg_shift))
> 	     * sizeof(arm_lpae_iopte);
> 
> (please add a proper detailed comment to explain the computation, as the 
> meaning is not straightforward)

That's actually the easy part. The harder part is getting the correct
alignment, which means managing my own kmem_cache on a per-ops basis. That
feels like overkill to me and we also need to make sure that we don't screw
up the case of concatenated pgds at stage-2. On top of that, since each
cache would be per-ops, I'm not even sure we'd save anything (the slab
allocators all operate on pages afaict).

If I use alloc_pages_exact, we'll still have some wastage, but it would
be less for the case where the CPU page size is smaller than the SMMU page
size. Do you think that's worth the extra complexity? We allocate full pages
at all levels after the pgd, so the wastage is relatively small.

An alternative would be preinitialising some caches for `likely' pgd sizes,
but that's also horrible, especially if the kernel decides that it doesn't
need a bunch of the configurations at runtime.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-02  9:41                 ` Will Deacon
@ 2014-12-02 11:47                     ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-02 11:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Tuesday 02 December 2014 09:41:56 Will Deacon wrote:
> On Mon, Dec 01, 2014 at 08:21:58PM +0000, Laurent Pinchart wrote:
> > On Monday 01 December 2014 17:23:15 Will Deacon wrote:
> > > On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> > > > On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > > > > +     /* Looking good; allocate a pgd */
> > > > > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > > > > +                                   GFP_KERNEL | __GFP_ZERO);
> > > > 
> > > > data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL <<
> > > > data->pg_shift will thus be equal to the smallest page size supported
> > > > by the IOMMU. This will thus allocate 4kB, 16kB or 64kB depending on
> > > > the IOMMU configuration. However, if I'm not mistaken the top-level
> > > > directory needs to store one entry per largest supported page size.
> > > > That's 4, 128 or 8 entries depending on the configuration. You're thus
> > > > over-allocating.
> > > 
> > > Yeah, I'll take a closer look at this. The size of the pgd really
> > > depends on the TxSZ configuration, which in turn depends on the ias and
> > > the page size. There are also alignment constraints to bear in mind, but
> > > I'll see what I can do (as it stands, over-allocating will always work).
> > 
> > Besides wasting memory, the code also doesn't reflect the requirements. It
> > works by chance, meaning it could break later.
> 
> It won't break, as the maximum size *is* bounded by a page for stage-1
> and we already handle stage-2 concatenation correctly.

What I mean is that there's no correlation between the required size and the
allocated size in the current code. It happens to work, but if the driver gets
extended later to support more IOMMU configurations, subtle bugs may crop up.

> > That's why I'd like to see this
> > being fixed. Can't the size be computed with something like
> > 
> > 	size = (1 << (ias - data->levels * data->pg_shift))
> > 	     * sizeof(arm_lpae_iopte);
> > 
> > (please add a proper detailed comment to explain the computation, as the
> > meaning is not straightforward)
> 
> That's actually the easy part. The harder part is getting the correct
> alignment, which means managing my own kmem_cache on a per-ops basis. That
> feels like overkill to me and we also need to make sure that we don't screw
> up the case of concatenated pgds at stage-2. On top of that, since each
> cache would be per-ops, I'm not even sure we'd save anything (the slab
> allocators all operate on pages afaict).
> 
> If I use alloc_pages_exact, we'll still have some wastage, but it would
> be less for the case where the CPU page size is smaller than the SMMU page
> size. Do you think that's worth the extra complexity? We allocate full pages
> at all levels after the pgd, so the wastage is relatively small.
> 
> An alternative would be preinitialising some caches for `likely' pgd sizes,
> but that's also horrible, especially if the kernel decides that it doesn't
> need a bunch of the configurations at runtime.

How about just computing the right size, aligning it to a page size, and using
alloc_pages_exact? The waste is small, so it doesn't justify anything more
complex than that.
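
Concretely, under that suggestion the allocation could look something like the
sketch below (not the posted code; pgd_size stands for the exactly computed
pgd size discussed earlier in the thread):

	/* Round the exact pgd size up to a CPU page for alloc_pages_exact();
	 * the returned memory is page-aligned, which also satisfies the LPAE
	 * alignment rule as long as the stage-1 pgd fits in one page. */
	data->pgd = alloc_pages_exact(PAGE_ALIGN(pgd_size),
				      GFP_KERNEL | __GFP_ZERO);
	if (!data->pgd)
		goto out_free_data;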

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-01 12:05       ` Will Deacon
@ 2014-12-02 13:47           ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-02 13:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Monday 01 December 2014 12:05:34 Will Deacon wrote:
> On Sun, Nov 30, 2014 at 10:03:08PM +0000, Laurent Pinchart wrote:
> > On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > > Hi all,
> > > 
> > > This series introduces a generic IOMMU page table allocation framework,
> > > implements support for ARM long-descriptors and then ports the arm-smmu
> > > driver over to the new code.
> > > 
> > > There are a few reasons for doing this:
> > >   - Page table code is hard, and I don't enjoy shopping
> > >
> > >   - A number of IOMMUs actually use the same table format, but currently
> > >     duplicate the code
> > >
> > >   - It provides a CPU (and architecture) independent allocator, which
> > >     may be useful for some systems where the CPU is using a different
> > >     table format for its own mappings
> > >
> > > As illustrated in the final patch, an IOMMU driver interacts with the
> > > allocator by passing in a configuration structure describing the
> > > input and output address ranges, the supported pages sizes and a set of
> > > ops for performing various TLB invalidation and PTE flushing routines.
> > > 
> > > The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> > > mappings, but I decided not to implement the contiguous bit in the
> > > interest of trying to keep the code semi-readable. This could always be
> > > added later, if needed.
> > 
> > Do you have any idea how much the contiguous bit can improve performances
> > in real use cases ?
> 
> It depends on the TLB, really. Given that the contiguous sizes map directly
> onto block sizes using different granules, I didn't see that the complexity
> was worth it.
> 
> For example:
> 
>    4k granule : 16 contiguous entries => {64k, 32M, 16G}
>   16k granule : 128 contiguous lvl3 entries => 2M
>                 32 contiguous lvl2 entries => 1G
>   64k granule : 32 contiguous entries => {2M, 16G}
> 
> If we use block mappings, then we get:
> 
>    4k granule : 2M @ lvl2, 1G @ lvl1
>   16k granule : 32M @ lvl2
>   64k granule : 512M @ lvl2
> 
> so really, we only miss the ability to create 16G mappings.

In the general case maybe, but as far as I know my IOMMU only supports a 4kB
granule. Without support for the contiguous bit I lose the ability to create
64kB mappings, which I believe (but haven't tested yet) will have a noticeable
impact.

> I doubt that hardware even implements that size in the TLB (the contiguous
> bit is only a hint).
>
> On top of that, the contiguous bit leads to additional expense on unmap,
> since you have extra TLB invalidation splitting the thing into non-
> contiguous pages before you can do anything.

That will only be required when doing partial unmaps, which shouldn't be that 
frequent. When unmapping a 64kB block there's no need to split the mapping 
beforehand.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-02 13:47           ` Laurent Pinchart
@ 2014-12-02 13:53             ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-02 13:53 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Dec 02, 2014 at 01:47:41PM +0000, Laurent Pinchart wrote:
> On Monday 01 December 2014 12:05:34 Will Deacon wrote:
> > On Sun, Nov 30, 2014 at 10:03:08PM +0000, Laurent Pinchart wrote:
> > > On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > > > The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> > > > mappings, but I decided not to implement the contiguous bit in the
> > > > interest of trying to keep the code semi-readable. This could always be
> > > > added later, if needed.
> > > 
> > > Do you have any idea how much the contiguous bit can improve performances
> > > in real use cases ?
> > 
> > It depends on the TLB, really. Given that the contiguous sizes map directly
> > onto block sizes using different granules, I didn't see that the complexity
> > was worth it.
> > 
> > For example:
> > 
> >    4k granule : 16 contiguous entries => {64k, 32M, 16G}
> >   16k granule : 128 contiguous lvl3 entries => 2M
> >                 32 contiguous lvl2 entries => 1G
> >   64k granule : 32 contiguous entries => {2M, 16G}
> > 
> > If we use block mappings, then we get:
> > 
> >    4k granule : 2M @ lvl2, 1G @ lvl1
> >   16k granule : 32M @ lvl2
> >   64k granule : 512M @ lvl2
> > 
> > so really, we only miss the ability to create 16G mappings.
> 
> In the general case maybe, but as far as I know my IOMMU only supports a 4kB
> granule. Without support for the contiguous bit I lose the ability to create
> 64kB mappings, which I believe (but haven't tested yet) will have a noticeable
> impact.

It would be good if you could confirm that. I'd have thought that you'd end
up using 2MB mappings most of the time for DMA buffers.

> > I doubt that hardware even implements that size in the TLB (the contiguous
> > bit is only a hint).
> >
> > On top of that, the contiguous bit leads to additional expense on unmap,
> > since you have extra TLB invalidation splitting the thing into non-
> > contiguous pages before you can do anything.
> 
> That will only be required when doing partial unmaps, which shouldn't be that 
> frequent. When unmapping a 64kB block there's no need to split the mapping 
> beforehand.

Sure. I'm not against having support for the contiguous bit; I just don't
plan to implement it myself :)

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-02 13:53             ` Will Deacon
@ 2014-12-02 22:29                 ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-02 22:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Tuesday 02 December 2014 13:53:56 Will Deacon wrote:
> On Tue, Dec 02, 2014 at 01:47:41PM +0000, Laurent Pinchart wrote:
> > On Monday 01 December 2014 12:05:34 Will Deacon wrote:
> >> On Sun, Nov 30, 2014 at 10:03:08PM +0000, Laurent Pinchart wrote:
> >>> On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> >>>> The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> >>>> mappings, but I decided not to implement the contiguous bit in the
> >>>> interest of trying to keep the code semi-readable. This could always
> >>>> be added later, if needed.
> >>> 
> >>> Do you have any idea how much the contiguous bit can improve
> >>> performances in real use cases ?
> >> 
> >> It depends on the TLB, really. Given that the contiguous sizes map
> >> directly onto block sizes using different granules, I didn't see that
> >> the complexity was worth it.
> >> 
> >> For example:
> >>    4k granule : 16 contiguous entries => {64k, 32M, 16G}
> >>   16k granule : 128 contiguous lvl3 entries => 2M
> >>                 32 contiguous lvl2 entries => 1G
> >>   64k granule : 32 contiguous entries => {2M, 16G}
> >> 
> >> If we use block mappings, then we get:
> >>    4k granule : 2M @ lvl2, 1G @ lvl1
> >>   16k granule : 32M @ lvl2
> >>   64k granule : 512M @ lvl2
> >> 
> >> so really, we only miss the ability to create 16G mappings.
> >
> > In the general case maybe, but as far as I know my IOMMU only supports a 4kB
> > granule. Without support for the contiguous bit I lose the ability to
> > create 64kB mappings, which I believe (but haven't tested yet) will have a
> > noticeable impact.
> 
> It would be good if you could confirm that. I'd have thought that you'd end
> up using 2MB mappings most of the time for DMA buffers.

I'll try to gather statistics as soon as I can get TLB flushing working
reliably. Without it, turning the IOMMU on kills the system pretty fast :-)

> >> I doubt that hardware even implements that size in the TLB (the
> >> contiguous bit is only a hint).
> >> 
> >> On top of that, the contiguous bit leads to additional expense on unmap,
> >> since you have extra TLB invalidation splitting the thing into non-
> >> contiguous pages before you can do anything.
> > 
> > That will only be required when doing partial unmaps, which shouldn't be
> > that frequent. When unmapping a 64kB block there's no need to split the
> > mapping beforehand.
> 
> Sure. I'm not against having support for the contiguous bit; I just don't
> plan to implement it myself :)

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-11-27 11:51     ` Will Deacon
@ 2014-12-02 22:41         ` Mitchel Humpherys
  -1 siblings, 0 replies; 76+ messages in thread
From: Mitchel Humpherys @ 2014-12-02 22:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, Nov 27 2014 at 03:51:16 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> A number of IOMMUs found in ARM SoCs can walk architecture-compatible
> page tables.
>
> This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
> long-descriptor page tables. 4k, 16k and 64k pages are supported, with
> up to 4-levels of walk to cover a 48-bit address space.
>
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---

[...]

> +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
> +						    void *cookie)
> +{
> +	u64 reg;
> +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +	if (!data)
> +		return NULL;
> +
> +	/* TCR */
> +	reg = ARM_LPAE_TCR_EAE |
> +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);

TCR has different definitions depending on whether we're using v7l or
v8l.  For example, bit 31 is TG1[1] (not EAE) when CBA2R.VA64=1.  Are we
expecting to have an io-pgtable-arm64.c or something?  Seems like that
would be mostly redundant with this file...  (We have this problem in
the current arm-smmu driver today).


-Mitch

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-02 22:41         ` Mitchel Humpherys
@ 2014-12-03 11:11             ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-03 11:11 UTC (permalink / raw)
  To: Mitchel Humpherys
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Dec 02, 2014 at 10:41:52PM +0000, Mitchel Humpherys wrote:
> On Thu, Nov 27 2014 at 03:51:16 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> > A number of IOMMUs found in ARM SoCs can walk architecture-compatible
> > page tables.
> >
> > This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8
> > long-descriptor page tables. 4k, 16k and 64k pages are supported, with
> > up to 4-levels of walk to cover a 48-bit address space.
> >
> > Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> > ---
> 
> [...]
> 
> > +static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
> > +						    void *cookie)
> > +{
> > +	u64 reg;
> > +	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> > +
> > +	if (!data)
> > +		return NULL;
> > +
> > +	/* TCR */
> > +	reg = ARM_LPAE_TCR_EAE |
> > +	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
> > +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
> > +	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
> 
> TCR has different definitions depending on whether we're using v7l or
> v8l.  For example, bit 31 is TG1[1] (not EAE) when CBA2R.VA64=1.  Are we
> expecting to have an io-pgtable-arm64.c or something?  Seems like that
> would be mostly redundant with this file...  (We have this problem in
> the current arm-smmu driver today).

I guess we can add an ARM_LPAE_V7_S1 format or something for this, although
we don't actually use TTBR1, so the current code shouldn't cause any issues.
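
As a sketch of the idea, the stage-1 TCR setup could be keyed off a separate
format identifier; ARM_LPAE_V7_S1 below is hypothetical (only ARM_LPAE_S1 and
ARM_LPAE_S2 exist in the posted series), and only the v7 TTBCR layout would
set EAE:

	reg = (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
	      (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
	      (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);

	/* Bit 31 is EAE in the v7 TTBCR encoding but TG1[1] when
	 * CBA2R.VA64 == 1, so only set it for the hypothetical v7 format. */
	if (data->iop.fmt == ARM_LPAE_V7_S1)
		reg |= ARM_LPAE_TCR_EAE;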

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-11-27 11:51     ` Will Deacon
@ 2014-12-05 10:55         ` Varun Sethi
  -1 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-05 10:55 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin.Murphy-5wv7dgnIgG8

Hi Will,
Please find my comments inline. Search for "varun"

-----Original Message-----
From: Will Deacon [mailto:will.deacon-5wv7dgnIgG8@public.gmane.org] 
Sent: Thursday, November 27, 2014 5:21 PM
To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Cc: prem.mallappa-dY08KVG/lbpWk0Htik3J/w@public.gmane.org; Robin.Murphy-5wv7dgnIgG8@public.gmane.org; lauraa-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; mitchelh-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org; joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; Sethi Varun-B16395; m.szyprowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org; Will Deacon
Subject: [PATCH 2/4] iommu: add ARM LPAE page table allocator

A number of IOMMUs found in ARM SoCs can walk architecture-compatible page tables.

This patch adds a generic allocator for Stage-1 and Stage-2 v7/v8 long-descriptor page tables. 4k, 16k and 64k pages are supported, with up to 4-levels of walk to cover a 48-bit address space.

Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
---
 MAINTAINERS                    |   1 +
 drivers/iommu/Kconfig          |   9 +
 drivers/iommu/Makefile         |   1 +
 drivers/iommu/io-pgtable-arm.c | 735 +++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable.c     |   7 +
 drivers/iommu/io-pgtable.h     |  12 +
 6 files changed, 765 insertions(+)
 create mode 100644 drivers/iommu/io-pgtable-arm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0ff630de8a6d..d3ca31b7c960 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1562,6 +1562,7 @@ M:	Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
 L:	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/iommu/arm-smmu.c
+F:	drivers/iommu/io-pgtable-arm.c
 
 ARM64 PORT (AARCH64 ARCHITECTURE)
 M:	Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0f10554e7114..e1742a0146f8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -19,6 +19,15 @@ menu "Generic IOMMU Pagetable Support"
 config IOMMU_IO_PGTABLE
 	bool
 
+config IOMMU_IO_PGTABLE_LPAE
+	bool "ARMv7/v8 Long Descriptor Format"
+	select IOMMU_IO_PGTABLE
+	help
+	  Enable support for the ARM long descriptor pagetable format.
+	  This allocator supports 4K/2M/1G, 16K/32M and 64K/512M page
+	  sizes at both stage-1 and stage-2, as well as address spaces
+	  up to 48-bits in size.
+
 endmenu
 
 config OF_IOMMU
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index aff244c78181..269cdd82b672 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o msm_iommu_dev.o
 obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
new file mode 100644
index 000000000000..9dbaa2e48424
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -0,0 +1,735 @@
+/*
+ * CPU-agnostic ARM page table allocator.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (C) 2014 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
+ */
+
+#define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
+
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "io-pgtable.h"
+
+#define ARM_LPAE_MAX_ADDR_BITS		48
+#define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
+#define ARM_LPAE_MAX_LEVELS		4
+
+/* Struct accessors */
+#define io_pgtable_to_data(x)						\
+	container_of((x), struct arm_lpae_io_pgtable, iop)
+
+#define io_pgtable_ops_to_pgtable(x)					\
+	container_of((x), struct io_pgtable, ops)
+
+#define io_pgtable_ops_to_data(x)					\
+	io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
+
+/*
+ * For consistency with the architecture, we always consider
+ * ARM_LPAE_MAX_LEVELS levels, with the walk starting at level n >= 0
+ */
+#define ARM_LPAE_START_LVL(d)	(ARM_LPAE_MAX_LEVELS - (d)->levels)
+
+/*
+ * Calculate the right shift amount to get to the portion describing level l
+ * in a virtual address mapped by the pagetable in d.
+ */
+#define ARM_LPAE_LVL_SHIFT(l,d)						\
+	((((d)->levels - ((l) - ARM_LPAE_START_LVL(d) + 1))		\
+	  * (d)->bits_per_level) + (d)->pg_shift)
+
+/*
+ * Calculate the index at level l used to map virtual address a using the
+ * pagetable in d.
+ */
+#define ARM_LPAE_PGD_IDX(l,d)						\
+	((l) == ARM_LPAE_START_LVL(d) ? ilog2((d)->pages_per_pgd) : 0)
+
+#define ARM_LPAE_LVL_IDX(a,l,d)						\
+	(((a) >> ARM_LPAE_LVL_SHIFT(l,d)) &				\
+	 ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
+
+/* Calculate the block/page mapping size at level l for pagetable in d. */
+#define ARM_LPAE_BLOCK_SIZE(l,d)					\
+	(1 << (ilog2(sizeof(arm_lpae_iopte)) +				\
+		((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level)))
+
+/* Page table bits */
+#define ARM_LPAE_PTE_TYPE_SHIFT		0
+#define ARM_LPAE_PTE_TYPE_MASK		0x3
+
+#define ARM_LPAE_PTE_TYPE_BLOCK		1
+#define ARM_LPAE_PTE_TYPE_TABLE		3
+#define ARM_LPAE_PTE_TYPE_PAGE		3
+
+#define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
+#define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
+#define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
+#define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
+
+#define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
+/* Ignore the contiguous bit for block splitting */
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
+					 ARM_LPAE_PTE_ATTR_HI_MASK)
+
+/* Stage-1 PTE */
+#define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
+#define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
+
+/* Stage-2 PTE */
+#define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
+#define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
+#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
+#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
+
+/* Register bits */
+#define ARM_LPAE_TCR_EAE		(1 << 31)
+
+#define ARM_LPAE_TCR_TG0_4K		(0 << 14)
+#define ARM_LPAE_TCR_TG0_64K		(1 << 14)
+#define ARM_LPAE_TCR_TG0_16K		(2 << 14)
+
+#define ARM_LPAE_TCR_SH0_SHIFT		12
+#define ARM_LPAE_TCR_SH0_MASK		0x3
+#define ARM_LPAE_TCR_SH_NS		0
+#define ARM_LPAE_TCR_SH_OS		2
+#define ARM_LPAE_TCR_SH_IS		3
+
+#define ARM_LPAE_TCR_ORGN0_SHIFT	10
+#define ARM_LPAE_TCR_IRGN0_SHIFT	8
+#define ARM_LPAE_TCR_RGN_MASK		0x3
+#define ARM_LPAE_TCR_RGN_NC		0
+#define ARM_LPAE_TCR_RGN_WBWA		1
+#define ARM_LPAE_TCR_RGN_WT		2
+#define ARM_LPAE_TCR_RGN_WB		3
+
+#define ARM_LPAE_TCR_SL0_SHIFT		6
+#define ARM_LPAE_TCR_SL0_MASK		0x3
+
+#define ARM_LPAE_TCR_T0SZ_SHIFT		0
+#define ARM_LPAE_TCR_SZ_MASK		0xf
+
+#define ARM_LPAE_TCR_PS_SHIFT		16
+#define ARM_LPAE_TCR_PS_MASK		0x7
+
+#define ARM_LPAE_TCR_IPS_SHIFT		32
+#define ARM_LPAE_TCR_IPS_MASK		0x7
+
+#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
+#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
+#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
+#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
+#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
+#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
+
+#define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
+#define ARM_LPAE_MAIR_ATTR_MASK		0xff
+#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
+#define ARM_LPAE_MAIR_ATTR_NC		0x44
+#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
+#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
+#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
+#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
+
+/* IOPTE accessors */
+#define iopte_deref(pte,d)					\
+	(__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)	\
+	& ~((1ULL << (d)->pg_shift) - 1)))
+
+#define iopte_type(pte,l)					\
+	(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
+
+#define iopte_prot(pte)	((pte) & ARM_LPAE_PTE_ATTR_MASK)
+
+#define iopte_leaf(pte,l)					\
+	(l == (ARM_LPAE_MAX_LEVELS - 1) ?			\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) :	\
+		(iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK))
+
+#define iopte_to_pfn(pte,d)					\
+	(((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift)
+
+#define pfn_to_iopte(pfn,d)					\
+	(((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1))
+
+struct arm_lpae_io_pgtable {
+	struct io_pgtable	iop;
+
+	int			levels;
+	int			pages_per_pgd;
+	unsigned long		pg_shift;
+	unsigned long		bits_per_level;
+
+	void			*pgd;
+};
+
+typedef u64 arm_lpae_iopte;
+
+static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+			     unsigned long iova, phys_addr_t paddr,
+			     arm_lpae_iopte prot, int lvl,
+			     arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte = prot;
+
+	/* We require an unmap first */
+	if (iopte_leaf(*ptep, lvl))
+		return -EEXIST;
[varun] Instead of returning an error, how about displaying a warning and replacing the entry? 

+
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		pte |= ARM_LPAE_PTE_TYPE_PAGE;
+	else
+		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
+	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
+
+	*ptep = pte;
+	data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), data->iop.cookie);
+	return 0;
+}
+
+static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
+			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
+			  int lvl, arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *cptep, pte;
+	void *cookie = data->iop.cookie;
+	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	/* Find our entry at the current level */
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+	/* If we can install a leaf entry at this level, then do so */
+	if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
+		return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
+
+	/* We can't allocate tables at the final level */
+	if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
+		return -EINVAL;

[varun] A warning message would be helpful.

+
+	/* Grab a pointer to the next level */
+	pte = *ptep;
+	if (!pte) {
+		cptep = alloc_pages_exact(1UL << data->pg_shift,
+					 GFP_ATOMIC | __GFP_ZERO);
+		if (!cptep)
+			return -ENOMEM;
+
+		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
+						 cookie);
+		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
+		*ptep = pte;
+		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	} else {
+		cptep = iopte_deref(pte, data);
+	}
+
+	/* Rinse, repeat */
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl + 1, cptep); 
+}
+
+static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
+					   int prot)
+{
+	arm_lpae_iopte pte;
+
+	if (data->iop.fmt == ARM_LPAE_S1) {
+		pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
+
+		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
+			pte |= ARM_LPAE_PTE_AP_RDONLY;
+
+		if (prot & IOMMU_CACHE)
+			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
+				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
+	} else {
+		pte = ARM_LPAE_PTE_HAP_FAULT;
+		if (prot & IOMMU_READ)
+			pte |= ARM_LPAE_PTE_HAP_READ;
+		if (prot & IOMMU_WRITE)
+			pte |= ARM_LPAE_PTE_HAP_WRITE;
+		if (prot & IOMMU_CACHE)
+			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+		else
+			pte |= ARM_LPAE_PTE_MEMATTR_NC;
+	}
+
+	if (prot & IOMMU_NOEXEC)
+		pte |= ARM_LPAE_PTE_XN;
+
+	return pte;
+}
[[varun]] Do you plan to add a flag to indicate device memory? We had a discussion about this on the patch I submitted. Maybe you can include that as part of this patch.
+
+static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
+			phys_addr_t paddr, size_t size, int iommu_prot) {
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+	arm_lpae_iopte prot;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return 0;
+
+	prot = arm_lpae_prot_to_pte(data, iommu_prot);
+	return __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep); }
+
+static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
+				    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte *start, *end;
+	unsigned long table_size;
+
+	/* Only leaf entries at the last level */
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		return;
+
+	table_size = 1UL << data->pg_shift;
+	if (lvl == ARM_LPAE_START_LVL(data))
+		table_size *= data->pages_per_pgd;
+
+	start = ptep;
+	end = (void *)ptep + table_size;
+
+	while (ptep != end) {
+		arm_lpae_iopte pte = *ptep++;
+
+		if (!pte || iopte_leaf(pte, lvl))
+			continue;
+
+		__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+	}
+
+	free_pages_exact(start, table_size);
+}
+
+static void arm_lpae_free_pgtable(struct io_pgtable *iop) {
+	struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
+
+	__arm_lpae_free_pgtable(data, ARM_LPAE_START_LVL(data), data->pgd);
+	kfree(data);
+}
+
+static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
+				    unsigned long iova, size_t size,
+				    arm_lpae_iopte prot, int lvl,
+				    arm_lpae_iopte *ptep, size_t blk_size) {
+	unsigned long blk_start, blk_end;
+	phys_addr_t blk_paddr;
+	arm_lpae_iopte table = 0;
+	void *cookie = data->iop.cookie;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+
+	blk_start = iova & ~(blk_size - 1);
+	blk_end = blk_start + blk_size;
+	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
+
+	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
+		arm_lpae_iopte *tablep;
+
+		/* Unmap! */
+		if (blk_start == iova)
+			continue;
+
+		/* __arm_lpae_map expects a pointer to the start of the table */
+		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);


[[varun]] It is not clear what's happening here. Maybe I am missing something, but where is the table allocated?

+		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
+				   tablep) < 0) {


[[varun]] Again, it is not clear how we are unmapping the range. The index at the current level should point to a page table (with contiguous block mappings). The unmap would be applied to the mappings at the next level, and the unmap can happen anywhere in the contiguous range. It seems that you are just creating a subset of the block mapping.

+			if (table) {
+				/* Free the table we allocated */
+				tablep = iopte_deref(table, data);
+				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
+			}
+			return 0; /* Bytes unmapped */
+		}
+	}
+
+	*ptep = table;
+	tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+	iova &= ~(blk_size - 1);
+	tlb->tlb_add_flush(iova, blk_size, true, cookie);
+	return size;
+}
+
+static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
+			    unsigned long iova, size_t size, int lvl,
+			    arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte;
+	struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
+	void *cookie = data->iop.cookie;
+	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = *ptep;
+
+	/* Something went horribly wrong and we ran out of page table */
+	if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
+		return 0;
+
+	/* If the size matches this level, we're in the right place */
+	if (size == blk_size) {
+		*ptep = 0;
+		tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
+
+		if (!iopte_leaf(pte, lvl)) {
+			/* Also flush any partial walks */
+			tlb->tlb_add_flush(iova, size, false, cookie);
+			tlb->tlb_sync(data->iop.cookie);
+			ptep = iopte_deref(pte, data);
+			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
+		} else {
+			tlb->tlb_add_flush(iova, size, true, cookie);
+		}
+
+		return size;
+	} else if (iopte_leaf(pte, lvl)) {
+		/*
+		 * Insert a table at the next level to map the old region,
+		 * minus the part we want to unmap
+		 */
[[varun]]  The "minus" part could be somewhere in the middle of the contiguous chunk? We should first break the entire block mapping into a next-level page mapping and then unmap the chunk.

+		return arm_lpae_split_blk_unmap(data, iova, size,
+						iopte_prot(pte), lvl, ptep,
+						blk_size);
+	}
+
+	/* Keep on walkin' */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep); }
+
+static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
+			  size_t size)
+{
+	size_t unmapped;
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable *iop = &data->iop;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
+	if (unmapped)
+		iop->cfg.tlb->tlb_sync(iop->cookie);
+
+	return unmapped;
+}
+
+static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+					 unsigned long iova)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte pte, *ptep = data->pgd;
+	int lvl = ARM_LPAE_START_LVL(data);
+
+	do {
+		/* Valid IOPTE pointer? */
+		if (!ptep)
+			return 0;
+
+		/* Grab the IOPTE we're interested in */
+		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
+
+		/* Valid entry? */
+		if (!pte)
+			return 0;
+
+		/* Leaf entry? */
+		if (iopte_leaf(pte,lvl))
+			goto found_translation;
+
+		/* Take it to the next level */
+		ptep = iopte_deref(pte, data);
+	} while (++lvl < ARM_LPAE_MAX_LEVELS);
+
+	/* Ran out of page tables to walk */
+	return 0;
+
+found_translation:
+	iova &= ((1 << data->pg_shift) - 1);
+	return ((phys_addr_t)iopte_to_pfn(pte,data) << data->pg_shift) | iova; 
+}
+
+static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg) {
+	unsigned long granule;
+
+	/*
+	 * We need to restrict the supported page sizes to match the
+	 * translation regime for a particular granule. Aim to match
+	 * the CPU page size if possible, otherwise prefer smaller sizes.
+	 * While we're at it, restrict the block sizes to match the
+	 * chosen granule.
+	 */
+	if (cfg->pgsize_bitmap & PAGE_SIZE)
+		granule = PAGE_SIZE;
+	else if (cfg->pgsize_bitmap & ~PAGE_MASK)
+		granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
+	else if (cfg->pgsize_bitmap & PAGE_MASK)
+		granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
+	else
+		granule = 0;
+
+	switch (granule) {
+	case SZ_4K:
+		cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
+		break;
+	case SZ_16K:
+		cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
+		break;
+	case SZ_64K:
+		cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
+		break;
+	default:
+		cfg->pgsize_bitmap = 0;
+	}
+}
+
+static struct arm_lpae_io_pgtable *
+arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
+	unsigned long va_bits;
+	struct arm_lpae_io_pgtable *data;
+
+	arm_lpae_restrict_pgsizes(cfg);
+
+	if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
+		return NULL;
+
+	if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
+		return NULL;
+
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return NULL;
+
+	data->pages_per_pgd = 1;
+	data->pg_shift = __ffs(cfg->pgsize_bitmap);
+	data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
+
+	va_bits = cfg->ias - data->pg_shift;
+	data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);


[[varun]]  Not related to the patch, but this would be applicable to the CPU tables as well, i.e. we can't support a 48-bit VA with 64 KB page tables, right? The ARM64 memory map shows the possibility of using 6 bits for the first-level page table.
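
(For reference, plugging a 64 KB granule and a 48-bit IAS into the
calculation above -- illustrative arithmetic only:

	pg_shift       = 16
	bits_per_level = 16 - ilog2(sizeof(arm_lpae_iopte)) = 13
	va_bits        = 48 - 16 = 32
	levels         = DIV_ROUND_UP(32, 13) = 3

so the walk uses three levels, with the first level resolving
32 - 2 * 13 = 6 bits, i.e. a 64-entry top-level table.)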

+
+	data->iop.ops = (struct io_pgtable_ops) {
+		.map		= arm_lpae_map,
+		.unmap		= arm_lpae_unmap,
+		.iova_to_phys	= arm_lpae_iova_to_phys,
+	};
+
+	return data;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/* TCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_IPS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	cfg->arm_lpae_s1_cfg.tcr = reg;
+
+	/* MAIRs */
+	reg = (ARM_LPAE_MAIR_ATTR_NC
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
+	      (ARM_LPAE_MAIR_ATTR_WBRWA
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
+	      (ARM_LPAE_MAIR_ATTR_DEVICE
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
+
+	cfg->arm_lpae_s1_cfg.mair[0] = reg;
+	cfg->arm_lpae_s1_cfg.mair[1] = 0;
+
+	/* Looking good; allocate a pgd */
+	data->pgd = alloc_pages_exact(1UL << data->pg_shift,
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, (1UL << data->pg_shift), cookie);
+
+	/* TTBRs */
+	cfg->arm_lpae_s1_cfg.ttbr[0] = virt_to_phys(data->pgd);
+	cfg->arm_lpae_s1_cfg.ttbr[1] = 0;
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
+						    void *cookie)
+{
+	u64 reg, sl;
+	size_t pgd_size;
+	struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
+
+	if (!data)
+		return NULL;
+
+	/*
+	 * Concatenate PGDs at level 1 if possible in order to reduce
+	 * the depth of the stage-2 walk.
+	 */
+	if (data->levels == ARM_LPAE_MAX_LEVELS) {
+		unsigned long pgd_bits, pgd_pages;
+		unsigned long va_bits = cfg->ias - data->pg_shift;
+
+		pgd_bits = data->bits_per_level * (data->levels - 1);
+		pgd_pages = 1 << (va_bits - pgd_bits);
+		if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
+			data->pages_per_pgd = pgd_pages;
+			data->levels--;
+		}
+	}
+
[[varun]] Can you point me to some documentation regarding stage-2 page table concatenation? I am not sure why this is required.
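
(For illustration only -- the numbers below are mine, not from the patch:
with a 4K granule and a 40-bit IPA, va_bits = 40 - 12 = 28 and
bits_per_level = 9, so levels = DIV_ROUND_UP(28, 9) = 4. The code above then
computes pgd_bits = 9 * 3 = 27 and pgd_pages = 1 << (28 - 27) = 2, i.e. two
level-1 tables are concatenated into an 8 KB pgd and the stage-2 walk starts
at level 1 instead of level 0, saving one lookup per translation.)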

+	/* VTCR */
+	reg = ARM_LPAE_TCR_EAE |
+	     (ARM_LPAE_TCR_SH_IS << ARM_LPAE_TCR_SH0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT) |
+	     (ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT);
+
+	sl = ARM_LPAE_START_LVL(data);
+
+	switch (1 << data->pg_shift) {
+	case SZ_4K:
+		reg |= ARM_LPAE_TCR_TG0_4K;
+		sl++; /* SL0 format is different for 4K granule size */
+		break;
+	case SZ_16K:
+		reg |= ARM_LPAE_TCR_TG0_16K;
+		break;
+	case SZ_64K:
+		reg |= ARM_LPAE_TCR_TG0_64K;
+		break;
+	}
+
+	switch (cfg->oas) {
+	case 32:
+		reg |= (ARM_LPAE_TCR_PS_32_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 36:
+		reg |= (ARM_LPAE_TCR_PS_36_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 40:
+		reg |= (ARM_LPAE_TCR_PS_40_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 42:
+		reg |= (ARM_LPAE_TCR_PS_42_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 44:
+		reg |= (ARM_LPAE_TCR_PS_44_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	case 48:
+		reg |= (ARM_LPAE_TCR_PS_48_BIT << ARM_LPAE_TCR_PS_SHIFT);
+		break;
+	default:
+		goto out_free_data;
+	}
+
+	reg |= (64ULL - cfg->ias) << ARM_LPAE_TCR_T0SZ_SHIFT;
+	reg |= (~sl & ARM_LPAE_TCR_SL0_MASK) << ARM_LPAE_TCR_SL0_SHIFT;
+	cfg->arm_lpae_s2_cfg.vtcr = reg;
+
+	/* Allocate pgd pages */
+	pgd_size = data->pages_per_pgd * (1UL << data->pg_shift);
+	data->pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
+	if (!data->pgd)
+		goto out_free_data;
+
+	cfg->tlb->flush_pgtable(data->pgd, pgd_size, cookie);
+
+	/* VTTBR */
+	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
+	return &data->iop;
+
+out_free_data:
+	kfree(data);
+	return NULL;
+}
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s1,
+	.free	= arm_lpae_free_pgtable,
+};
+
+struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns = {
+	.alloc	= arm_lpae_alloc_pgtable_s2,
+	.free	= arm_lpae_free_pgtable,
+};
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 82e39a0db94b..d0a2016efcb4 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -25,8 +25,15 @@
 
 #include "io-pgtable.h"
 
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s1_init_fns; 
+extern struct io_pgtable_init_fns io_pgtable_arm_lpae_s2_init_fns;
+
 static struct io_pgtable_init_fns *io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] =  {
+#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE
+	[ARM_LPAE_S1] = &io_pgtable_arm_lpae_s1_init_fns,
+	[ARM_LPAE_S2] = &io_pgtable_arm_lpae_s2_init_fns,
+#endif
 };
 
 struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 5ae75d9cae50..c1cff3d045db 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -33,10 +33,22 @@ struct io_pgtable_cfg {
 
 	/* Low-level data specific to the table format */
 	union {
+		struct {
+			u64	ttbr[2];
+			u64	tcr;
+			u64	mair[2];
+		} arm_lpae_s1_cfg;
+
+		struct {
+			u64	vttbr;
+			u64	vtcr;
+		} arm_lpae_s2_cfg;
 	};
 };
 
 enum io_pgtable_fmt {
+	ARM_LPAE_S1,
+	ARM_LPAE_S2,
 	IO_PGTABLE_NUM_FMTS,
 };
 
Thanks,
Varun

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-05 10:55         ` Varun Sethi
@ 2014-12-05 18:48             ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-05 18:48 UTC (permalink / raw)
  To: Varun Sethi
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Fri, Dec 05, 2014 at 10:55:11AM +0000, Varun Sethi wrote:
> Hi Will,

Hi Varun,

Thanks for the review!

> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +                            unsigned long iova, phys_addr_t paddr,
> +                            arm_lpae_iopte prot, int lvl,
> +                            arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte pte = prot;
> +
> +       /* We require an unmap first */
> +       if (iopte_leaf(*ptep, lvl))
> +               return -EEXIST;
> [varun] Instead of returning an error, how about displaying a warning and
> replacing the entry?

I'd be ok with displaying a warning, but I don't think we should just
continue. It indicates a misuse of the IOMMU API and probably a missing
TLBI.
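
A minimal sketch of what that could look like (illustrative only; the
message wording below is mine, not part of the patch):

	/* We require an unmap first */
	if (WARN(iopte_leaf(*ptep, lvl),
		 "%s: remapping a mapped iova; missing unmap (and TLBI)?\n",
		 __func__))
		return -EEXIST;

i.e. shout about the API misuse, but still refuse to overwrite the pte.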

> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +                         phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +                         int lvl, arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte *cptep, pte;
> +       void *cookie = data->iop.cookie;
> +       size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       /* Find our entry at the current level */
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +       /* If we can install a leaf entry at this level, then do so */
> +       if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +               return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
> +
> +       /* We can't allocate tables at the final level */
> +       if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +               return -EINVAL;
> 
> [varun] A warning message would be helpful.

Sure, I can add one.
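
For example (again, the wording is illustrative only):

	/* We can't allocate tables at the final level */
	if (WARN(lvl >= ARM_LPAE_MAX_LEVELS - 1,
		 "cannot allocate a table at the final level\n"))
		return -EINVAL;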

> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> +                                          int prot)
> +{
> +       arm_lpae_iopte pte;
> +
> +       if (data->iop.fmt == ARM_LPAE_S1) {
> +               pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +               if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +                       pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +               if (prot & IOMMU_CACHE)
> +                       pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +                               << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> +       } else {
> +               pte = ARM_LPAE_PTE_HAP_FAULT;
> +               if (prot & IOMMU_READ)
> +                       pte |= ARM_LPAE_PTE_HAP_READ;
> +               if (prot & IOMMU_WRITE)
> +                       pte |= ARM_LPAE_PTE_HAP_WRITE;
> +               if (prot & IOMMU_CACHE)
> +                       pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +               else
> +                       pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +       }
> +
> +       if (prot & IOMMU_NOEXEC)
> +               pte |= ARM_LPAE_PTE_XN;
> +
> +       return pte;
> +}
> [[varun]] Do you plan to add a flag to indicate device memory? We had a
> discussion about this on the patch I submitted. Maybe you can include
> that as part of this patch.

That needs to go in as a separate patch. I think you should continue to push
that separately!

> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +                                   unsigned long iova, size_t size,
> +                                   arm_lpae_iopte prot, int lvl,
> +                                   arm_lpae_iopte *ptep, size_t blk_size) {
> +       unsigned long blk_start, blk_end;
> +       phys_addr_t blk_paddr;
> +       arm_lpae_iopte table = 0;
> +       void *cookie = data->iop.cookie;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +       blk_start = iova & ~(blk_size - 1);
> +       blk_end = blk_start + blk_size;
> +       blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +       for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +               arm_lpae_iopte *tablep;
> +
> +               /* Unmap! */
> +               if (blk_start == iova)
> +                       continue;
> +
> +               /* __arm_lpae_map expects a pointer to the start of the table */
> +               tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
> 
> 
> [[varun]] Not clear what's happening here. Maybe I am missing something,
> but where is the table allocated?

It is allocated in __arm_lpae_map, because the pte will be 0. The
subtraction above is to avoid us having to allocate a whole level, just for
a single invalid pte.
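
Spelling that trick out (a commented sketch of the same two lines, not a
change to the patch):

    	arm_lpae_iopte table = 0;	/* stands in for the single pte of interest */
    	arm_lpae_iopte *tablep;

    	/*
    	 * __arm_lpae_map() indexes the pointer it is given with
    	 * ARM_LPAE_LVL_IDX(iova, lvl, data), so bias the pointer backwards
    	 * such that the indexed slot lands exactly on &table.  Because
    	 * table == 0, __arm_lpae_map() sees an empty entry, allocates the
    	 * next-level table and writes the resulting table descriptor into
    	 * 'table' rather than into a full level's worth of memory.
    	 */
    	tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);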
> 
> +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +                                  tablep) < 0) {
> 
> 
> [[varun]] Again not clear how we are unmapping the range. The index at the
> current level should point to a page table (with contiguous block
> mappings). Unmap would be applied to the mappings at the next level. Unmap
> can happen anywhere in the contiguous range. It seems that you are just
> creating a subset of the block mapping.

We will be unmapping a single entry at the next level, so we basically
create a table, then map everything at the next level apart from the part we
need to unmap.
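
As a worked example of what the loop does (hypothetical numbers, standalone
user-space arithmetic rather than the kernel code): unmapping one 4k page out
of a 2M block with a 4k granule re-maps the other 511 chunks at the next level.

    #include <stdio.h>

    int main(void)
    {
    	unsigned long blk_size  = 2UL << 20;	/* 2M block at this level   */
    	unsigned long size      = 4UL << 10;	/* 4k region being unmapped */
    	unsigned long iova      = 0x40005000;	/* lies inside the block    */
    	unsigned long blk_start = iova & ~(blk_size - 1);
    	unsigned long blk_end   = blk_start + blk_size;
    	unsigned long remapped  = 0;

    	for (; blk_start < blk_end; blk_start += size)
    		if (blk_start != iova)	/* the skipped chunk stays unmapped */
    			remapped++;

    	printf("re-mapped %lu of 512 chunks\n", remapped);	/* 511 */
    	return 0;
    }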

> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +                           unsigned long iova, size_t size, int lvl,
> +                           arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte pte;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +       void *cookie = data->iop.cookie;
> +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +       pte = *ptep;
> +
> +       /* Something went horribly wrong and we ran out of page table */
> +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +               return 0;
> +
> +       /* If the size matches this level, we're in the right place */
> +       if (size == blk_size) {
> +               *ptep = 0;
> +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +               if (!iopte_leaf(pte, lvl)) {
> +                       /* Also flush any partial walks */
> +                       tlb->tlb_add_flush(iova, size, false, cookie);
> +                       tlb->tlb_sync(data->iop.cookie);
> +                       ptep = iopte_deref(pte, data);
> +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +               } else {
> +                       tlb->tlb_add_flush(iova, size, true, cookie);
> +               }
> +
> +               return size;
> +       } else if (iopte_leaf(pte, lvl)) {
> +               /*
> +                * Insert a table at the next level to map the old region,
> +                * minus the part we want to unmap
> +                */
> [[varun]]  The 'minus' part could be somewhere in the middle of the contiguous
> chunk? We should first break the entire block mapping into a next-level page
> mapping and then unmap a chunk.

The amount to unmap will match exactly one entry at the next level -- that's
enforced by the IOMMU API (and it will also be aligned as such).

> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> +       unsigned long va_bits;
> +       struct arm_lpae_io_pgtable *data;
> +
> +       arm_lpae_restrict_pgsizes(cfg);
> +
> +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +               return NULL;
> +
> +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return NULL;
> +
> +       data->pages_per_pgd = 1;
> +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +       data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
> +
> +       va_bits = cfg->ias - data->pg_shift;
> +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> 
> [[varun]]  Not related to the patch, but this would be applicable to the
> CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page tables,
> right? The ARM64 memory map shows the possibility of using 6 bits for the
> first level page table.

Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
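
For what it's worth, the arithmetic for a hypothetical 64k granule with a
48-bit IAS works out fine (standalone check mirroring the formulas in the
patch):

    #include <stdio.h>

    int main(void)
    {
    	unsigned int ias = 48, pg_shift = 16;		/* 64k granule   */
    	unsigned int bits_per_level = pg_shift - 3;	/* 8-byte iopte  */
    	unsigned int va_bits = ias - pg_shift;		/* 32             */
    	unsigned int levels =
    		(va_bits + bits_per_level - 1) / bits_per_level;

    	/* 3 levels; the top level resolves only 6 bits (64 entries) */
    	printf("levels = %u, top-level bits = %u\n",
    	       levels, va_bits - (levels - 1) * bits_per_level);
    	return 0;
    }

The top level is simply smaller than the others, which is exactly the 6-bit
first-level index mentioned above.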

> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +                                                   void *cookie)
> +{
> +       u64 reg, sl;
> +       size_t pgd_size;
> +       struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +       if (!data)
> +               return NULL;
> +
> +       /*
> +        * Concatenate PGDs at level 1 if possible in order to reduce
> +        * the depth of the stage-2 walk.
> +        */
> +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +               unsigned long pgd_bits, pgd_pages;
> +               unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +               pgd_bits = data->bits_per_level * (data->levels - 1);
> +               pgd_pages = 1 << (va_bits - pgd_bits);
> +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +                       data->pages_per_pgd = pgd_pages;
> +                       data->levels--;
> +               }
> +       }
> +
> [[varun]] Can you point me to some documentation regarding stage 2 page
> concatenation. Not sure why this is required?

It's all in the ARM ARM. The idea is to reduce the depth of the stage-2
walk, since that can have an impact on performance when it gets too deep
(remember that stage-1 table walks are themselves subjected to stage-2
translation).
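
As a rough sketch of the arithmetic in the hunk above, for a hypothetical
stage-2 configuration with a 4k granule and a 40-bit IPA space:

    #include <stdio.h>

    #define MAX_LEVELS		4
    #define S2_MAX_CONCAT_PAGES	16

    int main(void)
    {
    	unsigned int pg_shift = 12, bits_per_level = 9, ias = 40;
    	unsigned int va_bits = ias - pg_shift;		/* 28 */
    	unsigned int levels =
    		(va_bits + bits_per_level - 1) / bits_per_level;

    	if (levels == MAX_LEVELS) {
    		unsigned int pgd_bits = bits_per_level * (levels - 1);
    		unsigned int pgd_pages = 1u << (va_bits - pgd_bits);

    		if (pgd_pages <= S2_MAX_CONCAT_PAGES)
    			levels--;	/* 2 concatenated level-1 pages */
    	}

    	printf("levels = %u\n", levels);	/* 4-level walk becomes 3 */
    	return 0;
    }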

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-05 18:48             ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-05 18:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 05, 2014 at 10:55:11AM +0000, Varun Sethi wrote:
> Hi Will,

Hi Varun,

Thanks for the review!

> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +                            unsigned long iova, phys_addr_t paddr,
> +                            arm_lpae_iopte prot, int lvl,
> +                            arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte pte = prot;
> +
> +       /* We require an unmap first */
> +       if (iopte_leaf(*ptep, lvl))
> +               return -EEXIST;
> [varun] Instead of returning an error, how about displaying a warning and
> replacing the entry?

I'd be ok with displaying a warning, but I don't think we should just
continue. It indicates a misuse of the IOMMU API and probably a missing
TLBI.

> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +                         phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +                         int lvl, arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte *cptep, pte;
> +       void *cookie = data->iop.cookie;
> +       size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       /* Find our entry at the current level */
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +       /* If we can install a leaf entry at this level, then do so */
> +       if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +               return arm_lpae_init_pte(data, iova, paddr, prot, lvl, ptep);
> +
> +       /* We can't allocate tables at the final level */
> +       if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +               return -EINVAL;
> 
> [varun] A warning message would be helpful.

Sure, I can add one.

> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> +                                          int prot)
> +{
> +       arm_lpae_iopte pte;
> +
> +       if (data->iop.fmt == ARM_LPAE_S1) {
> +               pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +               if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +                       pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +               if (prot & IOMMU_CACHE)
> +                       pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +                               << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> +       } else {
> +               pte = ARM_LPAE_PTE_HAP_FAULT;
> +               if (prot & IOMMU_READ)
> +                       pte |= ARM_LPAE_PTE_HAP_READ;
> +               if (prot & IOMMU_WRITE)
> +                       pte |= ARM_LPAE_PTE_HAP_WRITE;
> +               if (prot & IOMMU_CACHE)
> +                       pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +               else
> +                       pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +       }
> +
> +       if (prot & IOMMU_NOEXEC)
> +               pte |= ARM_LPAE_PTE_XN;
> +
> +       return pte;
> +}
> [[varun]] Do you plan to add a flag to indicate device memory? We had a
> discussion about this on the patch submitted by me. Maybe you can include
> that as part of this patch.

That needs to go in as a separate patch. I think you should continue to push
that separately!

> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +                                   unsigned long iova, size_t size,
> +                                   arm_lpae_iopte prot, int lvl,
> +                                   arm_lpae_iopte *ptep, size_t blk_size) {
> +       unsigned long blk_start, blk_end;
> +       phys_addr_t blk_paddr;
> +       arm_lpae_iopte table = 0;
> +       void *cookie = data->iop.cookie;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +       blk_start = iova & ~(blk_size - 1);
> +       blk_end = blk_start + blk_size;
> +       blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +       for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +               arm_lpae_iopte *tablep;
> +
> +               /* Unmap! */
> +               if (blk_start == iova)
> +                       continue;
> +
> +               /* __arm_lpae_map expects a pointer to the start of the table */
> +               tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
> 
> 
> [[varun]] Not clear what's happening here. Maybe I am missing something,
> but where is the table allocated?

It is allocated in __arm_lpae_map, because the pte will be 0. The
subtraction above is to avoid us having to allocate a whole level, just for
a single invalid pte.
> 
> +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +                                  tablep) < 0) {
> 
> 
> [[varun]] Again not clear how we are unmapping the range. The index at the
> current level should point to a page table (with contiguous block
> mappings). Unmap would be applied to the mappings at the next level. Unmap
> can happen anywhere in the contiguous range. It seems that you are just
> creating a subset of the block mapping.

We will be unmapping a single entry at the next level, so we basically
create a table, then map everything at the next level apart from the part we
need to unmap.

> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +                           unsigned long iova, size_t size, int lvl,
> +                           arm_lpae_iopte *ptep)
> +{
> +       arm_lpae_iopte pte;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +       void *cookie = data->iop.cookie;
> +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +       pte = *ptep;
> +
> +       /* Something went horribly wrong and we ran out of page table */
> +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +               return 0;
> +
> +       /* If the size matches this level, we're in the right place */
> +       if (size == blk_size) {
> +               *ptep = 0;
> +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +               if (!iopte_leaf(pte, lvl)) {
> +                       /* Also flush any partial walks */
> +                       tlb->tlb_add_flush(iova, size, false, cookie);
> +                       tlb->tlb_sync(data->iop.cookie);
> +                       ptep = iopte_deref(pte, data);
> +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +               } else {
> +                       tlb->tlb_add_flush(iova, size, true, cookie);
> +               }
> +
> +               return size;
> +       } else if (iopte_leaf(pte, lvl)) {
> +               /*
> +                * Insert a table at the next level to map the old region,
> +                * minus the part we want to unmap
> +                */
> [[varun]]  The 'minus' part could be somewhere in the middle of the contiguous
> chunk? We should first break the entire block mapping into a next-level page
> mapping and then unmap a chunk.

The amount to unmap will match exactly one entry at the next level -- that's
enforced by the IOMMU API (and it will also be aligned as such).

> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> +       unsigned long va_bits;
> +       struct arm_lpae_io_pgtable *data;
> +
> +       arm_lpae_restrict_pgsizes(cfg);
> +
> +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +               return NULL;
> +
> +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return NULL;
> +
> +       data->pages_per_pgd = 1;
> +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +       data->bits_per_level = data->pg_shift - ilog2(sizeof(arm_lpae_iopte));
> +
> +       va_bits = cfg->ias - data->pg_shift;
> +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> 
> [[varun]]  Not related to the patch, but this would be applicable to the
> CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page tables,
> right? The ARM64 memory map shows the possibility of using 6 bits for the
> first level page table.

Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?

> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +                                                   void *cookie)
> +{
> +       u64 reg, sl;
> +       size_t pgd_size;
> +       struct arm_lpae_io_pgtable *data = arm_lpae_alloc_pgtable(cfg);
> +
> +       if (!data)
> +               return NULL;
> +
> +       /*
> +        * Concatenate PGDs at level 1 if possible in order to reduce
> +        * the depth of the stage-2 walk.
> +        */
> +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +               unsigned long pgd_bits, pgd_pages;
> +               unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +               pgd_bits = data->bits_per_level * (data->levels - 1);
> +               pgd_pages = 1 << (va_bits - pgd_bits);
> +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +                       data->pages_per_pgd = pgd_pages;
> +                       data->levels--;
> +               }
> +       }
> +
> [[varun]] Can you point me to some documentation regarding stage 2 page
> concatenation. Not sure why this is required?

It's all in the ARM ARM. The idea is to reduce the depth of the stage-2
walk, since that can have an impact on performance when it gets too deep
(remember that stage-1 table walks are themselves subjected to stage-2
translation).

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-02 11:47                     ` Laurent Pinchart
@ 2014-12-05 18:48                       ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-05 18:48 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Dec 02, 2014 at 11:47:36AM +0000, Laurent Pinchart wrote:
> On Tuesday 02 December 2014 09:41:56 Will Deacon wrote:
> > On Mon, Dec 01, 2014 at 08:21:58PM +0000, Laurent Pinchart wrote:
> > > On Monday 01 December 2014 17:23:15 Will Deacon wrote:
> > > > On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> > > > > On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > > > > > +     /* Looking good; allocate a pgd */
> > > > > > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > > > > > +                                   GFP_KERNEL | __GFP_ZERO);
> > > > > 
> > > > > data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL <<
> > > > > data->pg_shift will thus be equal to the smallest page size supported
> > > > > by the IOMMU. This will thus allocate 4kB, 16kB or 64kB depending on
> > > > > the IOMMU configuration. However, if I'm not mistaken the top-level
> > > > > directory needs to store one entry per largest supported page size.
> > > > > That's 4, 128 or 8 entries depending on the configuration. You're thus
> > > > > over-allocating.
> > > > 
> > > > Yeah, I'll take a closer look at this. The size of the pgd really
> > > > depends on the TxSZ configuration, which in turn depends on the ias and
> > > > the page size. There are also alignment constraints to bear in mind, but
> > > > I'll see what I can do (as it stands, over-allocating will always work).
> > > 
> > > Beside wasting memory, the code also doesn't reflect the requirements. It
> > > works by chance, meaning it could break later.
> > 
> > It won't break, as the maximum size *is* bounded by a page for stage-1
> > and we already handle stage-2 concatenation correctly.
> 
> What I mean is that there's no correlation between the required size and the 
> allocated size in the current code. It happens to work, but if the driver gets 
> extended later to support more IOMMU configurations subtle bugs may crop up.
> 
> > > That's why I'd like to see this
> > > being fixed. Can't the size be computed with something like
> > > 
> > > 	size = (1 << (ias - data->levels * data->pg_shift))
> > > 	
> > > 	     * sizeof(arm_lpae_iopte);
> > > 
> > > (please add a proper detailed comment to explain the computation, as the
> > > meaning is not straightforward)
> > 
> > That's actually the easy part. The harder part is getting the correct
> > alignment, which means managing my own kmem_cache on a per-ops basis. That
> > feels like overkill to me and we also need to make sure that we don't screw
> > up the case of concatenated pgds at stage-2. On top of that, since each
> > cache would be per-ops, I'm not even sure we'd save anything (the slab
> > allocators all operate on pages afaict).
> > 
> > If I use alloc_pages_exact, we'll still have some wastage, but it would
> > be less for the case where the CPU page size is smaller than the SMMU page
> > size. Do you think that's worth the extra complexity? We allocate full pages
> > at all levels after the pgd, so the wastage is relatively small.
> > 
> > An alternative would be preinitialising some caches for `likely' pgd sizes,
> > but that's also horrible, especially if the kernel decides that it doesn't
> > need a bunch of the configurations at runtime.
> 
> How about just computing the right size, aligning it to a page size, and using
> alloc_pages_exact? The waste is small, so it doesn't justify anything more
> complex than that.

Ok, I'll have a go at that.
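
For reference, one way the sizing could come out under the patch's level
arithmetic (a hypothetical standalone check, not the eventual fix):

    #include <stdio.h>

    static unsigned long pgd_size(unsigned int ias, unsigned int pg_shift)
    {
    	unsigned int bits_per_level = pg_shift - 3;	/* 8-byte iopte */
    	unsigned int va_bits = ias - pg_shift;
    	unsigned int levels =
    		(va_bits + bits_per_level - 1) / bits_per_level;
    	unsigned int pgd_index_bits =
    		va_bits - (levels - 1) * bits_per_level;

    	return sizeof(unsigned long long) << pgd_index_bits;
    }

    int main(void)
    {
    	/* For a 32-bit IAS (the case giving the 4/128/8 entry counts
    	 * mentioned earlier in the thread): 32, 1024 and 64 bytes for 4k,
    	 * 16k and 64k granules.  alloc_pages_exact() would still round
    	 * each up to a full page. */
    	printf("%lu %lu %lu\n",
    	       pgd_size(32, 12), pgd_size(32, 14), pgd_size(32, 16));
    	return 0;
    }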

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-05 18:48                       ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-05 18:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 02, 2014 at 11:47:36AM +0000, Laurent Pinchart wrote:
> On Tuesday 02 December 2014 09:41:56 Will Deacon wrote:
> > On Mon, Dec 01, 2014 at 08:21:58PM +0000, Laurent Pinchart wrote:
> > > On Monday 01 December 2014 17:23:15 Will Deacon wrote:
> > > > On Sun, Nov 30, 2014 at 11:29:46PM +0000, Laurent Pinchart wrote:
> > > > > On Thursday 27 November 2014 11:51:16 Will Deacon wrote:
> > > > > > +     /* Looking good; allocate a pgd */
> > > > > > +     data->pgd = alloc_pages_exact(1UL << data->pg_shift,
> > > > > > +                                   GFP_KERNEL | __GFP_ZERO);
> > > > > 
> > > > > data->pg_shift is computed as __ffs(cfg->pgsize_bitmap). 1UL <<
> > > > > data->pg_shift will thus be equal to the smallest page size supported
> > > > > by the IOMMU. This will thus allocate 4kB, 16kB or 64kB depending on
> > > > > the IOMMU configuration. However, if I'm not mistaken the top-level
> > > > > directory needs to store one entry per largest supported page size.
> > > > > That's 4, 128 or 8 entries depending on the configuration. You're thus
> > > > > over-allocating.
> > > > 
> > > > Yeah, I'll take a closer look at this. The size of the pgd really
> > > > depends on the TxSZ configuration, which in turn depends on the ias and
> > > > the page size. There are also alignment constraints to bear in mind, but
> > > > I'll see what I can do (as it stands, over-allocating will always work).
> > > 
> > > Beside wasting memory, the code also doesn't reflect the requirements. It
> > > works by chance, meaning it could break later.
> > 
> > It won't break, as the maximum size *is* bounded by a page for stage-1
> > and we already handle stage-2 concatenation correctly.
> 
> What I mean is that there's no correlation between the required size and the 
> allocated size in the current code. It happens to work, but if the driver gets 
> extended later to support more IOMMU configurations subtle bugs may crop up.
> 
> > > That's why I'd like to see this
> > > being fixed. Can't the size be computed with something like
> > > 
> > > 	size = (1 << (ias - data->levels * data->pg_shift))
> > > 	
> > > 	     * sizeof(arm_lpae_iopte);
> > > 
> > > (please add a proper detailed comment to explain the computation, as the
> > > meaning is not straightforward)
> > 
> > That's actually the easy part. The harder part is getting the correct
> > alignment, which means managing my own kmem_cache on a per-ops basis. That
> > feels like overkill to me and we also need to make sure that we don't screw
> > up the case of concatenated pgds at stage-2. On top of that, since each
> > cache would be per-ops, I'm not even sure we'd save anything (the slab
> > allocators all operate on pages afaict).
> > 
> > If I use alloc_pages_exact, we'll still have some wastage, but it would
> > be less for the case where the CPU page size is smaller than the SMMU page
> > size. Do you think that's worth the extra complexity? We allocate full pages
> > at all levels after the pgd, so the wastage is relatively small.
> > 
> > An alternative would be preinitialising some caches for `likely' pgd sizes,
> > but that's also horrible, especially if the kernel decides that it doesn't
> > need a bunch of the configurations at runtime.
> 
> How about just computing the right size, aligning it to a page size, and using
> alloc_pages_exact? The waste is small, so it doesn't justify anything more
> complex than that.

Ok, I'll have a go at that.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-05 18:48             ` Will Deacon
@ 2014-12-14 17:45                 ` Varun Sethi
  -1 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-14 17:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,
Please find my response inline. Search for "varun".

-----Original Message-----
From: Will Deacon [mailto:will.deacon-5wv7dgnIgG8@public.gmane.org] 
Sent: Saturday, December 06, 2014 12:18 AM
To: Sethi Varun-B16395
Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; prem.mallappa-dY08KVG/lbpWk0Htik3J/w@public.gmane.org; Robin Murphy; lauraa-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; mitchelh-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org; joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; m.szyprowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Fri, Dec 05, 2014 at 10:55:11AM +0000, Varun Sethi wrote:
> Hi Will,

Hi Varun,

Thanks for the review!

> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +                            unsigned long iova, phys_addr_t paddr,
> +                            arm_lpae_iopte prot, int lvl,
> +                            arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte pte = prot;
> +
> +       /* We require an unmap first */
> +       if (iopte_leaf(*ptep, lvl))
> +               return -EEXIST;
> [varun] Instead of returning an error, how about displaying a warning 
> and replacing the entry?

I'd be ok with displaying a warning, but I don't think we should just continue. It indicates a misuse of the IOMMU API and probably a missing TLBI.


[[varun]] This may not apply now, but what if we are dealing with a case where memory is not pinned? It may be possible to hook up (without an unmap) an iova to a different physical address. Of course, TLB invalidation would be required. Could this scenario be relevant in the case of stall mode?

> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +                         phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +                         int lvl, arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte *cptep, pte;
> +       void *cookie = data->iop.cookie;
> +       size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       /* Find our entry at the current level */
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +       /* If we can install a leaf entry at this level, then do so */
> +       if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +               return arm_lpae_init_pte(data, iova, paddr, prot, lvl, 
> + ptep);
> +
> +       /* We can't allocate tables at the final level */
> +       if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +               return -EINVAL;
> 
> [varun] A warning message would be helpful.

Sure, I can add one.

> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> +                                          int prot) {
> +       arm_lpae_iopte pte;
> +
> +       if (data->iop.fmt == ARM_LPAE_S1) {
> +               pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +               if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +                       pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +               if (prot & IOMMU_CACHE)
> +                       pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +                               << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> +       } else {
> +               pte = ARM_LPAE_PTE_HAP_FAULT;
> +               if (prot & IOMMU_READ)
> +                       pte |= ARM_LPAE_PTE_HAP_READ;
> +               if (prot & IOMMU_WRITE)
> +                       pte |= ARM_LPAE_PTE_HAP_WRITE;
> +               if (prot & IOMMU_CACHE)
> +                       pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +               else
> +                       pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +       }
> +
> +       if (prot & IOMMU_NOEXEC)
> +               pte |= ARM_LPAE_PTE_XN;
> +
> +       return pte;
> +}
> [[varun]] Do you plan to add a flag to indicate device memory? We had 
> a discussion about this on the patch submitted by me. Maybe you can
> include that as part of this patch.

That needs to go in as a separate patch. I think you should continue to push that separately!

> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +                                   unsigned long iova, size_t size,
> +                                   arm_lpae_iopte prot, int lvl,
> +                                   arm_lpae_iopte *ptep, size_t blk_size) {
> +       unsigned long blk_start, blk_end;
> +       phys_addr_t blk_paddr;
> +       arm_lpae_iopte table = 0;
> +       void *cookie = data->iop.cookie;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +       blk_start = iova & ~(blk_size - 1);
> +       blk_end = blk_start + blk_size;
> +       blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +       for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +               arm_lpae_iopte *tablep;
> +
> +               /* Unmap! */
> +               if (blk_start == iova)
> +                       continue;
> +
> +               /* __arm_lpae_map expects a pointer to the start of the table */
> +               tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, 
> + data);
> 
> 
> [[varun]] Not clear what's happening here. Maybe I am missing
> something, but where is the table allocated?

It is allocated in __arm_lpae_map, because the pte will be 0. The subtraction above is to avoid us having to allocate a whole level, just for a single invalid pte.
> 
> +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +                                  tablep) < 0) {
> 
> 
> [[varun]] Again not clear how we are unmapping the range. The index at the
> current level should point to a page table (with contiguous block
> mappings). Unmap would be applied to the mappings at the next level.
> Unmap can happen anywhere in the contiguous range. It seems that you 
> are just creating a subset of the block mapping.

We will be unmapping a single entry at the next level, so we basically create a table, then map everything at the next level apart from the part we need to unmap.


[varun] OK, but you could potentially end up splitting the mapping down to the smallest possible page size, e.g. 4K. You don't seem to take into account the possibility of using the block size at the next level. For example, take a case where we have a huge page mapping using a 1G page size and we get an unmap request for 4K. We could still map the bulk of the range using 2M blocks at the next level; only the entry covering the 4K region to be unmapped would need to go down a further level.
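
(Rough arithmetic behind that suggestion, with hypothetical numbers: unmapping
one 4k page out of a 1G block while keeping 2M blocks wherever possible.)

    #include <stdio.h>

    int main(void)
    {
    	unsigned long block = 1UL << 30;	/* original 1G block    */
    	unsigned long next  = 2UL << 20;	/* 2M blocks at level 2 */
    	unsigned long page  = 4UL << 10;	/* 4k pages at level 3  */

    	/* 511 x 2M blocks plus one table entry at level 2 ... */
    	printf("level 2: %lu blocks + 1 table\n", block / next - 1);
    	/* ... and 511 x 4k pages at level 3 */
    	printf("level 3: %lu pages\n", next / page - 1);
    	return 0;
    }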
 
> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +                           unsigned long iova, size_t size, int lvl,
> +                           arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte pte;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +       void *cookie = data->iop.cookie;
> +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +       pte = *ptep;
> +
> +       /* Something went horribly wrong and we ran out of page table */
> +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +               return 0;
> +
> +       /* If the size matches this level, we're in the right place */
> +       if (size == blk_size) {
> +               *ptep = 0;
> +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +               if (!iopte_leaf(pte, lvl)) {
> +                       /* Also flush any partial walks */
> +                       tlb->tlb_add_flush(iova, size, false, cookie);
> +                       tlb->tlb_sync(data->iop.cookie);
> +                       ptep = iopte_deref(pte, data);
> +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +               } else {
> +                       tlb->tlb_add_flush(iova, size, true, cookie);
> +               }
> +
> +               return size;
> +       } else if (iopte_leaf(pte, lvl)) {
> +               /*
> +                * Insert a table at the next level to map the old region,
> +                * minus the part we want to unmap
> +                */
> [[varun]]  The 'minus' part could be somewhere in the middle of the contiguous
> chunk? We should first break the entire block mapping into a next-level page
> mapping and then unmap a chunk.

The amount to unmap will match exactly one entry at the next level -- that's enforced by the IOMMU API (and it will also be aligned as such).

> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> +       unsigned long va_bits;
> +       struct arm_lpae_io_pgtable *data;
> +
> +       arm_lpae_restrict_pgsizes(cfg);
> +
> +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +               return NULL;
> +
> +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return NULL;
> +
> +       data->pages_per_pgd = 1;
> +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +       data->bits_per_level = data->pg_shift - 
> + ilog2(sizeof(arm_lpae_iopte));
> +
> +       va_bits = cfg->ias - data->pg_shift;
> +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> 
> [[varun]]  Not related to the patch, but this would be applicable to 
> the CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page
> tables, right? The ARM64 memory map shows the possibility of using 6 bits
> for the first level page table.

Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?

[varun]  My concern was with respect to the bits per level, which is uneven for the 64K page size. Just wondering how things would work with 64K pages when we do a 3-level page table lookup.

> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +                                                   void *cookie) {
> +       u64 reg, sl;
> +       size_t pgd_size;
> +       struct arm_lpae_io_pgtable *data = 
> +arm_lpae_alloc_pgtable(cfg);
> +
> +       if (!data)
> +               return NULL;
> +
> +       /*
> +        * Concatenate PGDs at level 1 if possible in order to reduce
> +        * the depth of the stage-2 walk.
> +        */
> +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +               unsigned long pgd_bits, pgd_pages;
> +               unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +               pgd_bits = data->bits_per_level * (data->levels - 1);
> +               pgd_pages = 1 << (va_bits - pgd_bits);
> +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +                       data->pages_per_pgd = pgd_pages;
> +                       data->levels--;
> +               }
> +       }
> +
> [[varun]] Can you point me to some documentation regarding stage 2 
> page concatenation. Not sure why this is required?

It's all in the ARM ARM. The idea is to reduce the depth of the stage-2 walk, since that can have an impact on performance when it gets too deep (remember that stage-1 table walks are themselves subjected to stage-2 translation).


[varun] Just curious, why not consider concatenation at stage 1?

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-14 17:45                 ` Varun Sethi
  0 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-14 17:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,
Please find my response inline. Search for "varun".

-----Original Message-----
From: Will Deacon [mailto:will.deacon at arm.com] 
Sent: Saturday, December 06, 2014 12:18 AM
To: Sethi Varun-B16395
Cc: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-foundation.org; prem.mallappa at broadcom.com; Robin Murphy; lauraa at codeaurora.org; mitchelh at codeaurora.org; laurent.pinchart at ideasonboard.com; joro at 8bytes.org; m.szyprowski at samsung.com
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Fri, Dec 05, 2014 at 10:55:11AM +0000, Varun Sethi wrote:
> Hi Will,

Hi Varun,

Thanks for the review!

> +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> +                            unsigned long iova, phys_addr_t paddr,
> +                            arm_lpae_iopte prot, int lvl,
> +                            arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte pte = prot;
> +
> +       /* We require an unmap first */
> +       if (iopte_leaf(*ptep, lvl))
> +               return -EEXIST;
> [varun] Instead of returning an error, how about displaying a warning 
> and replacing the entry?

I'd be ok with displaying a warning, but I don't think we should just continue. It indicates a misuse of the IOMMU API and probably a missing TLBI.


[[varun]] This may not apply now, but what if we are dealing with a case where memory is not pinned? It may be possible to hook up (without an unmap) an iova to a different physical address. Of course, TLB invalidation would be required. Could this scenario be relevant in the case of stall mode?

> +static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +                         phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
> +                         int lvl, arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte *cptep, pte;
> +       void *cookie = data->iop.cookie;
> +       size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       /* Find our entry at the current level */
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +       /* If we can install a leaf entry at this level, then do so */
> +       if (size == block_size && (size & data->iop.cfg.pgsize_bitmap))
> +               return arm_lpae_init_pte(data, iova, paddr, prot, lvl, 
> + ptep);
> +
> +       /* We can't allocate tables at the final level */
> +       if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
> +               return -EINVAL;
> 
> [varun] A warning message would be helpful.

Sure, I can add one.

> +static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> +                                          int prot) {
> +       arm_lpae_iopte pte;
> +
> +       if (data->iop.fmt == ARM_LPAE_S1) {
> +               pte = ARM_LPAE_PTE_AP_UNPRIV | ARM_LPAE_PTE_nG;
> +
> +               if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
> +                       pte |= ARM_LPAE_PTE_AP_RDONLY;
> +
> +               if (prot & IOMMU_CACHE)
> +                       pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> +                               << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> +       } else {
> +               pte = ARM_LPAE_PTE_HAP_FAULT;
> +               if (prot & IOMMU_READ)
> +                       pte |= ARM_LPAE_PTE_HAP_READ;
> +               if (prot & IOMMU_WRITE)
> +                       pte |= ARM_LPAE_PTE_HAP_WRITE;
> +               if (prot & IOMMU_CACHE)
> +                       pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +               else
> +                       pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +       }
> +
> +       if (prot & IOMMU_NOEXEC)
> +               pte |= ARM_LPAE_PTE_XN;
> +
> +       return pte;
> +}
> [[varun]] Do you plan to add a flag to indicate device memory? We had 
> a discussion about this on the patch submitted by me. Maybe you can
> include that as part of this patch.

That needs to go in as a separate patch. I think you should continue to push that separately!

> +static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
> +                                   unsigned long iova, size_t size,
> +                                   arm_lpae_iopte prot, int lvl,
> +                                   arm_lpae_iopte *ptep, size_t blk_size) {
> +       unsigned long blk_start, blk_end;
> +       phys_addr_t blk_paddr;
> +       arm_lpae_iopte table = 0;
> +       void *cookie = data->iop.cookie;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +
> +       blk_start = iova & ~(blk_size - 1);
> +       blk_end = blk_start + blk_size;
> +       blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
> +
> +       for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
> +               arm_lpae_iopte *tablep;
> +
> +               /* Unmap! */
> +               if (blk_start == iova)
> +                       continue;
> +
> +               /* __arm_lpae_map expects a pointer to the start of the table */
> +               tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, 
> + data);
> 
> 
> [[varun]] Not clear what's happening here. Maybe I am missing
> something, but where is the table allocated?

It is allocated in __arm_lpae_map, because the pte will be 0. The subtraction above is to avoid us having to allocate a whole level, just for a single invalid pte.
> 
> +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> +                                  tablep) < 0) {
> 
> 
> [[varun]] Again not clear how we are unmapping the range. The index at the
> current level should point to a page table (with contiguous block
> mappings). Unmap would be applied to the mappings at the next level.
> Unmap can happen anywhere in the contiguous range. It seems that you 
> are just creating a subset of the block mapping.

We will be unmapping a single entry at the next level, so we basically create a table, then map everything at the next level apart from the part we need to unmap.


[varun] OK, but you could potentially end up splitting the mapping down to the smallest possible page size, e.g. 4K. You don't seem to take into account the possibility of using the block size at the next level. For example, take a case where we have a huge page mapping using a 1G page size and we get an unmap request for 4K. We could still map the bulk of the range using 2M blocks at the next level; only the entry covering the 4K region to be unmapped would need to go down a further level.
 
> +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> +                           unsigned long iova, size_t size, int lvl,
> +                           arm_lpae_iopte *ptep) {
> +       arm_lpae_iopte pte;
> +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> +       void *cookie = data->iop.cookie;
> +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +
> +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +       pte = *ptep;
> +
> +       /* Something went horribly wrong and we ran out of page table */
> +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> +               return 0;
> +
> +       /* If the size matches this level, we're in the right place */
> +       if (size == blk_size) {
> +               *ptep = 0;
> +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> +
> +               if (!iopte_leaf(pte, lvl)) {
> +                       /* Also flush any partial walks */
> +                       tlb->tlb_add_flush(iova, size, false, cookie);
> +                       tlb->tlb_sync(data->iop.cookie);
> +                       ptep = iopte_deref(pte, data);
> +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> +               } else {
> +                       tlb->tlb_add_flush(iova, size, true, cookie);
> +               }
> +
> +               return size;
> +       } else if (iopte_leaf(pte, lvl)) {
> +               /*
> +                * Insert a table at the next level to map the old region,
> +                * minus the part we want to unmap
> +                */
> [[varun]]  The 'minus' part could be somewhere in the middle of the contiguous
> chunk? We should first break the entire block mapping into a next-level page
> mapping and then unmap a chunk.

The amount to unmap will match exactly one entry at the next level -- that's enforced by the IOMMU API (and it will also be aligned as such).

> +static struct arm_lpae_io_pgtable *
> +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> +       unsigned long va_bits;
> +       struct arm_lpae_io_pgtable *data;
> +
> +       arm_lpae_restrict_pgsizes(cfg);
> +
> +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> +               return NULL;
> +
> +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> +               return NULL;
> +
> +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return NULL;
> +
> +       data->pages_per_pgd = 1;
> +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> +       data->bits_per_level = data->pg_shift - 
> + ilog2(sizeof(arm_lpae_iopte));
> +
> +       va_bits = cfg->ias - data->pg_shift;
> +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> 
> [[varun]]  Not related to the patch, but this would be applicable to 
> the CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page
> tables, right? The ARM64 memory map shows the possibility of using 6 bits
> for the first level page table.

Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?

[varun]  My concern was with respect to the bits per level, which is uneven for the 64K page size. Just wondering how things would work with 64K pages when we do a 3-level page table lookup.

> +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> +                                                   void *cookie) {
> +       u64 reg, sl;
> +       size_t pgd_size;
> +       struct arm_lpae_io_pgtable *data = 
> +arm_lpae_alloc_pgtable(cfg);
> +
> +       if (!data)
> +               return NULL;
> +
> +       /*
> +        * Concatenate PGDs at level 1 if possible in order to reduce
> +        * the depth of the stage-2 walk.
> +        */
> +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> +               unsigned long pgd_bits, pgd_pages;
> +               unsigned long va_bits = cfg->ias - data->pg_shift;
> +
> +               pgd_bits = data->bits_per_level * (data->levels - 1);
> +               pgd_pages = 1 << (va_bits - pgd_bits);
> +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> +                       data->pages_per_pgd = pgd_pages;
> +                       data->levels--;
> +               }
> +       }
> +
> [[varun]] Can you point me to some documentation regarding stage 2 
> page concatenation. Not sure why this is required?

It's all in the ARM ARM. The idea is to reduce the depth of the stage-2 walk, since that can have an impact on performance when it gets too deep (remember that stage-1 table walks are themselves subjected to stage-2 translation).


[varun] Just curious, why not consider concatenation at stage 1?

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-11-27 11:51     ` Will Deacon
@ 2014-12-14 23:46         ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-14 23:46 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Robin.Murphy-5wv7dgnIgG8, Will Deacon,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w

Hi Will,

Please see below for another small comment.

On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> This patch introduces a generic framework for allocating page tables for
> an IOMMU. There are a number of reasons we want to do this:
> 
>   - It avoids duplication of complex table management code in IOMMU
>     drivers that use the same page table format
> 
>   - It removes any coupling with the CPU table format (and even the
>     architecture!)
> 
>   - It defines an API for IOMMU TLB maintenance
> 
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---
>  drivers/iommu/Kconfig      |  8 ++++++
>  drivers/iommu/Makefile     |  1 +
>  drivers/iommu/io-pgtable.c | 71 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 145 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h

[snip]

> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> new file mode 100644
> index 000000000000..5ae75d9cae50
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.h
> @@ -0,0 +1,65 @@
> +#ifndef __IO_PGTABLE_H
> +#define __IO_PGTABLE_H
> +
> +struct io_pgtable_ops {
> +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> +		   phys_addr_t paddr, size_t size, int prot);
> +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> +		     size_t size);
> +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> +				    unsigned long iova);
> +};
> +
> +struct iommu_gather_ops {
> +	/* Synchronously invalidate the entire TLB context */
> +	void (*tlb_flush_all)(void *cookie);
> +
> +	/* Queue up a TLB invalidation for a virtual address range */
> +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> +			      void *cookie);
> +	/* Ensure any queued TLB invalidation has taken effect */
> +	void (*tlb_sync)(void *cookie);
> +
> +	/* Ensure page tables updates are visible to the IOMMU */
> +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> +};
> +
> +struct io_pgtable_cfg {
> +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> +	unsigned long		pgsize_bitmap;
> +	unsigned int		ias;
> +	unsigned int		oas;
> +	struct iommu_gather_ops	*tlb;

Could you make this pointer const?

> +	/* Low-level data specific to the table format */
> +	union {
> +	};
> +};
> +
> +enum io_pgtable_fmt {
> +	IO_PGTABLE_NUM_FMTS,
> +};
> +
> +struct io_pgtable {
> +	enum io_pgtable_fmt	fmt;
> +	void			*cookie;
> +	struct io_pgtable_cfg	cfg;
> +	struct io_pgtable_ops	ops;
> +};
> +
> +struct io_pgtable_init_fns {
> +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> +	void (*free)(struct io_pgtable *iop);
> +};
> +
> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie);
> +
> +/*
> + * Free an io_pgtable_ops structure. The caller *must* ensure that the
> + * page table is no longer live, but the TLB can be dirty.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops);
> +
> +#endif /* __IO_PGTABLE_H */

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/4] iommu: introduce generic page table allocation framework
@ 2014-12-14 23:46         ` Laurent Pinchart
  0 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-14 23:46 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Please see below for another small comment.

On Thursday 27 November 2014 11:51:15 Will Deacon wrote:
> This patch introduces a generic framework for allocating page tables for
> an IOMMU. There are a number of reasons we want to do this:
> 
>   - It avoids duplication of complex table management code in IOMMU
>     drivers that use the same page table format
> 
>   - It removes any coupling with the CPU table format (and even the
>     architecture!)
> 
>   - It defines an API for IOMMU TLB maintenance
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  drivers/iommu/Kconfig      |  8 ++++++
>  drivers/iommu/Makefile     |  1 +
>  drivers/iommu/io-pgtable.c | 71 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 145 insertions(+)
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h

[snip]

> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> new file mode 100644
> index 000000000000..5ae75d9cae50
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable.h
> @@ -0,0 +1,65 @@
> +#ifndef __IO_PGTABLE_H
> +#define __IO_PGTABLE_H
> +
> +struct io_pgtable_ops {
> +	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> +		   phys_addr_t paddr, size_t size, int prot);
> +	int (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
> +		     size_t size);
> +	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
> +				    unsigned long iova);
> +};
> +
> +struct iommu_gather_ops {
> +	/* Synchronously invalidate the entire TLB context */
> +	void (*tlb_flush_all)(void *cookie);
> +
> +	/* Queue up a TLB invalidation for a virtual address range */
> +	void (*tlb_add_flush)(unsigned long iova, size_t size, bool leaf,
> +			      void *cookie);
> +	/* Ensure any queued TLB invalidation has taken effect */
> +	void (*tlb_sync)(void *cookie);
> +
> +	/* Ensure page tables updates are visible to the IOMMU */
> +	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> +};
> +
> +struct io_pgtable_cfg {
> +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> +	unsigned long		pgsize_bitmap;
> +	unsigned int		ias;
> +	unsigned int		oas;
> +	struct iommu_gather_ops	*tlb;

Could you make this pointer const?

> +	/* Low-level data specific to the table format */
> +	union {
> +	};
> +};
> +
> +enum io_pgtable_fmt {
> +	IO_PGTABLE_NUM_FMTS,
> +};
> +
> +struct io_pgtable {
> +	enum io_pgtable_fmt	fmt;
> +	void			*cookie;
> +	struct io_pgtable_cfg	cfg;
> +	struct io_pgtable_ops	ops;
> +};
> +
> +struct io_pgtable_init_fns {
> +	struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
> +	void (*free)(struct io_pgtable *iop);
> +};
> +
> +struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
> +					    struct io_pgtable_cfg *cfg,
> +					    void *cookie);
> +
> +/*
> + * Free an io_pgtable_ops structure. The caller *must* ensure that the
> + * page table is no longer live, but the TLB can be dirty.
> + */
> +void free_io_pgtable_ops(struct io_pgtable_ops *ops);
> +
> +#endif /* __IO_PGTABLE_H */

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-11-27 11:51 ` Will Deacon
@ 2014-12-14 23:49     ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-14 23:49 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Robin.Murphy-5wv7dgnIgG8, Will Deacon,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w

Hi Will,

On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> Hi all,
> 
> This series introduces a generic IOMMU page table allocation framework,
> implements support for ARM long-descriptors and then ports the arm-smmu
> driver over to the new code.
> 
> There are a few reasons for doing this:
> 
>   - Page table code is hard, and I don't enjoy shopping
> 
>   - A number of IOMMUs actually use the same table format, but currently
>     duplicate the code
> 
>   - It provides a CPU (and architecture) independent allocator, which
>     may be useful for some systems where the CPU is using a different
>     table format for its own mappings
> 
> As illustrated in the final patch, an IOMMU driver interacts with the
> allocator by passing in a configuration structure describing the
> input and output address ranges, the supported pages sizes and a set of
> ops for performing various TLB invalidation and PTE flushing routines.
> 
> The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> mappings, but I decided not to implement the contiguous bit in the
> interest of trying to keep the code semi-readable. This could always be
> added later, if needed.
> 
> I also included some self-tests for the LPAE implementation. Ideally
> we'd merge these, but I'm also happy to drop them if there are
> objections.
> 
> Tested with the self-tests, but also VFIO + MMU-500 at stage-1 and
> stage-2. Patches taken against my iommu/devel branch (queued by Joerg
> for 3.19).
> 
> All feedback welcome.

I've successfully tested the patch set with the Renesas IPMMU-VMSA driver with
the following extension to the allocator.

Tested-by: Laurent Pinchart <laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>

From 4bebb7f3a5a48541d4c89ce7c61e6ff66686c3a9 Mon Sep 17 00:00:00 2001
From: Laurent Pinchart <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
Date: Sun, 14 Dec 2014 23:34:50 +0200
Subject: [PATCH] iommu: io-pgtable-arm: Add Non-Secure quirk

The quirk causes the Non-Secure bit to be set in all page table entries.

Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++++
 drivers/iommu/io-pgtable.h     | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 669e322a83a4..b6910e142734 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -80,11 +80,13 @@
 #define ARM_LPAE_PTE_TYPE_TABLE		3
 #define ARM_LPAE_PTE_TYPE_PAGE		3
 
+#define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
 #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
@@ -201,6 +203,9 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 	if (iopte_leaf(*ptep, lvl))
 		return -EEXIST;
 
+	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
+		pte |= ARM_LPAE_PTE_NS;
+
 	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
 		pte |= ARM_LPAE_PTE_TYPE_PAGE;
 	else
@@ -244,6 +249,8 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
 						 cookie);
 		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
+		if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
+			pte |= ARM_LPAE_PTE_NSTABLE;
 		*ptep = pte;
 		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
 	} else {
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index c1cff3d045db..a41a15d30596 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -24,6 +24,9 @@ struct iommu_gather_ops {
 	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
 };
 
+/* Set the Non-Secure bit in the PTEs */
+#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)
+
 struct io_pgtable_cfg {
 	int			quirks; /* IO_PGTABLE_QUIRK_* */
 	unsigned long		pgsize_bitmap;
-- 
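
For context, enabling the quirk from a driver then just amounts to something
like the following before allocating the ops (sketch only; fmt, cfg and
cookie as in the framework's alloc_io_pgtable_ops() prototype):

	cfg.quirks |= IO_PGTABLE_QUIRK_NON_SECURE;
	ops = alloc_io_pgtable_ops(fmt, &cfg, cookie);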

> --->8
> 
> Will Deacon (4):
>   iommu: introduce generic page table allocation framework
>   iommu: add ARM LPAE page table allocator
>   iommu: add self-consistency tests to ARM LPAE IO page table allocator
>   iommu/arm-smmu: make use of generic LPAE allocator
> 
>  MAINTAINERS                    |   1 +
>  arch/arm64/Kconfig             |   1 -
>  drivers/iommu/Kconfig          |  32 +-
>  drivers/iommu/Makefile         |   2 +
>  drivers/iommu/arm-smmu.c       | 872 +++++++++++---------------------------
>  drivers/iommu/io-pgtable-arm.c | 925 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |  78 ++++
>  drivers/iommu/io-pgtable.h     |  77 ++++
>  8 files changed, 1361 insertions(+), 627 deletions(-)
>  create mode 100644 drivers/iommu/io-pgtable-arm.c
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h

-- 
Regards,

Laurent Pinchart

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 0/4] Generic IOMMU page table framework
@ 2014-12-14 23:49     ` Laurent Pinchart
  0 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-14 23:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> Hi all,
> 
> This series introduces a generic IOMMU page table allocation framework,
> implements support for ARM long-descriptors and then ports the arm-smmu
> driver over to the new code.
> 
> There are a few reasons for doing this:
> 
>   - Page table code is hard, and I don't enjoy shopping
> 
>   - A number of IOMMUs actually use the same table format, but currently
>     duplicate the code
> 
>   - It provides a CPU (and architecture) independent allocator, which
>     may be useful for some systems where the CPU is using a different
>     table format for its own mappings
> 
> As illustrated in the final patch, an IOMMU driver interacts with the
> allocator by passing in a configuration structure describing the
> input and output address ranges, the supported pages sizes and a set of
> ops for performing various TLB invalidation and PTE flushing routines.
> 
> The LPAE code implements support for 4k/2M/1G, 16k/32M and 64k/512M
> mappings, but I decided not to implement the contiguous bit in the
> interest of trying to keep the code semi-readable. This could always be
> added later, if needed.
> 
> I also included some self-tests for the LPAE implementation. Ideally
> we'd merge these, but I'm also happy to drop them if there are
> objections.
> 
> Tested with the self-tests, but also VFIO + MMU-500 at stage-1 and
> stage-2. Patches taken against my iommu/devel branch (queued by Joerg
> for 3.19).
> 
> All feedback welcome.

I've successfully tested the patch set with the Renesas IPMMU-VMSA driver with
the following extension to the allocator.

Tested-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>

From 4bebb7f3a5a48541d4c89ce7c61e6ff66686c3a9 Mon Sep 17 00:00:00 2001
From: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Date: Sun, 14 Dec 2014 23:34:50 +0200
Subject: [PATCH] iommu: io-pgtable-arm: Add Non-Secure quirk

The quirk causes the Non-Secure bit to be set in all page table entries.

Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++++
 drivers/iommu/io-pgtable.h     | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 669e322a83a4..b6910e142734 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -80,11 +80,13 @@
 #define ARM_LPAE_PTE_TYPE_TABLE		3
 #define ARM_LPAE_PTE_TYPE_PAGE		3
 
+#define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
 #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
@@ -201,6 +203,9 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 	if (iopte_leaf(*ptep, lvl))
 		return -EEXIST;
 
+	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
+		pte |= ARM_LPAE_PTE_NS;
+
 	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
 		pte |= ARM_LPAE_PTE_TYPE_PAGE;
 	else
@@ -244,6 +249,8 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
 						 cookie);
 		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
+		if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
+			pte |= ARM_LPAE_PTE_NSTABLE;
 		*ptep = pte;
 		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
 	} else {
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index c1cff3d045db..a41a15d30596 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -24,6 +24,9 @@ struct iommu_gather_ops {
 	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
 };
 
+/* Set the Non-Secure bit in the PTEs */
+#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)
+
 struct io_pgtable_cfg {
 	int			quirks; /* IO_PGTABLE_QUIRK_* */
 	unsigned long		pgsize_bitmap;
-- 

> --->8
> 
> Will Deacon (4):
>   iommu: introduce generic page table allocation framework
>   iommu: add ARM LPAE page table allocator
>   iommu: add self-consistency tests to ARM LPAE IO page table allocator
>   iommu/arm-smmu: make use of generic LPAE allocator
> 
>  MAINTAINERS                    |   1 +
>  arch/arm64/Kconfig             |   1 -
>  drivers/iommu/Kconfig          |  32 +-
>  drivers/iommu/Makefile         |   2 +
>  drivers/iommu/arm-smmu.c       | 872 +++++++++++---------------------------
>  drivers/iommu/io-pgtable-arm.c | 925 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |  78 ++++
>  drivers/iommu/io-pgtable.h     |  77 ++++
>  8 files changed, 1361 insertions(+), 627 deletions(-)
>  create mode 100644 drivers/iommu/io-pgtable-arm.c
>  create mode 100644 drivers/iommu/io-pgtable.c
>  create mode 100644 drivers/iommu/io-pgtable.h

-- 
Regards,

Laurent Pinchart

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/4] iommu: introduce generic page table allocation framework
  2014-12-14 23:46         ` Laurent Pinchart
@ 2014-12-15  9:45           ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15  9:45 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Robin Murphy, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Dec 14, 2014 at 11:46:30PM +0000, Laurent Pinchart wrote:
> > +struct io_pgtable_cfg {
> > +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> > +	unsigned long		pgsize_bitmap;
> > +	unsigned int		ias;
> > +	unsigned int		oas;
> > +	struct iommu_gather_ops	*tlb;
> 
> Could you make this pointer const ?

Sure. I'll post a v2 at -rc1 assuming I'm not fishing/drinking (I'm off
work from Wednesday this week).

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/4] iommu: introduce generic page table allocation framework
@ 2014-12-15  9:45           ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15  9:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 14, 2014 at 11:46:30PM +0000, Laurent Pinchart wrote:
> > +struct io_pgtable_cfg {
> > +	int			quirks; /* IO_PGTABLE_QUIRK_* */
> > +	unsigned long		pgsize_bitmap;
> > +	unsigned int		ias;
> > +	unsigned int		oas;
> > +	struct iommu_gather_ops	*tlb;
> 
> Could you make this pointer const ?

Sure. I'll post a v2 at -rc1 assuming I'm not fishing/drinking (I'm off
work from Wednesday this week).

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-14 17:45                 ` Varun Sethi
@ 2014-12-15 13:30                     ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 13:30 UTC (permalink / raw)
  To: Varun Sethi
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> Please find my response inline. Search for "varun".

This is getting fiddly now that you've already replied once. Any chance you
could sort your mail client out, please?

> > +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > +                            unsigned long iova, phys_addr_t paddr,
> > +                            arm_lpae_iopte prot, int lvl,
> > +                            arm_lpae_iopte *ptep) {
> > +       arm_lpae_iopte pte = prot;
> > +
> > +       /* We require an unmap first */
> > +       if (iopte_leaf(*ptep, lvl))
> > +               return -EEXIST;
> > [varun] Instead of returning an error, how about displaying a warning
> > and replacing the entry?
> 
> I'd be ok with displaying a warning, but I don't think we should just
> continue. It indicates a misuse of the IOMMU API and probably a missing
> TLBI.
> 
> [[varun]] May not apply now, but what if we are dealing with a case where
> memory is not pinned? It may be possible to hook up (without an unmap) an
> iova to a different physical address. Of course, TLB invalidation would be
> required. Could this scenario be relevant in the case of stall mode?

If we wanted to support that, then we'd need some new functions for grabbing
hold of the entry and manipulating it in a similar way to the CPU side (e.g.
pte_mkdirty etc).

> > +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> > +                                  tablep) < 0) {
> >
> >
> > [[varun]] Again, it's not clear how we are unmapping the range. The index
> > at the current level should point to a page table (with contiguous block
> > mappings). Unmap would be applied to the mappings at the next level.
> > Unmap can happen anywhere in the contiguous range. It seems that you
> > are just creating a subset of the block mapping.
> 
> We will be unmapping a single entry at the next level, so we basically
> create a table, then map everything at the next level apart from the part
> we need to unmap.
> 
> 
> [varun] ok, but you could potentially end up splitting the mapping down to
> the smallest possible page size, e.g. 4K. You don't seem to take into account
> the possibility of using the block size at the next level. For example,
> take a case where we have a huge page mapping using a 1G page size and we
> have an unmap request for 4K. We could still map the maximum part of the
> range using 2M pages at the next level. Only the entry where we need to unmap
> the 4K region would potentially go down to the next level.

Aha, I see what you mean here, thanks. I'll take a look...

> > +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > +                           unsigned long iova, size_t size, int lvl,
> > +                           arm_lpae_iopte *ptep) {
> > +       arm_lpae_iopte pte;
> > +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> > +       void *cookie = data->iop.cookie;
> > +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > +
> > +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> > +       pte = *ptep;
> > +
> > +       /* Something went horribly wrong and we ran out of page table */
> > +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> > +               return 0;
> > +
> > +       /* If the size matches this level, we're in the right place */
> > +       if (size == blk_size) {
> > +               *ptep = 0;
> > +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> > +
> > +               if (!iopte_leaf(pte, lvl)) {
> > +                       /* Also flush any partial walks */
> > +                       tlb->tlb_add_flush(iova, size, false, cookie);
> > +                       tlb->tlb_sync(data->iop.cookie);
> > +                       ptep = iopte_deref(pte, data);
> > +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> > +               } else {
> > +                       tlb->tlb_add_flush(iova, size, true, cookie);
> > +               }
> > +
> > +               return size;
> > +       } else if (iopte_leaf(pte, lvl)) {
> > +               /*
> > +                * Insert a table at the next level to map the old region,
> > +                * minus the part we want to unmap
> > +                */
> > [[varun]]  The "minus" could be somewhere in the middle of the contiguous
> > chunk? We should first break the entire block mapping into a next-level
> > page mapping and then unmap the chunk.
> 
> The amount to unmap will match exactly one entry at the next level --
> that's enforced by the IOMMU API (and it will also be aligned as such).
> 
> > +static struct arm_lpae_io_pgtable *
> > +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> > +       unsigned long va_bits;
> > +       struct arm_lpae_io_pgtable *data;
> > +
> > +       arm_lpae_restrict_pgsizes(cfg);
> > +
> > +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> > +               return NULL;
> > +
> > +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> > +               return NULL;
> > +
> > +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> > +               return NULL;
> > +
> > +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return NULL;
> > +
> > +       data->pages_per_pgd = 1;
> > +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> > +       data->bits_per_level = data->pg_shift -
> > + ilog2(sizeof(arm_lpae_iopte));
> > +
> > +       va_bits = cfg->ias - data->pg_shift;
> > +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> >
> > [[varun]]  Not related to the patch, but this would be applicable to
> > the CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page
> > tables, right? The ARM64 memory map shows the possibility of using 6 bits
> > for the first-level page table.
> 
> Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> 
> [varun]  My concern was with respect to the bits per level, which is
> uneven for the 64K page size. Just wondering how things would work with
> 64K pages when we do a 3-level page lookup.

Well, it's uneven (9) for the 4k case too. Do you actually see an issue
here?

48-bit VA with 64k pages gives us:

  va_bits = (48 - 16) = 32
  bits_per_level = (16 - 3) = 13
  levels = ceil(32/13) = 3

so the starting level is 1, which resolves 32-(13*2) = 6 bits.

Does that make sense?
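
Spelling the same arithmetic out for the 4k granule, for comparison:

  va_bits = (48 - 12) = 36
  bits_per_level = (12 - 3) = 9
  levels = ceil(36/9) = 4

so there the starting level is 0, resolving 36 - (9*3) = 9 bits; the "uneven"
top level is simply whatever is left over after the full levels below it.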

> > +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> > +                                                   void *cookie) {
> > +       u64 reg, sl;
> > +       size_t pgd_size;
> > +       struct arm_lpae_io_pgtable *data =
> > +arm_lpae_alloc_pgtable(cfg);
> > +
> > +       if (!data)
> > +               return NULL;
> > +
> > +       /*
> > +        * Concatenate PGDs at level 1 if possible in order to reduce
> > +        * the depth of the stage-2 walk.
> > +        */
> > +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> > +               unsigned long pgd_bits, pgd_pages;
> > +               unsigned long va_bits = cfg->ias - data->pg_shift;
> > +
> > +               pgd_bits = data->bits_per_level * (data->levels - 1);
> > +               pgd_pages = 1 << (va_bits - pgd_bits);
> > +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> > +                       data->pages_per_pgd = pgd_pages;
> > +                       data->levels--;
> > +               }
> > +       }
> > +
> > [[varun]] Can you point me to some documentation regarding stage-2
> > page table concatenation? I'm not sure why this is required.
> 
> It's all in the ARM ARM. The idea is to reduce the depth of the stage-2
> walk, since that can have an impact on performance when it gets too deep
> (remember that stage-1 table walks are themselves subjected to stage-2
> translation).
> 
> [varun] Just curious, why not consider concatenation at stage 1?

Well, the architecture doesn't support it. It's not as big a deal at
stage 1 anyway, because there is no nesting in that case (unlike stage 2,
which is applied to stage 1 table walks).
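
As a concrete example of the level-1 concatenation logic quoted above
(assuming ARM_LPAE_MAX_LEVELS is 4 and ARM_LPAE_S2_MAX_CONCAT_PAGES is 16):

  40-bit IPA, 4k granule:
    va_bits   = 40 - 12 = 28
    levels    = ceil(28/9) = 4  (== ARM_LPAE_MAX_LEVELS)
    pgd_bits  = 9 * (4 - 1) = 27
    pgd_pages = 1 << (28 - 27) = 2  (<= 16)

so two level-1 tables get concatenated into the pgd and the stage-2 walk
drops to three levels.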

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 13:30                     ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 13:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> Please find my response inline. Search for "varun".

This is getting fiddly now that you've already replied once. Any chance you
could sort your mail client out, please?

> > +static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > +                            unsigned long iova, phys_addr_t paddr,
> > +                            arm_lpae_iopte prot, int lvl,
> > +                            arm_lpae_iopte *ptep) {
> > +       arm_lpae_iopte pte = prot;
> > +
> > +       /* We require an unmap first */
> > +       if (iopte_leaf(*ptep, lvl))
> > +               return -EEXIST;
> > [varun] Instead of returning an error, how about displaying a warning
> > and replacing the entry?
> 
> I'd be ok with displaying a warning, but I don't think we should just
> continue. It indicates a misuse of the IOMMU API and probably a missing
> TLBI.
> 
> [[varun]] May not apply now, but what if we are dealing with a case where
> memory is not pinned? It may be possible to hook up (without an unmap) an
> iova to a different physical address. Of course, TLB invalidation would be
> required. Could this scenario be relevant in the case of stall mode?

If we wanted to support that, then we'd need some new functions for grabbing
hold of the entry and manipulating it in a similar way to the CPU side (e.g.
pte_mkdirty etc).

> > +               if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
> > +                                  tablep) < 0) {
> >
> >
> > [[varun]] Again, it's not clear how we are unmapping the range. The index
> > at the current level should point to a page table (with contiguous block
> > mappings). Unmap would be applied to the mappings at the next level.
> > Unmap can happen anywhere in the contiguous range. It seems that you
> > are just creating a subset of the block mapping.
> 
> We will be unmapping a single entry at the next level, so we basically
> create a table, then map everything at the next level apart from the part
> we need to unmap.
> 
> 
> [varun] ok, but you could potentially end up splitting the mapping down to
> the smallest possible page size, e.g. 4K. You don't seem to take into account
> the possibility of using the block size at the next level. For example,
> take a case where we have a huge page mapping using a 1G page size and we
> have an unmap request for 4K. We could still map the maximum part of the
> range using 2M pages at the next level. Only the entry where we need to unmap
> the 4K region would potentially go down to the next level.

Aha, I see what you mean here, thanks. I'll take a look...

> > +static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > +                           unsigned long iova, size_t size, int lvl,
> > +                           arm_lpae_iopte *ptep) {
> > +       arm_lpae_iopte pte;
> > +       struct iommu_gather_ops *tlb = data->iop.cfg.tlb;
> > +       void *cookie = data->iop.cookie;
> > +       size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > +
> > +       ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> > +       pte = *ptep;
> > +
> > +       /* Something went horribly wrong and we ran out of page table */
> > +       if (WARN_ON(!pte || (lvl == ARM_LPAE_MAX_LEVELS)))
> > +               return 0;
> > +
> > +       /* If the size matches this level, we're in the right place */
> > +       if (size == blk_size) {
> > +               *ptep = 0;
> > +               tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> > +
> > +               if (!iopte_leaf(pte, lvl)) {
> > +                       /* Also flush any partial walks */
> > +                       tlb->tlb_add_flush(iova, size, false, cookie);
> > +                       tlb->tlb_sync(data->iop.cookie);
> > +                       ptep = iopte_deref(pte, data);
> > +                       __arm_lpae_free_pgtable(data, lvl + 1, ptep);
> > +               } else {
> > +                       tlb->tlb_add_flush(iova, size, true, cookie);
> > +               }
> > +
> > +               return size;
> > +       } else if (iopte_leaf(pte, lvl)) {
> > +               /*
> > +                * Insert a table at the next level to map the old region,
> > +                * minus the part we want to unmap
> > +                */
> > [[varun]]  The "minus" could be somewhere in the middle of the contiguous
> > chunk? We should first break the entire block mapping into a next-level
> > page mapping and then unmap the chunk.
> 
> The amount to unmap will match exactly one entry at the next level --
> that's enforced by the IOMMU API (and it will also be aligned as such).
> 
> > +static struct arm_lpae_io_pgtable *
> > +arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg) {
> > +       unsigned long va_bits;
> > +       struct arm_lpae_io_pgtable *data;
> > +
> > +       arm_lpae_restrict_pgsizes(cfg);
> > +
> > +       if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
> > +               return NULL;
> > +
> > +       if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
> > +               return NULL;
> > +
> > +       if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
> > +               return NULL;
> > +
> > +       data = kmalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return NULL;
> > +
> > +       data->pages_per_pgd = 1;
> > +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> > +       data->bits_per_level = data->pg_shift -
> > + ilog2(sizeof(arm_lpae_iopte));
> > +
> > +       va_bits = cfg->ias - data->pg_shift;
> > +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> >
> > [[varun]]  Not related to the patch, but this would be applicable to
> > the CPU tables as well, i.e. we can't support a 48-bit VA with 64KB page
> > tables, right? The ARM64 memory map shows the possibility of using 6 bits
> > for the first-level page table.
> 
> Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> 
> [varun]  My concern was with respect to the bits per level, which is
> uneven for the 64K page size. Just wondering how things would work with
> 64K pages when we do a 3-level page lookup.

Well, it's uneven (9) for the 4k case too. Do you actually see an issue
here?

48-bit VA with 64k pages gives us:

  va_bits = (48 - 16) = 32
  bits_per_level = (16 - 3) = 13
  levels = ceil(32/13) = 3

so the starting level is 1, which resolves 32-(13*2) = 6 bits.

Does that make sense?

> > +static struct io_pgtable *arm_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg,
> > +                                                   void *cookie) {
> > +       u64 reg, sl;
> > +       size_t pgd_size;
> > +       struct arm_lpae_io_pgtable *data =
> > +arm_lpae_alloc_pgtable(cfg);
> > +
> > +       if (!data)
> > +               return NULL;
> > +
> > +       /*
> > +        * Concatenate PGDs at level 1 if possible in order to reduce
> > +        * the depth of the stage-2 walk.
> > +        */
> > +       if (data->levels == ARM_LPAE_MAX_LEVELS) {
> > +               unsigned long pgd_bits, pgd_pages;
> > +               unsigned long va_bits = cfg->ias - data->pg_shift;
> > +
> > +               pgd_bits = data->bits_per_level * (data->levels - 1);
> > +               pgd_pages = 1 << (va_bits - pgd_bits);
> > +               if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
> > +                       data->pages_per_pgd = pgd_pages;
> > +                       data->levels--;
> > +               }
> > +       }
> > +
> > [[varun]] Can you point me to some documentation regarding stage-2
> > page table concatenation? I'm not sure why this is required.
> 
> It's all in the ARM ARM. The idea is to reduce the depth of the stage-2
> walk, since that can have an impact on performance when it gets too deep
> (remember that stage-1 table walks are themselves subjected to stage-2
> translation).
> 
> [varun] Just curious, why not consider concatenation at stage 1?

Well, the architecture doesn't support it. It's not as big a deal at
stage 1 anyway, because there is no nesting in that case (unlike stage 2,
which is applied to stage 1 table walks).

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-15 13:30                     ` Will Deacon
@ 2014-12-15 15:43                         ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 15:43 UTC (permalink / raw)
  To: Varun Sethi
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > [varun] ok, but you could potentially end up splitting mapping to the
> > least possible page size e.g. 4K. You, don't seem to take in to account
> > the possibility of using the block size at the next level. For example,
> > take a case where we have a huge page mapping using 1G page size and we
> > have an unmap request for 4K. We could still split maximum part of the
> > mapping using 2M pages at the next level. The entry where we need to unmap
> > the 4K region would potentially go to the next level.
> 
> Aha, I see what you mean here, thanks. I'll take a look...

Scratch that, I think the code is fine as it is. For the case you highlight,
we iterate over the 1GB region remapping it using 4k pages, but skipping
the one we want to unmap, so I don't think there's a problem
(__arm_lpae_map will create the relevant table entries).
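
In other words, for a 4K hole punched in a 1G block the current code ends up
with roughly this layout (sketching it out, not taken from the patch):

  level 1: the old 1G block entry becomes a table entry
  level 2: 512 table entries (one per 2M), each pointing at a level-3 table
  level 3: 512 x 512 4K page entries, with the single unmapped PTE left clear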

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 15:43                         ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 15:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > [varun] ok, but you could potentially end up splitting mapping to the
> > least possible page size e.g. 4K. You, don't seem to take in to account
> > the possibility of using the block size at the next level. For example,
> > take a case where we have a huge page mapping using 1G page size and we
> > have an unmap request for 4K. We could still split maximum part of the
> > mapping using 2M pages at the next level. The entry where we need to unmap
> > the 4K region would potentially go to the next level.
> 
> Aha, I see what you mean here, thanks. I'll take a look...

Scratch that, I think the code is fine as it is. For the case you highlight,
we iterate over the 1GB region remapping it using 4k pages, but skipping
the one we want to unmap, so I don't think there's a problem
(__arm_lpae_map will create the relevant table entries).

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-14 23:49     ` Laurent Pinchart
@ 2014-12-15 16:10       ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 16:10 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Robin Murphy, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Sun, Dec 14, 2014 at 11:49:30PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hi Laurent,

> On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > This series introduces a generic IOMMU page table allocation framework,
> > implements support for ARM long-descriptors and then ports the arm-smmu
> > driver over to the new code.

[...]

> > All feedback welcome.
> 
> I've successfully tested the patch set with the Renesas IPMMU-VMSA driver with
> the following extension to the allocator.
> 
> Tested-by: Laurent Pinchart <laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>

Wahey, that's really cool, thanks! I have a few minor comments on your patch
below. If you don't object, then I can make them locally and include your
patch on top of my v2 series?

> From 4bebb7f3a5a48541d4c89ce7c61e6ff66686c3a9 Mon Sep 17 00:00:00 2001
> From: Laurent Pinchart <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
> Date: Sun, 14 Dec 2014 23:34:50 +0200
> Subject: [PATCH] iommu: io-pgtable-arm: Add Non-Secure quirk
> 
> The quirk causes the Non-Secure bit to be set in all page table entries.
> 
> Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
> ---
>  drivers/iommu/io-pgtable-arm.c | 7 +++++++
>  drivers/iommu/io-pgtable.h     | 3 +++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 669e322a83a4..b6910e142734 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -80,11 +80,13 @@
>  #define ARM_LPAE_PTE_TYPE_TABLE		3
>  #define ARM_LPAE_PTE_TYPE_PAGE		3
>  
> +#define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
>  #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
> +#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
>  #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
>  
>  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> @@ -201,6 +203,9 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  	if (iopte_leaf(*ptep, lvl))
>  		return -EEXIST;
>  
> +	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> +		pte |= ARM_LPAE_PTE_NS;
> +
>  	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
>  		pte |= ARM_LPAE_PTE_TYPE_PAGE;
>  	else
> @@ -244,6 +249,8 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>  		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
>  						 cookie);
>  		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
> +		if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> +			pte |= ARM_LPAE_PTE_NSTABLE;
>  		*ptep = pte;
>  		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
>  	} else {
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index c1cff3d045db..a41a15d30596 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -24,6 +24,9 @@ struct iommu_gather_ops {
>  	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
>  };
>  
> +/* Set the Non-Secure bit in the PTEs */
> +#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)

I think I'd stick an _ARM_ somewhere in here, so maybe
IO_PGTABLE_QUIRK_ARM_NS?

> +
>  struct io_pgtable_cfg {

and I'd put the #define here, next to the member.

>  	int			quirks; /* IO_PGTABLE_QUIRK_* */
>  	unsigned long		pgsize_bitmap;
> -- 
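
i.e. something along these lines (name and placement as suggested above,
otherwise unchanged):

	struct io_pgtable_cfg {
		#define IO_PGTABLE_QUIRK_ARM_NS	(1 << 0) /* Set NS bit in PTEs */
		int			quirks;
		unsigned long		pgsize_bitmap;
		...
	};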

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 0/4] Generic IOMMU page table framework
@ 2014-12-15 16:10       ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 16:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 14, 2014 at 11:49:30PM +0000, Laurent Pinchart wrote:
> Hi Will,

Hi Laurent,

> On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > This series introduces a generic IOMMU page table allocation framework,
> > implements support for ARM long-descriptors and then ports the arm-smmu
> > driver over to the new code.

[...]

> > All feedback welcome.
> 
> I've successfully tested the patch set with the Renesas IPMMU-VMSA driver with
> the following extension to the allocator.
> 
> Tested-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>

Wahey, that's really cool, thanks! I have a few minor comments on your patch
below. If you don't object, then I can make them locally and include your
patch on top of my v2 series?

> From 4bebb7f3a5a48541d4c89ce7c61e6ff66686c3a9 Mon Sep 17 00:00:00 2001
> From: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
> Date: Sun, 14 Dec 2014 23:34:50 +0200
> Subject: [PATCH] iommu: io-pgtable-arm: Add Non-Secure quirk
> 
> The quirk causes the Non-Secure bit to be set in all page table entries.
> 
> Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
> ---
>  drivers/iommu/io-pgtable-arm.c | 7 +++++++
>  drivers/iommu/io-pgtable.h     | 3 +++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 669e322a83a4..b6910e142734 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -80,11 +80,13 @@
>  #define ARM_LPAE_PTE_TYPE_TABLE		3
>  #define ARM_LPAE_PTE_TYPE_PAGE		3
>  
> +#define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
>  #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
> +#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
>  #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
>  
>  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> @@ -201,6 +203,9 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  	if (iopte_leaf(*ptep, lvl))
>  		return -EEXIST;
>  
> +	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> +		pte |= ARM_LPAE_PTE_NS;
> +
>  	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
>  		pte |= ARM_LPAE_PTE_TYPE_PAGE;
>  	else
> @@ -244,6 +249,8 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>  		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
>  						 cookie);
>  		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
> +		if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> +			pte |= ARM_LPAE_PTE_NSTABLE;
>  		*ptep = pte;
>  		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
>  	} else {
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index c1cff3d045db..a41a15d30596 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -24,6 +24,9 @@ struct iommu_gather_ops {
>  	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
>  };
>  
> +/* Set the Non-Secure bit in the PTEs */
> +#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)

I think I'd stick an _ARM_ somewhere in here, so maybe
IO_PGTABLE_QUIRK_ARM_NS?

> +
>  struct io_pgtable_cfg {

and I'd put the #define here, next to the member.

>  	int			quirks; /* IO_PGTABLE_QUIRK_* */
>  	unsigned long		pgsize_bitmap;
> -- 

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-15 15:43                         ` Will Deacon
@ 2014-12-15 16:35                           ` Varun Sethi
  -1 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-15 16:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: lauraa, mitchelh, joro, iommu, laurent.pinchart, prem.mallappa,
	Robin Murphy, linux-arm-kernel, m.szyprowski

Hi Will,

-----Original Message-----
From: Will Deacon [mailto:will.deacon@arm.com] 
Sent: Monday, December 15, 2014 9:13 PM
To: Sethi Varun-B16395
Cc: linux-arm-kernel@lists.infradead.org; iommu@lists.linux-foundation.org; prem.mallappa@broadcom.com; Robin Murphy; lauraa@codeaurora.org; mitchelh@codeaurora.org; laurent.pinchart@ideasonboard.com; joro@8bytes.org; m.szyprowski@samsung.com
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > [varun] ok, but you could potentially end up splitting mapping to 
> > the least possible page size e.g. 4K. You, don't seem to take in to 
> > account the possibility of using the block size at the next level. 
> > For example, take a case where we have a huge page mapping using 1G 
> > page size and we have an unmap request for 4K. We could still split 
> > maximum part of the mapping using 2M pages at the next level. The 
> > entry where we need to unmap the 4K region would potentially go to the next level.
> 
> Aha, I see what you mean here, thanks. I'll take a look...

Scratch that, I think the code is fine as it is. For the case you highlight, we iterate over the 1GB region remapping it using 4k pages, but skipping the one we want to unmap, so I don't think there's a problem (__arm_lpae_map will create the relevant table entries).


[[varun]]  But you could split the 1G into 2M mappings and then split only the 2M region containing the unmapped page using 4K pages. As it stands, you split up the entire region using 4K pages.

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 16:35                           ` Varun Sethi
  0 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-15 16:35 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

-----Original Message-----
From: Will Deacon [mailto:will.deacon at arm.com] 
Sent: Monday, December 15, 2014 9:13 PM
To: Sethi Varun-B16395
Cc: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-foundation.org; prem.mallappa at broadcom.com; Robin Murphy; lauraa at codeaurora.org; mitchelh at codeaurora.org; laurent.pinchart at ideasonboard.com; joro at 8bytes.org; m.szyprowski at samsung.com
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > [varun] ok, but you could potentially end up splitting mapping to 
> > the least possible page size e.g. 4K. You, don't seem to take in to 
> > account the possibility of using the block size at the next level. 
> > For example, take a case where we have a huge page mapping using 1G 
> > page size and we have an unmap request for 4K. We could still split 
> > maximum part of the mapping using 2M pages at the next level. The 
> > entry where we need to unmap the 4K region would potentially go to the next level.
> 
> Aha, I see what you mean here, thanks. I'll take a look...

Scratch that, I think the code is fine as it is. For the case you highlight, we iterate over the 1GB region remapping it using 4k pages, but skipping the one we want to unmap, so I don't think there's a problem (__arm_lpae_map will create the relevant table entries).


[[varun]]  But you could split the 1G into 2M mappings and then split only the 2M region containing the unmapped page using 4K pages. As it stands, you split up the entire region using 4K pages.

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-15 13:30                     ` Will Deacon
@ 2014-12-15 16:43                         ` Varun Sethi
  -1 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-15 16:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



-----Original Message-----
From: Will Deacon [mailto:will.deacon-5wv7dgnIgG8@public.gmane.org] 
Sent: Monday, December 15, 2014 7:00 PM
To: Sethi Varun-B16395
Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; prem.mallappa-dY08KVG/lbpWk0Htik3J/w@public.gmane.org; Robin Murphy; lauraa-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; mitchelh-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org; joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; m.szyprowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> Please find my response inline. Search for "varun".

This is getting fiddly now that you've already replied once. Any chance you could sort your mail client out, please?


[[varun]]  Yes, I need to do that; this is painful.

> > +       if (!data)
> > +               return NULL;
> > +
> > +       data->pages_per_pgd = 1;
> > +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> > +       data->bits_per_level = data->pg_shift - 
> > + ilog2(sizeof(arm_lpae_iopte));
> > +
> > +       va_bits = cfg->ias - data->pg_shift;
> > +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> >
> > [[varun]]  Not related to the patch, but this would be applicable to 
> > the CPU tables as well i.e, we can't support 48bit VA with 64 KB 
> > page tables, right? The AR64 memory maps shows possibility of using 
> > 6 bits for the first level page table.
> 
> Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> 
> [varun]  My concern was with respect to the bits per level, which is 
> uneven for the 64 K page sizes. Just wondering how would things work 
> with 64K pages when we do a 3 level page lookup.

Well, it's uneven (9) for the 4k case too. Do you actually see an issue here?

48-bit VA with 64k pages gives us:

  va_bits = (48 - 16) = 32
  bits_per_level = (16 - 3) = 13
  levels = ceil(32/13) = 3

so the starting level is 1, which resolves 32-(13*2) = 6 bits.

Does that make sense?

[[varun]] Yes, but what I meant is that in the case of 4K pages you have 9 bits per level, whereas for 64K pages you have 6 bits for the first level and 13 each for the second and third. So a single bits-per-level value would not work in the case of 64K pages?

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 16:43                         ` Varun Sethi
  0 siblings, 0 replies; 76+ messages in thread
From: Varun Sethi @ 2014-12-15 16:43 UTC (permalink / raw)
  To: linux-arm-kernel



-----Original Message-----
From: Will Deacon [mailto:will.deacon at arm.com] 
Sent: Monday, December 15, 2014 7:00 PM
To: Sethi Varun-B16395
Cc: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-foundation.org; prem.mallappa at broadcom.com; Robin Murphy; lauraa at codeaurora.org; mitchelh at codeaurora.org; laurent.pinchart at ideasonboard.com; joro at 8bytes.org; m.szyprowski at samsung.com
Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator

On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> Please find my response inline. Search for "varun".

This is getting fiddly now that you've already replied once. Any chance you could sort your mail client out, please?


[[varun]]  Yes, I need to do that; this is painful.

> > +       if (!data)
> > +               return NULL;
> > +
> > +       data->pages_per_pgd = 1;
> > +       data->pg_shift = __ffs(cfg->pgsize_bitmap);
> > +       data->bits_per_level = data->pg_shift - 
> > + ilog2(sizeof(arm_lpae_iopte));
> > +
> > +       va_bits = cfg->ias - data->pg_shift;
> > +       data->levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
> >
> > [[varun]]  Not related to the patch, but this would be applicable to 
> > the CPU tables as well i.e, we can't support 48bit VA with 64 KB 
> > page tables, right? The AR64 memory maps shows possibility of using 
> > 6 bits for the first level page table.
> 
> Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> 
> [varun]  My concern was with respect to the bits per level, which is 
> uneven for the 64 K page sizes. Just wondering how would things work 
> with 64K pages when we do a 3 level page lookup.

Well, it's uneven (9) for the 4k case too. Do you actually see an issue here?

48-bit VA with 64k pages gives us:

  va_bits = (48 - 16) = 32
  bits_per_level = (16 - 3) = 13
  levels = ceil(32/13) = 3

so the starting level is 1, which resolves 32-(13*2) = 6 bits.

Does that make sense?

[[varun]] Yes, but what I meant is that in the case of 4K pages you have 9 bits per level, whereas for 64K pages you have 6 bits for the first level and 13 each for the second and third. So a single bits-per-level value would not work in the case of 64K pages?

Thanks,
Varun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-15 16:43                         ` Varun Sethi
@ 2014-12-15 17:20                             ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 17:20 UTC (permalink / raw)
  To: Varun Sethi
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 15, 2014 at 04:43:59PM +0000, Varun Sethi wrote:
> > Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> > 
> > [varun]  My concern was with respect to the bits per level, which is 
> > uneven for the 64 K page sizes. Just wondering how would things work 
> > with 64K pages when we do a 3 level page lookup.
> 
> Well, it's uneven (9) for the 4k case too. Do you actually see an issue here?
> 
> 48-bit VA with 64k pages gives us:
> 
>   va_bits = (48 - 16) = 32
>   bits_per_level = (16 - 3) = 13
>   levels = ceil(32/13) = 3
> 
> so the starting level is 1, which resolves 32-(13*2) = 6 bits.
> 
> Does that make sense?
> 
> [[varun]]Yes, but what I meant was, is that in case of 4K pages you have 9
> bits per level, but for 64K pages you have 6 bits for the first  level and
> 13 each for second and third. So, bits per level would not work in case of
> 64 K pages?

The current code takes this into account.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 17:20                             ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 17:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 15, 2014 at 04:43:59PM +0000, Varun Sethi wrote:
> > Sure we can support 48-bit VAs with 64k pages. Why do you think we can't?
> > 
> > [varun]  My concern was with respect to the bits per level, which is 
> > uneven for the 64 K page sizes. Just wondering how would things work 
> > with 64K pages when we do a 3 level page lookup.
> 
> Well, it's uneven (9) for the 4k case too. Do you actually see an issue here?
> 
> 48-bit VA with 64k pages gives us:
> 
>   va_bits = (48 - 16) = 32
>   bits_per_level = (16 - 3) = 13
>   levels = ceil(32/13) = 3
> 
> so the starting level is 1, which resolves 32-(13*2) = 6 bits.
> 
> Does that make sense?
> 
> [[varun]]Yes, but what I meant was, is that in case of 4K pages you have 9
> bits per level, but for 64K pages you have 6 bits for the first  level and
> 13 each for second and third. So, bits per level would not work in case of
> 64 K pages?

The current code takes this into account.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
  2014-12-15 16:35                           ` Varun Sethi
@ 2014-12-15 17:25                               ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 17:25 UTC (permalink / raw)
  To: Varun Sethi
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w, Robin Murphy,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 15, 2014 at 04:35:12PM +0000, Varun Sethi wrote:
> -----Original Message-----
> From: Will Deacon [mailto:will.deacon-5wv7dgnIgG8@public.gmane.org] 
> Sent: Monday, December 15, 2014 9:13 PM
> To: Sethi Varun-B16395
> Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; prem.mallappa-dY08KVG/lbpWk0Htik3J/w@public.gmane.org; Robin Murphy; lauraa-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; mitchelh-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org; laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org; joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; m.szyprowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org
> Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
> 
> On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> > On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > > [varun] ok, but you could potentially end up splitting mapping to 
> > > the least possible page size e.g. 4K. You, don't seem to take in to 
> > > account the possibility of using the block size at the next level. 
> > > For example, take a case where we have a huge page mapping using 1G 
> > > page size and we have an unmap request for 4K. We could still split 
> > > maximum part of the mapping using 2M pages at the next level. The 
> > > entry where we need to unmap the 4K region would potentially go to the next level.
> > 
> > Aha, I see what you mean here, thanks. I'll take a look...
> 
> Scratch that, I think the code is fine as it is. For the case you
> highlight, we iterate over the 1GB region remapping it using 4k pages, but
> skipping the one we want to unmap, so I don't think there's a problem
> (__arm_lpae_map will create the relevant table entries).
> 
> 
> [[varun]]  But you can split 1G in 2M mappings and then split up the
> unmapped region using 4K pages. In this case you split up the entire
> region using 4K pages.

True, I miss an optimisation opportunity there, but I don't know that it's
common enough to care (in the same way that we don't recreate a 1G mapping
if you remapped that 4k page back like it was).

You could add this by making arm_lpae_split_blk_unmap recursive, if you
wanted to.
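
For comparison, the recursive variant being suggested would end up with
something like (again, just a sketch of the idea):

  level 1: the old 1G block entry becomes a table entry
  level 2: 511 x 2M block entries plus one table entry
  level 3: 511 x 4K page entries plus the one invalid (unmapped) entry

i.e. two new table pages instead of 513, at the cost of a more involved
split path.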

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 2/4] iommu: add ARM LPAE page table allocator
@ 2014-12-15 17:25                               ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 17:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 15, 2014 at 04:35:12PM +0000, Varun Sethi wrote:
> -----Original Message-----
> From: Will Deacon [mailto:will.deacon at arm.com] 
> Sent: Monday, December 15, 2014 9:13 PM
> To: Sethi Varun-B16395
> Cc: linux-arm-kernel at lists.infradead.org; iommu at lists.linux-foundation.org; prem.mallappa at broadcom.com; Robin Murphy; lauraa at codeaurora.org; mitchelh at codeaurora.org; laurent.pinchart at ideasonboard.com; joro at 8bytes.org; m.szyprowski at samsung.com
> Subject: Re: [PATCH 2/4] iommu: add ARM LPAE page table allocator
> 
> On Mon, Dec 15, 2014 at 01:30:20PM +0000, Will Deacon wrote:
> > On Sun, Dec 14, 2014 at 05:45:49PM +0000, Varun Sethi wrote:
> > > [varun] ok, but you could potentially end up splitting mapping to 
> > > the least possible page size e.g. 4K. You, don't seem to take in to 
> > > account the possibility of using the block size at the next level. 
> > > For example, take a case where we have a huge page mapping using 1G 
> > > page size and we have an unmap request for 4K. We could still split 
> > > maximum part of the mapping using 2M pages at the next level. The 
> > > entry where we need to unmap the 4K region would potentially go to the next level.
> > 
> > Aha, I see what you mean here, thanks. I'll take a look...
> 
> Scratch that, I think the code is fine as it is. For the case you
> highlight, we iterate over the 1GB region remapping it using 4k pages, but
> skipping the one we want to unmap, so I don't think there's a problem
> (__arm_lpae_map will create the relevant table entries).
> 
> 
> [[varun]]  But you can split 1G in 2M mappings and then split up the
> unmapped region using 4K pages. In this case you split up the entire
> region using 4K pages.

True, I miss an optimisation opportunity there, but I don't know that it's
common enough to care (in the same way that we don't recreate a 1G mapping
if you remapped that 4k page back like it was).

You could add this by making arm_lpae_split_blk_unmap recursive, if you
wanted to.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-15 16:10       ` Will Deacon
@ 2014-12-15 17:33           ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-15 17:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Monday 15 December 2014 16:10:52 Will Deacon wrote:
> On Sun, Dec 14, 2014 at 11:49:30PM +0000, Laurent Pinchart wrote:
> > On Thursday 27 November 2014 11:51:14 Will Deacon wrote:
> > > This series introduces a generic IOMMU page table allocation framework,
> > > implements support for ARM long-descriptors and then ports the arm-smmu
> > > driver over to the new code.
> 
> [...]
> 
> > > All feedback welcome.
> > 
> > I've successfully tested the patch set with the Renesas IPMMU-VMSA driver
> > with the following extension to the allocator.
> > 
> > Tested-by: Laurent Pinchart <laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
> 
> Wahey, that's really cool, thanks! I have a few minor comments on your patch
> below. If you don't object, then I can make them locally and include your
> patch on top of my v2 series?

Sure. Please see my reply below.

> > From 4bebb7f3a5a48541d4c89ce7c61e6ff66686c3a9 Mon Sep 17 00:00:00 2001
> > From: Laurent Pinchart <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
> > Date: Sun, 14 Dec 2014 23:34:50 +0200
> > Subject: [PATCH] iommu: io-pgtable-arm: Add Non-Secure quirk
> > 
> > The quirk causes the Non-Secure bit to be set in all page table entries.
> > 
> > Signed-off-by: Laurent Pinchart
> > <laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org>
> > ---
> > 
> >  drivers/iommu/io-pgtable-arm.c | 7 +++++++
> >  drivers/iommu/io-pgtable.h     | 3 +++
> >  2 files changed, 10 insertions(+)
> > 
> > diff --git a/drivers/iommu/io-pgtable-arm.c
> > b/drivers/iommu/io-pgtable-arm.c index 669e322a83a4..b6910e142734 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -80,11 +80,13 @@
> > 
> >  #define ARM_LPAE_PTE_TYPE_TABLE		3
> >  #define ARM_LPAE_PTE_TYPE_PAGE		3
> > +#define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
> >  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> >  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
> >  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
> >  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> >  #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
> > +#define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
> >  #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
> >  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> > 
> > @@ -201,6 +203,9 @@ static int arm_lpae_init_pte(struct
> > arm_lpae_io_pgtable *data,> 
> >  	if (iopte_leaf(*ptep, lvl))
> >  		return -EEXIST;
> > 
> > +	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> > +		pte |= ARM_LPAE_PTE_NS;
> > +
> >  	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> >  		pte |= ARM_LPAE_PTE_TYPE_PAGE;
> >  	else
> > @@ -244,6 +249,8 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable
> > *data, unsigned long iova,> 
> >  		data->iop.cfg.tlb->flush_pgtable(cptep, 1UL << data->pg_shift,
> >  						 cookie);
> >  		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
> > +		if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_SECURE)
> > +			pte |= ARM_LPAE_PTE_NSTABLE;
> >  		*ptep = pte;
> >  		data->iop.cfg.tlb->flush_pgtable(ptep, sizeof(*ptep), cookie);
> >  	} else {
> > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > index c1cff3d045db..a41a15d30596 100644
> > --- a/drivers/iommu/io-pgtable.h
> > +++ b/drivers/iommu/io-pgtable.h
> > @@ -24,6 +24,9 @@ struct iommu_gather_ops {
> >  	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> >  };
> > 
> > +/* Set the Non-Secure bit in the PTEs */
> > +#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)
> 
> I think I'd stick an _ARM_ somewhere in here, so maybe
> IO_PGTABLE_QUIRK_ARM_NS?

I'm fine with that.

By the way, I'm only familiar with the Renesas implementation of the VMSA
IOMMU, so could you double-check whether setting the NSTABLE and NS bits at
all levels makes sense to you? It seems to be required by my hardware, even
though the ARM spec mentions that setting the NSTABLE bit causes non-secure
accesses to the page tables at all lower levels regardless of their
NSTABLE/NS bits.

> > +
> > 
> >  struct io_pgtable_cfg {
> 
> and I'd put the #define here, next to the member.

They're right before the structure, so I don't think they're too far away,
but if you prefer that coding style, that's fine with me.

> >  	int			quirks; /* IO_PGTABLE_QUIRK_* */
> >  	unsigned long		pgsize_bitmap;
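
For what it's worth, this is roughly what the call site ends up looking like
on my side with the renamed quirk. It's only a sketch off the top of my head:
alloc_io_pgtable_ops(), ARM_32_LPAE_S1 and the ias/oas values may not match
your v2 exactly; only .quirks, .pgsize_bitmap and .tlb are visible in the
hunks above.

#include <linux/sizes.h>

#include "io-pgtable.h"

/* The driver's TLB sync/flush callbacks, defined elsewhere. */
extern const struct iommu_gather_ops ipmmu_gather_ops;

static struct io_pgtable_ops *ipmmu_alloc_pgtable(void *cookie)
{
	struct io_pgtable_cfg cfg = {
		.quirks		= IO_PGTABLE_QUIRK_ARM_NS,	/* set NS/NSTABLE in PTEs */
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 32,	/* input (IOVA) address bits */
		.oas		= 40,	/* output (PA) address bits */
		.tlb		= &ipmmu_gather_ops,
	};

	return alloc_io_pgtable_ops(ARM_32_LPAE_S1, &cfg, cookie);
}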

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-15 17:33           ` Laurent Pinchart
@ 2014-12-15 17:39             ` Will Deacon
  -1 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2014-12-15 17:39 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Robin Murphy, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 15, 2014 at 05:33:32PM +0000, Laurent Pinchart wrote:
> On Monday 15 December 2014 16:10:52 Will Deacon wrote:
> > On Sun, Dec 14, 2014 at 11:49:30PM +0000, Laurent Pinchart wrote:
> > > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > > index c1cff3d045db..a41a15d30596 100644
> > > --- a/drivers/iommu/io-pgtable.h
> > > +++ b/drivers/iommu/io-pgtable.h
> > > @@ -24,6 +24,9 @@ struct iommu_gather_ops {
> > >  	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> > >  };
> > > 
> > > +/* Set the Non-Secure bit in the PTEs */
> > > +#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)
> > 
> > I think I'd stick an _ARM_ somewhere in here, so maybe
> > IO_PGTABLE_QUIRK_ARM_NS?
> 
> I'm fine with that.
> 
> By the way, I'm only familiar with the Renesas implementation of the VMSA
> IOMMU, so could you double-check whether setting the NSTABLE and NS bits at
> all levels makes sense to you? It seems to be required by my hardware, even
> though the ARM spec mentions that setting the NSTABLE bit causes non-secure
> accesses to the page tables at all lower levels regardless of their
> NSTABLE/NS bits.

The ARM ARM is very clear that subsequent levels of lookup must ignore the
NSTABLE/NS bits since otherwise you potentially have a security violation
where you can use the table walker to access secure memory from
non-secure...

So, you might want to check up on that, but given that this is a quirk I'm
happy for it to do whatever you need.

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/4] Generic IOMMU page table framework
  2014-12-15 17:39             ` Will Deacon
@ 2014-12-15 17:46                 ` Laurent Pinchart
  -1 siblings, 0 replies; 76+ messages in thread
From: Laurent Pinchart @ 2014-12-15 17:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Varun.Sethi-KZfg59tc24xl57MIdRCFDg,
	prem.mallappa-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Monday 15 December 2014 17:39:11 Will Deacon wrote:
> On Mon, Dec 15, 2014 at 05:33:32PM +0000, Laurent Pinchart wrote:
> > On Monday 15 December 2014 16:10:52 Will Deacon wrote:
> > > On Sun, Dec 14, 2014 at 11:49:30PM +0000, Laurent Pinchart wrote:
> > > > diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> > > > index c1cff3d045db..a41a15d30596 100644
> > > > --- a/drivers/iommu/io-pgtable.h
> > > > +++ b/drivers/iommu/io-pgtable.h
> > > > @@ -24,6 +24,9 @@ struct iommu_gather_ops {
> > > > 
> > > >  	void (*flush_pgtable)(void *ptr, size_t size, void *cookie);
> > > >  
> > > >  };
> > > > 
> > > > +/* Set the Non-Secure bit in the PTEs */
> > > > +#define IO_PGTABLE_QUIRK_NON_SECURE	(1 << 0)
> > > 
> > > I think I'd stick an _ARM_ somewhere in here, so maybe
> > > IO_PGTABLE_QUIRK_ARM_NS?
> > 
> > I'm fine with that.
> > 
> > By the way, I'm only familiar with the Renesas implementation of the VMSA
> > IOMMU, so could you double-check whether setting the NSTABLE and NS bits
> > at all levels makes sense to you? It seems to be required by my hardware,
> > even though the ARM spec mentions that setting the NSTABLE bit causes
> > non-secure accesses to the page tables at all lower levels regardless of
> > their NSTABLE/NS bits.
>
> The ARM ARM is very clear that subsequent levels of lookup must ignore the
> NSTABLE/NS bits since otherwise you potentially have a security violation
> where you can use the table walker to access secure memory from
> non-secure...
>
> So, you might want to check up on that, but given that this is a quirk I'm
> happy for it to do whatever you need.

Given that the ARM ARM states that subsequent levels must treat the
NSTABLE/NS bits as set, I think it's harmless to actually set them. I just
wanted to double-check with you; as we agree, let's proceed with the proposed
patch.
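
Put another way, my (simplified) reading of the rule for a walk that starts
out secure is just an OR, so setting NS redundantly below an NSTABLE entry
cannot change the outcome. A two-line model, not a quote from the spec:

#include <stdbool.h>
#include <stdio.h>

/* Once any table descriptor on the path has NSTABLE set, the access is
 * non-secure regardless of the leaf NS bit. */
static bool access_is_nonsecure(bool nstable_above, bool leaf_ns)
{
	return nstable_above || leaf_ns;
}

int main(void)
{
	printf("NSTABLE=1, NS=0 -> non-secure: %d\n", access_is_nonsecure(true, false));
	printf("NSTABLE=1, NS=1 -> non-secure: %d\n", access_is_nonsecure(true, true));
	return 0;
}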

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2014-12-15 17:46 UTC | newest]

Thread overview: 76+ messages
2014-11-27 11:51 [PATCH 0/4] Generic IOMMU page table framework Will Deacon
2014-11-27 11:51 ` Will Deacon
     [not found] ` <1417089078-22900-1-git-send-email-will.deacon-5wv7dgnIgG8@public.gmane.org>
2014-11-27 11:51   ` [PATCH 1/4] iommu: introduce generic page table allocation framework Will Deacon
2014-11-27 11:51     ` Will Deacon
     [not found]     ` <1417089078-22900-2-git-send-email-will.deacon-5wv7dgnIgG8@public.gmane.org>
2014-11-30 22:00       ` Laurent Pinchart
2014-11-30 22:00         ` Laurent Pinchart
2014-12-01 12:13         ` Will Deacon
2014-12-01 12:13           ` Will Deacon
     [not found]           ` <20141201121338.GD18466-5wv7dgnIgG8@public.gmane.org>
2014-12-01 13:33             ` Laurent Pinchart
2014-12-01 13:33               ` Laurent Pinchart
2014-12-01 13:53               ` Will Deacon
2014-12-01 13:53                 ` Will Deacon
2014-12-14 23:46       ` Laurent Pinchart
2014-12-14 23:46         ` Laurent Pinchart
2014-12-15  9:45         ` Will Deacon
2014-12-15  9:45           ` Will Deacon
2014-11-27 11:51   ` [PATCH 2/4] iommu: add ARM LPAE page table allocator Will Deacon
2014-11-27 11:51     ` Will Deacon
     [not found]     ` <1417089078-22900-3-git-send-email-will.deacon-5wv7dgnIgG8@public.gmane.org>
2014-11-30 23:29       ` Laurent Pinchart
2014-11-30 23:29         ` Laurent Pinchart
2014-12-01 17:23         ` Will Deacon
2014-12-01 17:23           ` Will Deacon
     [not found]           ` <20141201172315.GI18466-5wv7dgnIgG8@public.gmane.org>
2014-12-01 20:21             ` Laurent Pinchart
2014-12-01 20:21               ` Laurent Pinchart
2014-12-02  9:41               ` Will Deacon
2014-12-02  9:41                 ` Will Deacon
     [not found]                 ` <20141202094156.GB9917-5wv7dgnIgG8@public.gmane.org>
2014-12-02 11:47                   ` Laurent Pinchart
2014-12-02 11:47                     ` Laurent Pinchart
2014-12-05 18:48                     ` Will Deacon
2014-12-05 18:48                       ` Will Deacon
2014-12-02 22:41       ` Mitchel Humpherys
2014-12-02 22:41         ` Mitchel Humpherys
     [not found]         ` <vnkw8uipznbj.fsf-Yf+dfxj6toJBVvN7MMdr1KRtKmQZhJ7pQQ4Iyu8u01E@public.gmane.org>
2014-12-03 11:11           ` Will Deacon
2014-12-03 11:11             ` Will Deacon
2014-12-05 10:55       ` Varun Sethi
2014-12-05 10:55         ` Varun Sethi
     [not found]         ` <BN3PR0301MB12198CE5D736CDC6A221EDC2EA790-CEkquS/Gb81uuip9JPHoc5wN6zqB+hSMnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2014-12-05 18:48           ` Will Deacon
2014-12-05 18:48             ` Will Deacon
     [not found]             ` <20141205184802.GH1203-5wv7dgnIgG8@public.gmane.org>
2014-12-14 17:45               ` Varun Sethi
2014-12-14 17:45                 ` Varun Sethi
     [not found]                 ` <BN3PR0301MB1219D3161E4E9DB314FDD8FAEA6E0-CEkquS/Gb81uuip9JPHoc5wN6zqB+hSMnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2014-12-15 13:30                   ` Will Deacon
2014-12-15 13:30                     ` Will Deacon
     [not found]                     ` <20141215133020.GJ20738-5wv7dgnIgG8@public.gmane.org>
2014-12-15 15:43                       ` Will Deacon
2014-12-15 15:43                         ` Will Deacon
2014-12-15 16:35                         ` Varun Sethi
2014-12-15 16:35                           ` Varun Sethi
     [not found]                           ` <BN3PR0301MB12194A8F5CFF870B7A124623EA6F0-CEkquS/Gb81uuip9JPHoc5wN6zqB+hSMnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2014-12-15 17:25                             ` Will Deacon
2014-12-15 17:25                               ` Will Deacon
2014-12-15 16:43                       ` Varun Sethi
2014-12-15 16:43                         ` Varun Sethi
     [not found]                         ` <BN3PR0301MB12199C5CAD33E745CC7E51F4EA6F0-CEkquS/Gb81uuip9JPHoc5wN6zqB+hSMnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2014-12-15 17:20                           ` Will Deacon
2014-12-15 17:20                             ` Will Deacon
2014-11-27 11:51   ` [PATCH 3/4] iommu: add self-consistency tests to ARM LPAE IO " Will Deacon
2014-11-27 11:51     ` Will Deacon
2014-11-27 11:51   ` [PATCH 4/4] iommu/arm-smmu: make use of generic LPAE allocator Will Deacon
2014-11-27 11:51     ` Will Deacon
2014-11-30 22:03   ` [PATCH 0/4] Generic IOMMU page table framework Laurent Pinchart
2014-11-30 22:03     ` Laurent Pinchart
2014-12-01 12:05     ` Will Deacon
2014-12-01 12:05       ` Will Deacon
     [not found]       ` <20141201120534.GC18466-5wv7dgnIgG8@public.gmane.org>
2014-12-02 13:47         ` Laurent Pinchart
2014-12-02 13:47           ` Laurent Pinchart
2014-12-02 13:53           ` Will Deacon
2014-12-02 13:53             ` Will Deacon
     [not found]             ` <20141202135356.GF9917-5wv7dgnIgG8@public.gmane.org>
2014-12-02 22:29               ` Laurent Pinchart
2014-12-02 22:29                 ` Laurent Pinchart
2014-12-14 23:49   ` Laurent Pinchart
2014-12-14 23:49     ` Laurent Pinchart
2014-12-15 16:10     ` Will Deacon
2014-12-15 16:10       ` Will Deacon
     [not found]       ` <20141215161052.GM20738-5wv7dgnIgG8@public.gmane.org>
2014-12-15 17:33         ` Laurent Pinchart
2014-12-15 17:33           ` Laurent Pinchart
2014-12-15 17:39           ` Will Deacon
2014-12-15 17:39             ` Will Deacon
     [not found]             ` <20141215173911.GT20738-5wv7dgnIgG8@public.gmane.org>
2014-12-15 17:46               ` Laurent Pinchart
2014-12-15 17:46                 ` Laurent Pinchart
