* [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU
@ 2021-01-28 15:17 Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU Keqian Zhu
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

Hi all,

This patch series implements a new dirty log tracking method for vfio DMA.

Intention:

As we know, vfio live migration is an important and valuable feature, but there are still
many hurdles to clear, including migration of interrupts, device state, DMA dirty log
tracking, etc.

For now, the only dirty log tracking interface is pinning. It has some drawbacks:
1. Only vendor drivers that are aware of the interface use it.
2. It's coarse-grained: the pinned scope is generally bigger than what the device actually accesses.
3. It can't track dirty pages continuously and precisely; vfio reports the whole pinned scope as
   dirty, so it doesn't work well with iterative dirty log handling.

About SMMU HTTU:

HTTU (Hardware Translation Table Update) is a feature of ARM SMMUv3 that lets the hardware
update the access flag and/or the dirty state of a TTD (Translation Table Descriptor).
With HTTU, a stage 1 TTD falls into one of 3 types:
                        DBM bit             AP[2] (read-only bit)
1. writable_clean         1                       1
2. writable_dirty         1                       0
3. readonly               0                       1

If HTTU_HD (hardware management of the dirty state) is enabled, the SMMU changes a TTD from
writable_clean to writable_dirty when a DMA write goes through it. Software can then scan the
TTDs to sync the dirty state into a dirty bitmap. With this feature, we can track the dirty
log of DMA continuously and precisely.
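
To make the encoding concrete, here is a minimal sketch (not part of the series) of how
software can classify a stage 1 leaf TTD. It uses the ARM_LPAE_PTE_DBM and
ARM_LPAE_PTE_AP_RDONLY definitions from io-pgtable-arm.c (ARM_LPAE_PTE_DBM is added by
patch 2); the helper and the enum are made up for illustration:

  enum ttd_state { TTD_READONLY, TTD_WRITABLE_CLEAN, TTD_WRITABLE_DIRTY };

  static enum ttd_state ttd_classify(arm_lpae_iopte pte)
  {
          if (!(pte & ARM_LPAE_PTE_DBM))
                  return TTD_READONLY;            /* DBM=0, AP[2]=1 */
          if (pte & ARM_LPAE_PTE_AP_RDONLY)
                  return TTD_WRITABLE_CLEAN;      /* DBM=1, AP[2]=1 */
          /* DBM=1, AP[2]=0: the SMMU cleared AP[2] on a DMA write */
          return TTD_WRITABLE_DIRTY;
  }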

About this series:

Patch 1-3: Add feature detection for SMMU HTTU and enable HTTU for SMMU stage 1 mapping.
           Also add feature detection for SMMU BBML. We need to split block mappings when
           dirty log tracking starts and merge page mappings when it stops, which normally
           requires a break-before-make procedure. But that can cause problems while the
           TTD is live, because the I/O streams might not tolerate translation faults, so
           BBML should be used instead.

Patch 4-7: Add four interfaces (split_block, merge_page, sync_dirty_log and clear_dirty_log)
           to the IOMMU layer; they are essential for implementing DMA dirty log tracking
           for vfio. We implement these interfaces for arm smmuv3.

Patch   8: Add HWDBM (Hardware Dirty Bit Management) device feature reporting in the IOMMU layer.

Patch9-11: Implement a new dirty log tracking method for vfio based on IOMMU HWDBM. A new
           ioctl operation named VFIO_DIRTY_LOG_MANUAL_CLEAR is added, which can eliminate
           some redundant dirty handling in userspace. A rough usage sketch of the new
           interfaces is given below.
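
For reference, below is a rough usage sketch (illustrative only; the function, argument
choices and error handling are assumptions, not code from this series) of how a caller
such as vfio_iommu_type1 is expected to drive the four new interfaces over one tracked
IOVA range:

  #include <linux/iommu.h>

  /* 'bitmap' holds one bit per page of the range, relative to 'iova' */
  static void dirty_track_example(struct iommu_domain *domain,
                                  unsigned long iova, size_t size,
                                  unsigned long *bitmap)
  {
          /* 1. Start tracking: split blocks so dirty info is page granular */
          iommu_split_block(domain, iova, size);

          /* 2. Each migration round: harvest, then clear, the dirty state */
          iommu_sync_dirty_log(domain, iova, size, bitmap, iova, PAGE_SHIFT);
          iommu_clear_dirty_log(domain, iova, size, bitmap, iova, PAGE_SHIFT);

          /* 3. Stop tracking: merge the split pages back into blocks */
          iommu_merge_page(domain, iova, size, IOMMU_READ | IOMMU_WRITE);
  }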

Optimizations to do:

1. We recognized that each smmu_domain (a vfio_container may have several smmu_domains) has its
   own stage 1 mapping, and we must scan all these mappings to sync the dirty state. We plan to
   refactor smmu_domain to support more than one SMMU per smmu_domain, so that these SMMUs can
   share the same stage 1 mapping.
2. We also recognized that scanning TTDs is a performance hotspot. Recently, I have implemented
   SW/HW combined dirty log tracking on the MMU side [1], which can effectively solve this
   problem. The same idea can be applied on the SMMU side too.

Thanks,
Keqian


[1] https://lore.kernel.org/linux-arm-kernel/20210126124444.27136-1-zhukeqian1@huawei.com/

jiangkunkun (11):
  iommu/arm-smmu-v3: Add feature detection for HTTU
  iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Split block descriptor to a span of page
  iommu/arm-smmu-v3: Merge a span of page to block descriptor
  iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  iommu/arm-smmu-v3: Clear dirty log according to bitmap
  iommu/arm-smmu-v3: Add HWDBM device feature reporting
  vfio/iommu_type1: Add HWDBM status maintanance
  vfio/iommu_type1: Optimize dirty bitmap population based on iommu
    HWDBM
  vfio/iommu_type1: Add support for manual dirty log clear

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 138 ++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 +
 drivers/iommu/io-pgtable-arm.c              | 392 +++++++++++++++++++-
 drivers/iommu/iommu.c                       | 227 ++++++++++++
 drivers/vfio/vfio_iommu_type1.c             | 235 +++++++++++-
 include/linux/io-pgtable.h                  |  14 +
 include/linux/iommu.h                       |  55 +++
 include/uapi/linux/vfio.h                   |  28 +-
 8 files changed, 1093 insertions(+), 10 deletions(-)

-- 
2.19.1


* [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-02-04 19:50   ` Robin Murphy
  2021-01-28 15:17 ` [RFC PATCH 02/11] iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping Keqian Zhu
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

An SMMU that supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of TTDs in hardware. This
is essential for tracking pages dirtied by DMA.

This patch only adds feature detection; no functional change.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
 include/linux/io-pgtable.h                  |  1 +
 3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.pgsize_bitmap	= smmu->pgsize_bitmap,
 		.ias		= ias,
 		.oas		= oas,
+		.httu_hd	= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
 		.coherent_walk	= smmu->features & ARM_SMMU_FEAT_COHERENCY,
 		.tlb		= &arm_smmu_flush_ops,
 		.iommu_dev	= smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	if (reg & IDR0_HYP)
 		smmu->features |= ARM_SMMU_FEAT_HYP;
 
+	switch (FIELD_GET(IDR0_HTTU, reg)) {
+	case IDR0_HTTU_NONE:
+		break;
+	case IDR0_HTTU_HA:
+		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+		break;
+	case IDR0_HTTU_HAD:
+		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+		smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+		break;
+	default:
+		dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+		return -ENXIO;
+	}
+
 	/*
 	 * The coherency feature as set by FW is used in preference to the ID
 	 * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
 #define IDR0_ASID16			(1 << 12)
 #define IDR0_ATS			(1 << 10)
 #define IDR0_HYP			(1 << 9)
+#define IDR0_HTTU			GENMASK(7, 6)
+#define IDR0_HTTU_NONE			0
+#define IDR0_HTTU_HA			1
+#define IDR0_HTTU_HAD			2
 #define IDR0_COHACC			(1 << 4)
 #define IDR0_TTF			GENMASK(3, 2)
 #define IDR0_TTF_AARCH64		2
@@ -286,6 +290,8 @@
 #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
 
 #define CTXDESC_CD_0_AA64		(1UL << 41)
+#define CTXDESC_CD_0_HD			(1UL << 42)
+#define CTXDESC_CD_0_HA			(1UL << 43)
 #define CTXDESC_CD_0_S			(1UL << 44)
 #define CTXDESC_CD_0_R			(1UL << 45)
 #define CTXDESC_CD_0_A			(1UL << 46)
@@ -604,6 +610,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_RANGE_INV		(1 << 15)
 #define ARM_SMMU_FEAT_BTM		(1 << 16)
 #define ARM_SMMU_FEAT_SVA		(1 << 17)
+#define ARM_SMMU_FEAT_HTTU_HA		(1 << 18)
+#define ARM_SMMU_FEAT_HTTU_HD		(1 << 19)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb1a1a9..1a00ea8562c7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -97,6 +97,7 @@ struct io_pgtable_cfg {
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
 	unsigned int			oas;
+	bool				httu_hd;
 	bool				coherent_walk;
 	const struct iommu_flush_ops	*tlb;
 	struct device			*iommu_dev;
-- 
2.19.1


* [RFC PATCH 02/11] iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 03/11] iommu/arm-smmu-v3: Add feature detection for BBML Keqian Zhu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

If HTTU is supported, we enable the HA/HD bits in the SMMU CD (stage 1
mapping), and set the DBM bit for writable TTDs.

The dirty state information is encoded using the access permission
bits AP[2] (stage 1) or S2AP[1] (stage 2) in conjunction with the
DBM (Dirty Bit Modifier) bit, where DBM marks the TTD as writable and
AP[2]/S2AP[1] conveys the dirty state.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +++++
 drivers/iommu/io-pgtable-arm.c              | 7 ++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0f0fe71cc10d..8cc9d7536b08 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1036,6 +1036,11 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
 			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
 			CTXDESC_CD_0_V;
 
+		if (smmu->features & ARM_SMMU_FEAT_HTTU_HA)
+			val |= CTXDESC_CD_0_HA;
+		if (smmu->features & ARM_SMMU_FEAT_HTTU_HD)
+			val |= CTXDESC_CD_0_HD;
+
 		/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
 		if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
 			val |= CTXDESC_CD_0_S;
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..e299a44808ae 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -72,6 +72,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
@@ -81,7 +82,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -379,6 +380,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 					   int prot)
 {
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 	arm_lpae_iopte pte;
 
 	if (data->iop.fmt == ARM_64_LPAE_S1 ||
@@ -386,6 +388,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 		pte = ARM_LPAE_PTE_nG;
 		if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
 			pte |= ARM_LPAE_PTE_AP_RDONLY;
+		else if (cfg->httu_hd)
+			pte |= ARM_LPAE_PTE_DBM;
+
 		if (!(prot & IOMMU_PRIV))
 			pte |= ARM_LPAE_PTE_AP_UNPRIV;
 	} else {
-- 
2.19.1


* [RFC PATCH 03/11] iommu/arm-smmu-v3: Add feature detection for BBML
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 02/11] iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page Keqian Zhu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

When altering a translation table descriptor for certain reasons, a
break-before-make procedure is normally required. But that can cause
problems while the TTD is live, because the I/O streams might not
tolerate translation faults.

If the SMMU supports BBML level 1 or BBML level 2, we can change the
block size without using break-before-make.

This patch only adds feature detection for BBML; no functional change.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++++++
 include/linux/io-pgtable.h                  |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8cc9d7536b08..9208881a571c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1947,7 +1947,7 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 				    struct arm_smmu_master *master)
 {
-	int ret;
+	int ret, bbml;
 	unsigned long ias, oas;
 	enum io_pgtable_fmt fmt;
 	struct io_pgtable_cfg pgtbl_cfg;
@@ -1988,12 +1988,20 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		return -EINVAL;
 	}
 
+	if (smmu->features & ARM_SMMU_FEAT_BBML2)
+		bbml = 2;
+	else if (smmu->features & ARM_SMMU_FEAT_BBML1)
+		bbml = 1;
+	else
+		bbml = 0;
+
 	pgtbl_cfg = (struct io_pgtable_cfg) {
 		.pgsize_bitmap	= smmu->pgsize_bitmap,
 		.ias		= ias,
 		.oas		= oas,
 		.httu_hd	= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
 		.coherent_walk	= smmu->features & ARM_SMMU_FEAT_COHERENCY,
+		.bbml		= bbml,
 		.tlb		= &arm_smmu_flush_ops,
 		.iommu_dev	= smmu->dev,
 	};
@@ -3328,6 +3336,20 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	/* IDR3 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+	switch (FIELD_GET(IDR3_BBML, reg)) {
+	case IDR3_BBML0:
+		break;
+	case IDR3_BBML1:
+		smmu->features |= ARM_SMMU_FEAT_BBML1;
+		break;
+	case IDR3_BBML2:
+		smmu->features |= ARM_SMMU_FEAT_BBML2;
+		break;
+	default:
+		dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+		return -ENXIO;
+	}
+
 	if (FIELD_GET(IDR3_RIL, reg))
 		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e91bea44519e..11e526ab7239 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -55,6 +55,10 @@
 #define IDR1_SIDSIZE			GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3			0xc
+#define IDR3_BBML			GENMASK(12, 11)
+#define IDR3_BBML0			0
+#define IDR3_BBML1			1
+#define IDR3_BBML2			2
 #define IDR3_RIL			(1 << 10)
 
 #define ARM_SMMU_IDR5			0x14
@@ -612,6 +616,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_SVA		(1 << 17)
 #define ARM_SMMU_FEAT_HTTU_HA		(1 << 18)
 #define ARM_SMMU_FEAT_HTTU_HD		(1 << 19)
+#define ARM_SMMU_FEAT_BBML1		(1 << 20)
+#define ARM_SMMU_FEAT_BBML2		(1 << 21)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 1a00ea8562c7..26583beeb5d9 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -99,6 +99,7 @@ struct io_pgtable_cfg {
 	unsigned int			oas;
 	bool				httu_hd;
 	bool				coherent_walk;
+	int				bbml;
 	const struct iommu_flush_ops	*tlb;
 	struct device			*iommu_dev;
 
-- 
2.19.1


* [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (2 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 03/11] iommu/arm-smmu-v3: Add feature detection for BBML Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-02-04 19:51   ` Robin Murphy
  2021-01-28 15:17 ` [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor Keqian Zhu
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

A block descriptor is not a proper granule for dirty log tracking. This
adds a new interface named split_block to the IOMMU layer, implemented
for arm smmuv3, which splits a block descriptor into an equivalent span
of page descriptors.

While a block is being split, other interfaces are not expected to be in
use, so no race condition exists. We flush all iotlbs only after the
split procedure completes to ease the pressure on the IOMMU, as we will
generally split a huge range of block mappings.
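
As a caller-side illustration (the 2MiB size and the warning policy below are
assumptions, not taken from this series), splitting a single block mapping before
tracking starts could look like:

  size_t done = iommu_split_block(domain, iova, SZ_2M);

  /* The return value is the number of bytes actually split */
  if (done != SZ_2M)
          pr_warn("split_block: only 0x%zx bytes split\n", done);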

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 ++++
 drivers/iommu/io-pgtable-arm.c              | 122 ++++++++++++++++++++
 drivers/iommu/iommu.c                       |  40 +++++++
 include/linux/io-pgtable.h                  |   2 +
 include/linux/iommu.h                       |  10 ++
 5 files changed, 194 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9208881a571c..5469f4fca820 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 	return ret;
 }
 
+static size_t arm_smmu_split_block(struct iommu_domain *domain,
+				   unsigned long iova, size_t size)
+{
+	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+
+	if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+		dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
+		return 0;
+	}
+
+	if (!ops || !ops->split_block) {
+		pr_err("don't support split block\n");
+		return 0;
+	}
+
+	return ops->split_block(ops, iova, size);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.device_group		= arm_smmu_device_group,
 	.domain_get_attr	= arm_smmu_domain_get_attr,
 	.domain_set_attr	= arm_smmu_domain_set_attr,
+	.split_block		= arm_smmu_split_block,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e299a44808ae..f3b7f7115e38 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
 #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
 #define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
+/* Block descriptor bits */
+#define ARM_LPAE_PTE_NT			(((arm_lpae_iopte)1) << 16)
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
@@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
 	return iopte_to_paddr(pte, data) | iova;
 }
 
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
+				     unsigned long iova, size_t size, int lvl,
+				     arm_lpae_iopte *ptep);
+
+static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
+				    unsigned long iova, size_t size,
+				    arm_lpae_iopte blk_pte, int lvl,
+				    arm_lpae_iopte *ptep)
+{
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte pte, *tablep;
+	phys_addr_t blk_paddr;
+	size_t tablesz = ARM_LPAE_GRANULE(data);
+	size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	int i;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return 0;
+
+	tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+	if (!tablep)
+		return 0;
+
+	blk_paddr = iopte_to_paddr(blk_pte, data);
+	pte = iopte_prot(blk_pte);
+	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
+		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
+
+	if (cfg->bbml == 1) {
+		/* Race does not exist */
+		blk_pte |= ARM_LPAE_PTE_NT;
+		__arm_lpae_set_pte(ptep, blk_pte, cfg);
+		io_pgtable_tlb_flush_walk(&data->iop, iova, size, size);
+	}
+	/* Race does not exist */
+	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+
+	/* Have splited it into page? */
+	if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
+		return size;
+
+	/* Go back to lvl - 1 */
+	ptep -= ARM_LPAE_LVL_IDX(iova, lvl - 1, data);
+	return __arm_lpae_split_block(data, iova, size, lvl - 1, ptep);
+}
+
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
+				     unsigned long iova, size_t size, int lvl,
+				     arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte;
+	struct io_pgtable *iop = &data->iop;
+	size_t base, next_size, total_size;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return 0;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return 0;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt)) {
+			if (lvl == (ARM_LPAE_MAX_LEVELS - 1) ||
+			    (pte & ARM_LPAE_PTE_AP_RDONLY))
+				return size;
+
+			/* We find a writable block, split it. */
+			return arm_lpae_do_split_blk(data, iova, size, pte,
+					lvl + 1, ptep);
+		} else {
+			/* If it is the last table level, then nothing to do */
+			if (lvl == (ARM_LPAE_MAX_LEVELS - 2))
+				return size;
+
+			total_size = 0;
+			next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+			ptep = iopte_deref(pte, data);
+			for (base = 0; base < size; base += next_size)
+				total_size += __arm_lpae_split_block(data,
+						iova + base, next_size, lvl + 1,
+						ptep);
+			return total_size;
+		}
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		WARN(1, "Can't split behind a block.\n");
+		return 0;
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_split_block(data, iova, size, lvl + 1, ptep);
+}
+
+static size_t arm_lpae_split_block(struct io_pgtable_ops *ops,
+				   unsigned long iova, size_t size)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return 0;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return 0;
+
+	/* If it is smallest granule, then nothing to do */
+	if (size == ARM_LPAE_BLOCK_SIZE(ARM_LPAE_MAX_LEVELS - 1, data))
+		return size;
+
+	return __arm_lpae_split_block(data, iova, size, lvl, ptep);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -757,6 +878,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.map		= arm_lpae_map,
 		.unmap		= arm_lpae_unmap,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
+		.split_block	= arm_lpae_split_block,
 	};
 
 	return data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ffeebda8d6de..7dc0850448c3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2707,6 +2707,46 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
 
+size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
+			 size_t size)
+{
+	const struct iommu_ops *ops = domain->ops;
+	unsigned int min_pagesz;
+	size_t pgsize, splited_size;
+	size_t splited = 0;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+
+	if (!IS_ALIGNED(iova | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+		       iova, size, min_pagesz);
+		return 0;
+	}
+
+	if (!ops || !ops->split_block) {
+		pr_err("don't support split block\n");
+		return 0;
+	}
+
+	while (size) {
+		pgsize = iommu_pgsize(domain, iova, size);
+
+		splited_size = ops->split_block(domain, iova, pgsize);
+
+		pr_debug("splited: iova 0x%lx size 0x%zx\n", iova, splited_size);
+		iova += splited_size;
+		size -= splited_size;
+		splited += splited_size;
+
+		if (splited_size != pgsize)
+			break;
+	}
+	iommu_flush_iotlb_all(domain);
+
+	return splited;
+}
+EXPORT_SYMBOL_GPL(iommu_split_block);
+
 void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 {
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 26583beeb5d9..b87c6f4ecaa2 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -162,6 +162,8 @@ struct io_pgtable_ops {
 			size_t size, struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
 				    unsigned long iova);
+	size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
+			      size_t size);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b3f0e2018c62..abeb811098a5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -258,6 +258,8 @@ struct iommu_ops {
 			       enum iommu_attr attr, void *data);
 	int (*domain_set_attr)(struct iommu_domain *domain,
 			       enum iommu_attr attr, void *data);
+	size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
+			      size_t size);
 
 	/* Request/Free a list of reserved regions for a device */
 	void (*get_resv_regions)(struct device *dev, struct list_head *list);
@@ -509,6 +511,8 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
 extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
+extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
+				size_t size);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
@@ -903,6 +907,12 @@ static inline int iommu_domain_set_attr(struct iommu_domain *domain,
 	return -EINVAL;
 }
 
+static inline size_t iommu_split_block(struct iommu_domain *domain,
+				       unsigned long iova, size_t size)
+{
+	return 0;
+}
+
 static inline int  iommu_device_register(struct iommu_device *iommu)
 {
 	return -ENODEV;
-- 
2.19.1


* [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (3 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-02-04 19:52   ` Robin Murphy
  2021-01-28 15:17 ` [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log Keqian Zhu
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

When dirty log tracking stops, we need to recover all the block
descriptors that were split when tracking started. This adds a new
interface named merge_page to the IOMMU layer, implemented for arm
smmuv3, which reinstalls block mappings and unmaps the span of page
mappings.

It's the caller's duty to find contiguous physical memory.

While pages are being merged, other interfaces are not expected to be in
use, so no race condition exists. We flush all iotlbs only after the
merge procedure completes to ease the pressure on the IOMMU, as we will
generally merge a huge range of page mappings.
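
A caller-side sketch (the prot value is an assumption): once tracking stops, the whole
previously split range can be handed back in one call; iommu_merge_page() itself walks
iommu_iova_to_phys() to find physically contiguous runs before asking the driver to
merge them:

  size_t merged = iommu_merge_page(domain, iova, size,
                                   IOMMU_READ | IOMMU_WRITE);

  if (merged != size)
          pr_warn("merge_page: only 0x%zx of 0x%zx bytes merged\n",
                  merged, size);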

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++++++
 drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++++++
 drivers/iommu/iommu.c                       | 75 ++++++++++++++++++++
 include/linux/io-pgtable.h                  |  2 +
 include/linux/iommu.h                       | 10 +++
 5 files changed, 185 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5469f4fca820..2434519e4bb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain *domain,
 	return ops->split_block(ops, iova, size);
 }
 
+static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iova,
+				  phys_addr_t paddr, size_t size, int prot)
+{
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+	if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+		dev_err(smmu->dev, "don't support BBML1/2 and merge page\n");
+		return 0;
+	}
+
+	if (!ops || !ops->merge_page) {
+		pr_err("don't support merge page\n");
+		return 0;
+	}
+
+	return ops->merge_page(ops, iova, paddr, size, prot);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2629,6 +2648,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.domain_get_attr	= arm_smmu_domain_get_attr,
 	.domain_set_attr	= arm_smmu_domain_set_attr,
 	.split_block		= arm_smmu_split_block,
+	.merge_page		= arm_smmu_merge_page,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f3b7f7115e38..17390f258eb1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops *ops,
 	return __arm_lpae_split_block(data, iova, size, lvl, ptep);
 }
 
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,
+				    unsigned long iova, phys_addr_t paddr,
+				    size_t size, int lvl, arm_lpae_iopte *ptep,
+				    arm_lpae_iopte prot)
+{
+	arm_lpae_iopte pte, *tablep;
+	struct io_pgtable *iop = &data->iop;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return 0;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return 0;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt))
+			return size;
+
+		/* Race does not exist */
+		if (cfg->bbml == 1) {
+			prot |= ARM_LPAE_PTE_NT;
+			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+			io_pgtable_tlb_flush_walk(iop, iova, size,
+						  ARM_LPAE_GRANULE(data));
+
+			prot &= ~(ARM_LPAE_PTE_NT);
+			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+		} else {
+			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+		}
+
+		tablep = iopte_deref(pte, data);
+		__arm_lpae_free_pgtable(data, lvl + 1, tablep);
+		return size;
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		/* The size is too small, already merged */
+		return size;
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_merge_page(data, iova, paddr, size, lvl + 1, ptep, prot);
+}
+
+static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long iova,
+				  phys_addr_t paddr, size_t size, int iommu_prot)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	arm_lpae_iopte prot;
+	long iaext = (s64)iova >> cfg->ias;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return size;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return 0;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext || paddr >> cfg->oas))
+		return 0;
+
+	/* If it is smallest granule, then nothing to do */
+	if (size == ARM_LPAE_BLOCK_SIZE(ARM_LPAE_MAX_LEVELS - 1, data))
+		return size;
+
+	prot = arm_lpae_prot_to_pte(data, iommu_prot);
+	return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -879,6 +956,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.unmap		= arm_lpae_unmap,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.split_block	= arm_lpae_split_block,
+		.merge_page	= arm_lpae_merge_page,
 	};
 
 	return data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dc0850448c3..f1261da11ea8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2747,6 +2747,81 @@ size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
 }
 EXPORT_SYMBOL_GPL(iommu_split_block);
 
+static size_t __iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
+				 phys_addr_t paddr, size_t size, int prot)
+{
+	const struct iommu_ops *ops = domain->ops;
+	unsigned int min_pagesz;
+	size_t pgsize, merged_size;
+	size_t merged = 0;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+
+	if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
+			iova, &paddr, size, min_pagesz);
+		return 0;
+	}
+
+	if (!ops || !ops->merge_page) {
+		pr_err("don't support merge page\n");
+		return 0;
+	}
+
+	while (size) {
+		pgsize = iommu_pgsize(domain, iova | paddr, size);
+
+		merged_size = ops->merge_page(domain, iova, paddr, pgsize, prot);
+
+		pr_debug("merged: iova 0x%lx pa %pa size 0x%zx\n", iova, &paddr,
+			 merged_size);
+		iova += merged_size;
+		paddr += merged_size;
+		size -= merged_size;
+		merged += merged_size;
+
+		if (merged_size != pgsize)
+			break;
+	}
+
+	return merged;
+}
+
+size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
+			size_t size, int prot)
+{
+	phys_addr_t phys;
+	dma_addr_t p, i;
+	size_t cont_size, merged_size;
+	size_t merged = 0;
+
+	while (size) {
+		phys = iommu_iova_to_phys(domain, iova);
+		cont_size = PAGE_SIZE;
+		p = phys + cont_size;
+		i = iova + cont_size;
+
+		while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
+			p += PAGE_SIZE;
+			i += PAGE_SIZE;
+			cont_size += PAGE_SIZE;
+		}
+
+		merged_size = __iommu_merge_page(domain, iova, phys, cont_size,
+				prot);
+		iova += merged_size;
+		size -= merged_size;
+		merged += merged_size;
+
+		if (merged_size != cont_size)
+			break;
+	}
+	iommu_flush_iotlb_all(domain);
+
+	return merged;
+}
+EXPORT_SYMBOL_GPL(iommu_merge_page);
+
 void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 {
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index b87c6f4ecaa2..754b62a1bbaf 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -164,6 +164,8 @@ struct io_pgtable_ops {
 				    unsigned long iova);
 	size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
 			      size_t size);
+	size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
+			     phys_addr_t phys, size_t size, int prot);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index abeb811098a5..ac2b0b1bce0f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -260,6 +260,8 @@ struct iommu_ops {
 			       enum iommu_attr attr, void *data);
 	size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
 			      size_t size);
+	size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
+			     phys_addr_t phys, size_t size, int prot);
 
 	/* Request/Free a list of reserved regions for a device */
 	void (*get_resv_regions)(struct device *dev, struct list_head *list);
@@ -513,6 +515,8 @@ extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
 extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
 				size_t size);
+extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
+			       size_t size, int prot);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
@@ -913,6 +917,12 @@ static inline size_t iommu_split_block(struct iommu_domain *domain,
 	return 0;
 }
 
+static inline size_t iommu_merge_page(struct iommu_domain *domain,
+				      unsigned long iova, size_t size, int prot)
+{
+	return -EINVAL;
+}
+
 static inline int  iommu_device_register(struct iommu_device *iommu)
 {
 	return -ENODEV;
-- 
2.19.1


* [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (4 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-02-04 19:52   ` Robin Murphy
  2021-01-28 15:17 ` [RFC PATCH 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap Keqian Zhu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

During dirty log tracking, userspace will try to retrieve the dirty log
from the IOMMU if it supports hardware dirty logging. This adds a new
interface named sync_dirty_log to the IOMMU layer, implemented for arm
smmuv3, which scans the leaf TTDs and treats a TTD as dirty if it is
writable (as we only enable HTTU for stage 1, this means checking that
AP[2] is not set).
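
To clarify the bitmap arguments (a hypothetical caller fragment, not code from this
patch): base_iova and bitmap_pgshift define how an IOVA maps to a bit index, matching
the driver's internal computation offset = (iova - base_iova) >> bitmap_pgshift:

  /* One bit per page of the tracked range starting at 'base_iova' */
  unsigned long *bitmap = bitmap_zalloc(size >> PAGE_SHIFT, GFP_KERNEL);
  int ret;

  ret = iommu_sync_dirty_log(domain, base_iova, size, bitmap,
                             base_iova, PAGE_SHIFT);
  /*
   * On success, test_bit(n, bitmap) reports whether the page at
   * base_iova + (n << PAGE_SHIFT) was dirtied by DMA.
   */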

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++++++
 drivers/iommu/io-pgtable-arm.c              | 90 +++++++++++++++++++++
 drivers/iommu/iommu.c                       | 41 ++++++++++
 include/linux/io-pgtable.h                  |  4 +
 include/linux/iommu.h                       | 17 ++++
 5 files changed, 179 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2434519e4bb6..43d0536b429a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iov
 	return ops->merge_page(ops, iova, paddr, size, prot);
 }
 
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
+				   unsigned long iova, size_t size,
+				   unsigned long *bitmap,
+				   unsigned long base_iova,
+				   unsigned long bitmap_pgshift)
+{
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+		dev_err(smmu->dev, "don't support HTTU_HD and sync dirty log\n");
+		return -EPERM;
+	}
+
+	if (!ops || !ops->sync_dirty_log) {
+		pr_err("don't support sync dirty log\n");
+		return -ENODEV;
+	}
+
+	/* To ensure all inflight transactions are completed */
+	arm_smmu_flush_iotlb_all(domain);
+
+	return ops->sync_dirty_log(ops, iova, size, bitmap,
+			base_iova, bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.domain_set_attr	= arm_smmu_domain_set_attr,
 	.split_block		= arm_smmu_split_block,
 	.merge_page		= arm_smmu_merge_page,
+	.sync_dirty_log		= arm_smmu_sync_dirty_log,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 17390f258eb1..6cfe1ef3fedd 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long iova
 	return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
 }
 
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
+				     unsigned long iova, size_t size,
+				     int lvl, arm_lpae_iopte *ptep,
+				     unsigned long *bitmap,
+				     unsigned long base_iova,
+				     unsigned long bitmap_pgshift)
+{
+	arm_lpae_iopte pte;
+	struct io_pgtable *iop = &data->iop;
+	size_t base, next_size;
+	unsigned long offset;
+	int nbits, ret;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return -EINVAL;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt)) {
+			if (pte & ARM_LPAE_PTE_AP_RDONLY)
+				return 0;
+
+			/* It is writable, set the bitmap */
+			nbits = size >> bitmap_pgshift;
+			offset = (iova - base_iova) >> bitmap_pgshift;
+			bitmap_set(bitmap, offset, nbits);
+			return 0;
+		} else {
+			/* To traverse next level */
+			next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+			ptep = iopte_deref(pte, data);
+			for (base = 0; base < size; base += next_size) {
+				ret = __arm_lpae_sync_dirty_log(data,
+						iova + base, next_size, lvl + 1,
+						ptep, bitmap, base_iova, bitmap_pgshift);
+				if (ret)
+					return ret;
+			}
+			return 0;
+		}
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		if (pte & ARM_LPAE_PTE_AP_RDONLY)
+			return 0;
+
+		/* Though the size is too small, also set bitmap */
+		nbits = size >> bitmap_pgshift;
+		offset = (iova - base_iova) >> bitmap_pgshift;
+		bitmap_set(bitmap, offset, nbits);
+		return 0;
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_sync_dirty_log(data, iova, size, lvl + 1, ptep,
+			bitmap, base_iova, bitmap_pgshift);
+}
+
+static int arm_lpae_sync_dirty_log(struct io_pgtable_ops *ops,
+				   unsigned long iova, size_t size,
+				   unsigned long *bitmap,
+				   unsigned long base_iova,
+				   unsigned long bitmap_pgshift)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_sync_dirty_log(data, iova, size, lvl, ptep,
+					 bitmap, base_iova, bitmap_pgshift);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.split_block	= arm_lpae_split_block,
 		.merge_page	= arm_lpae_merge_page,
+		.sync_dirty_log	= arm_lpae_sync_dirty_log,
 	};
 
 	return data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f1261da11ea8..69f268069282 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
 }
 EXPORT_SYMBOL_GPL(iommu_merge_page);
 
+int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
+			 size_t size, unsigned long *bitmap,
+			 unsigned long base_iova, unsigned long bitmap_pgshift)
+{
+	const struct iommu_ops *ops = domain->ops;
+	unsigned int min_pagesz;
+	size_t pgsize;
+	int ret;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+
+	if (!IS_ALIGNED(iova | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+		       iova, size, min_pagesz);
+		return -EINVAL;
+	}
+
+	if (!ops || !ops->sync_dirty_log) {
+		pr_err("don't support sync dirty log\n");
+		return -ENODEV;
+	}
+
+	while (size) {
+		pgsize = iommu_pgsize(domain, iova, size);
+
+		ret = ops->sync_dirty_log(domain, iova, pgsize,
+					  bitmap, base_iova, bitmap_pgshift);
+		if (ret)
+			break;
+
+		pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
+			 pgsize);
+
+		iova += pgsize;
+		size -= pgsize;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
+
 void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 {
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 754b62a1bbaf..f44551e4a454 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -166,6 +166,10 @@ struct io_pgtable_ops {
 			      size_t size);
 	size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
 			     phys_addr_t phys, size_t size, int prot);
+	int (*sync_dirty_log)(struct io_pgtable_ops *ops,
+			      unsigned long iova, size_t size,
+			      unsigned long *bitmap, unsigned long base_iova,
+			      unsigned long bitmap_pgshift);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ac2b0b1bce0f..8069c8375e63 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -262,6 +262,10 @@ struct iommu_ops {
 			      size_t size);
 	size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
 			     phys_addr_t phys, size_t size, int prot);
+	int (*sync_dirty_log)(struct iommu_domain *domain,
+			      unsigned long iova, size_t size,
+			      unsigned long *bitmap, unsigned long base_iova,
+			      unsigned long bitmap_pgshift);
 
 	/* Request/Free a list of reserved regions for a device */
 	void (*get_resv_regions)(struct device *dev, struct list_head *list);
@@ -517,6 +521,10 @@ extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
 				size_t size);
 extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
 			       size_t size, int prot);
+extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
+				size_t size, unsigned long *bitmap,
+				unsigned long base_iova,
+				unsigned long bitmap_pgshift);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
@@ -923,6 +931,15 @@ static inline size_t iommu_merge_page(struct iommu_domain *domain,
 	return -EINVAL;
 }
 
+static inline int iommu_sync_dirty_log(struct iommu_domain *domain,
+				       unsigned long iova, size_t size,
+				       unsigned long *bitmap,
+				       unsigned long base_iova,
+				       unsigned long pgshift)
+{
+	return -EINVAL;
+}
+
 static inline int  iommu_device_register(struct iommu_device *iommu)
 {
 	return -ENODEV;
-- 
2.19.1


* [RFC PATCH 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (5 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting Keqian Zhu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

After the dirty log is retrieved, userspace should clear it to re-enable
dirty log tracking for the dirtied pages. This adds a new interface named
clear_dirty_log, implemented for arm smmuv3, which clears the dirty state
(as we only enable HTTU for stage 1, this means setting the AP[2] bit) of
the TTDs specified by the user-provided bitmap.
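
A short pairing sketch (hypothetical caller code, not from this patch): the typical
sequence is to sync the dirty log, hand it to userspace, and then clear exactly the
reported bits so that those pages become write-protected again and dirty tracking is
re-armed:

  ret = iommu_sync_dirty_log(domain, base_iova, size, bitmap,
                             base_iova, PAGE_SHIFT);
  if (!ret) {
          /* ... copy 'bitmap' to userspace ... */
          ret = iommu_clear_dirty_log(domain, base_iova, size, bitmap,
                                      base_iova, PAGE_SHIFT);
  }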

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++
 drivers/iommu/io-pgtable-arm.c              | 95 +++++++++++++++++++++
 drivers/iommu/iommu.c                       | 71 +++++++++++++++
 include/linux/io-pgtable.h                  |  4 +
 include/linux/iommu.h                       | 17 ++++
 5 files changed, 211 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 43d0536b429a..0c24503d29d3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2574,6 +2574,29 @@ static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
 			base_iova, bitmap_pgshift);
 }
 
+static int arm_smmu_clear_dirty_log(struct iommu_domain *domain,
+				    unsigned long iova, size_t size,
+				    unsigned long *bitmap,
+				    unsigned long base_iova,
+				    unsigned long bitmap_pgshift)
+{
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+		dev_err(smmu->dev, "don't support HTTU_HD and clear dirty log\n");
+		return -EPERM;
+	}
+
+	if (!ops || !ops->clear_dirty_log) {
+		pr_err("don't support clear dirty log\n");
+		return -ENODEV;
+	}
+
+	return ops->clear_dirty_log(ops, iova, size, bitmap, base_iova,
+				    bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2676,6 +2699,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.split_block		= arm_smmu_split_block,
 	.merge_page		= arm_smmu_merge_page,
 	.sync_dirty_log		= arm_smmu_sync_dirty_log,
+	.clear_dirty_log	= arm_smmu_clear_dirty_log,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6cfe1ef3fedd..2256e37bcb3a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -966,6 +966,100 @@ static int arm_lpae_sync_dirty_log(struct io_pgtable_ops *ops,
 					 bitmap, base_iova, bitmap_pgshift);
 }
 
+static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data,
+				      unsigned long iova, size_t size,
+				      int lvl, arm_lpae_iopte *ptep,
+				      unsigned long *bitmap,
+				      unsigned long base_iova,
+				      unsigned long bitmap_pgshift)
+{
+	arm_lpae_iopte pte;
+	struct io_pgtable *iop = &data->iop;
+	unsigned long offset;
+	size_t base, next_size;
+	int nbits, ret, i;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return -EINVAL;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt)) {
+			if (pte & ARM_LPAE_PTE_AP_RDONLY)
+				return 0;
+
+			/* Ensure all corresponding bits are set */
+			nbits = size >> bitmap_pgshift;
+			offset = (iova - base_iova) >> bitmap_pgshift;
+			for (i = offset; i < offset + nbits; i++) {
+				if (!test_bit(i, bitmap))
+					return 0;
+			}
+
+			/* Race does not exist */
+			pte |= ARM_LPAE_PTE_AP_RDONLY;
+			__arm_lpae_set_pte(ptep, pte, &iop->cfg);
+			return 0;
+		} else {
+			/* To traverse next level */
+			next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+			ptep = iopte_deref(pte, data);
+			for (base = 0; base < size; base += next_size) {
+				ret = __arm_lpae_clear_dirty_log(data,
+						iova + base, next_size, lvl + 1,
+						ptep, bitmap, base_iova,
+						bitmap_pgshift);
+				if (ret)
+					return ret;
+			}
+			return 0;
+		}
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		/* Though the size is too small, it is already clean */
+		if (pte & ARM_LPAE_PTE_AP_RDONLY)
+			return 0;
+
+		return -EINVAL;
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_clear_dirty_log(data, iova, size, lvl + 1, ptep,
+			bitmap, base_iova, bitmap_pgshift);
+}
+
+static int arm_lpae_clear_dirty_log(struct io_pgtable_ops *ops,
+				    unsigned long iova, size_t size,
+				    unsigned long *bitmap,
+				    unsigned long base_iova,
+				    unsigned long bitmap_pgshift)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_clear_dirty_log(data, iova, size, lvl, ptep,
+			bitmap, base_iova, bitmap_pgshift);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -1047,6 +1141,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.split_block	= arm_lpae_split_block,
 		.merge_page	= arm_lpae_merge_page,
 		.sync_dirty_log	= arm_lpae_sync_dirty_log,
+		.clear_dirty_log = arm_lpae_clear_dirty_log,
 	};
 
 	return data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 69f268069282..e2731a7afab2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2863,6 +2863,77 @@ int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
 }
 EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
 
+static int __iommu_clear_dirty_log(struct iommu_domain *domain,
+				   unsigned long iova, size_t size,
+				   unsigned long *bitmap,
+				   unsigned long base_iova,
+				   unsigned long bitmap_pgshift)
+{
+	const struct iommu_ops *ops = domain->ops;
+	size_t pgsize;
+	int ret = 0;
+
+	if (!ops || !ops->clear_dirty_log) {
+		pr_err("don't support clear dirty log\n");
+		return -ENODEV;
+	}
+
+	while (size) {
+		pgsize = iommu_pgsize(domain, iova, size);
+		ret = ops->clear_dirty_log(domain, iova, pgsize, bitmap,
+				base_iova, bitmap_pgshift);
+
+		if (ret)
+			break;
+
+		pr_debug("dirty_log_clear: iova 0x%lx pagesz 0x%zx\n", iova,
+			 pgsize);
+
+		iova += pgsize;
+		size -= pgsize;
+	}
+	iommu_flush_iotlb_all(domain);
+
+	return ret;
+}
+
+int iommu_clear_dirty_log(struct iommu_domain *domain,
+			  unsigned long iova, size_t size,
+			  unsigned long *bitmap, unsigned long base_iova,
+			  unsigned long bitmap_pgshift)
+{
+	unsigned long riova, rsize;
+	unsigned int min_pagesz;
+	bool flush = false;
+	int rs, re, start, end, ret = 0;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+
+	if (!IS_ALIGNED(iova | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx min_pagesz 0x%x\n",
+		       iova, min_pagesz);
+		return -EINVAL;
+	}
+
+	start = (iova - base_iova) >> bitmap_pgshift;
+	end = start + (size >> bitmap_pgshift);
+	bitmap_for_each_set_region(bitmap, rs, re, start, end) {
+		flush = true;
+		riova = iova + (rs << bitmap_pgshift);
+		rsize = (re - rs) << bitmap_pgshift;
+		ret = __iommu_clear_dirty_log(domain, riova, rsize, bitmap,
+					      base_iova, bitmap_pgshift);
+		if (ret)
+			break;
+	}
+
+	if (flush)
+		iommu_flush_iotlb_all(domain);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_dirty_log);
+
 void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 {
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index f44551e4a454..e7134ee224c9 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -170,6 +170,10 @@ struct io_pgtable_ops {
 			      unsigned long iova, size_t size,
 			      unsigned long *bitmap, unsigned long base_iova,
 			      unsigned long bitmap_pgshift);
+	int (*clear_dirty_log)(struct io_pgtable_ops *ops,
+			       unsigned long iova, size_t size,
+			       unsigned long *bitmap, unsigned long base_iova,
+			       unsigned long bitmap_pgshift);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 8069c8375e63..1cb6cd0cfc7b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -266,6 +266,10 @@ struct iommu_ops {
 			      unsigned long iova, size_t size,
 			      unsigned long *bitmap, unsigned long base_iova,
 			      unsigned long bitmap_pgshift);
+	int (*clear_dirty_log)(struct iommu_domain *domain,
+			       unsigned long iova, size_t size,
+			       unsigned long *bitmap, unsigned long base_iova,
+			       unsigned long bitmap_pgshift);
 
 	/* Request/Free a list of reserved regions for a device */
 	void (*get_resv_regions)(struct device *dev, struct list_head *list);
@@ -525,6 +529,10 @@ extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
 				size_t size, unsigned long *bitmap,
 				unsigned long base_iova,
 				unsigned long bitmap_pgshift);
+extern int iommu_clear_dirty_log(struct iommu_domain *domain, unsigned long iova,
+				 size_t dma_size, unsigned long *bitmap,
+				 unsigned long base_iova,
+				 unsigned long bitmap_pgshift);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
@@ -940,6 +948,15 @@ static inline int iommu_sync_dirty_log(struct iommu_domain *domain,
 	return -EINVAL;
 }
 
+static inline int iommu_clear_dirty_log(struct iommu_domain *domain,
+					unsigned long iova, size_t size,
+					unsigned long *bitmap,
+					unsigned long base_iova,
+					unsigned long pgshift)
+{
+	return -EINVAL;
+}
+
 static inline int  iommu_device_register(struct iommu_device *iommu)
 {
 	return -ENODEV;
-- 
2.19.1


* [RFC PATCH 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (6 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 09/11] vfio/iommu_type1: Add HWDBM status maintenance Keqian Zhu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

We have implemented the interfaces required to support iommu
dirty log tracking. The last step is to report this capability to
the upper layer, so that users can apply higher-level policies
based on it. This adds a new device feature named
IOMMU_DEV_FEAT_HWDBM to the iommu layer. For arm smmuv3, it is
equivalent to ARM_SMMU_FEAT_HTTU_HD.
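
For illustration, an upper layer can probe this capability for each
device before relying on the iommu dirty log. A minimal sketch, not
part of this patch (patch 09 below adds a similar helper to vfio):

  static int dev_check_hwdbm(struct device *dev, void *unused)
  {
          /* 0 only when the device's iommu supports hardware dirty bits */
          return iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_HWDBM) ? 0 : -ENODEV;
  }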

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 ++
 include/linux/iommu.h                       | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0c24503d29d3..cbde0489cf31 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2629,6 +2629,8 @@ static bool arm_smmu_dev_has_feature(struct device *dev,
 	switch (feat) {
 	case IOMMU_DEV_FEAT_SVA:
 		return arm_smmu_master_sva_supported(master);
+	case IOMMU_DEV_FEAT_HWDBM:
+		return !!(master->smmu->features & ARM_SMMU_FEAT_HTTU_HD);
 	default:
 		return false;
 	}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1cb6cd0cfc7b..77e561ed57fd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -160,6 +160,7 @@ struct iommu_resv_region {
 enum iommu_dev_features {
 	IOMMU_DEV_FEAT_AUX,	/* Aux-domain feature */
 	IOMMU_DEV_FEAT_SVA,	/* Shared Virtual Addresses */
+	IOMMU_DEV_FEAT_HWDBM,	/* Hardware Dirty Bit Management */
 };
 
 #define IOMMU_PASID_INVALID	(-1U)
-- 
2.19.1


* [RFC PATCH 09/11] vfio/iommu_type1: Add HWDBM status maintenance
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (7 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM Keqian Zhu
  2021-01-28 15:17 ` [RFC PATCH 11/11] vfio/iommu_type1: Add support for manual dirty log clear Keqian Zhu
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

We are going to optimize dirty log tracking based on the iommu
HWDBM feature, but the dirty log from the iommu is useful only
when every iommu-backed group is attached to an iommu that
supports HWDBM. This maintains a counter of groups without it.
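
In other words, the iommu dirty log is trustworthy only while that
counter is zero. A minimal sketch of the intended check (the helper
name is hypothetical; patch 10 open-codes the same test):

  static bool vfio_iommu_hwdbm_usable(struct vfio_iommu *iommu)
  {
          /* No attached iommu-backed group may lack HWDBM support */
          return iommu->num_non_hwdbm_groups == 0;
  }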

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4dedaa9128..3b8522ebf955 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
 	bool			nesting;
 	bool			dirty_page_tracking;
 	bool			pinned_page_dirty_scope;
+	uint64_t		num_non_hwdbm_groups;
 };
 
 struct vfio_domain {
@@ -102,6 +103,7 @@ struct vfio_group {
 	struct list_head	next;
 	bool			mdev_group;	/* An mdev group */
 	bool			pinned_page_dirty_scope;
+	bool			iommu_hwdbm;	/* Valid for non-mdev group */
 };
 
 struct vfio_iova {
@@ -976,6 +978,27 @@ static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
 	}
 }
 
+static int vfio_dev_has_feature(struct device *dev, void *data)
+{
+	enum iommu_dev_features *feat = data;
+
+	if (!iommu_dev_has_feature(dev, *feat))
+		return -ENODEV;
+
+	return 0;
+}
+
+static bool vfio_group_supports_hwdbm(struct vfio_group *group)
+{
+	enum iommu_dev_features feat = IOMMU_DEV_FEAT_HWDBM;
+
+	if (iommu_group_for_each_dev(group->iommu_group, &feat,
+				     vfio_dev_has_feature))
+		return false;
+
+	return true;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 			      struct vfio_dma *dma, dma_addr_t base_iova,
 			      size_t pgsize)
@@ -2189,6 +2212,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	 * capable via the page pinning interface.
 	 */
 	iommu->pinned_page_dirty_scope = false;
+
+	/* Update the hwdbm status of group and iommu */
+	group->iommu_hwdbm = vfio_group_supports_hwdbm(group);
+	if (!group->iommu_hwdbm)
+		iommu->num_non_hwdbm_groups++;
+
 	mutex_unlock(&iommu->lock);
 	vfio_iommu_resv_free(&group_resv_regions);
 
@@ -2342,6 +2371,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 	struct vfio_domain *domain;
 	struct vfio_group *group;
 	bool update_dirty_scope = false;
+	bool update_iommu_hwdbm = false;
 	LIST_HEAD(iova_copy);
 
 	mutex_lock(&iommu->lock);
@@ -2380,6 +2410,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 		vfio_iommu_detach_group(domain, group);
 		update_dirty_scope = !group->pinned_page_dirty_scope;
+		update_iommu_hwdbm = !group->iommu_hwdbm;
 		list_del(&group->next);
 		kfree(group);
 		/*
@@ -2417,6 +2448,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 	 */
 	if (update_dirty_scope)
 		update_pinned_page_dirty_scope(iommu);
+	if (update_iommu_hwdbm)
+		iommu->num_non_hwdbm_groups--;
 	mutex_unlock(&iommu->lock);
 }
 
-- 
2.19.1


* [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (8 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 09/11] vfio/iommu_type1: Add HWDBM status maintenance Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  2021-02-07  9:56   ` Yi Sun
  2021-01-28 15:17 ` [RFC PATCH 11/11] vfio/iommu_type1: Add support for manual dirty log clear Keqian Zhu
  10 siblings, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

Previously, if the vfio_iommu does not have pinned_page_dirty_scope
and a vfio_dma is iommu_mapped, we populate the full dirty bitmap
for that vfio_dma. Now we can try to fetch the dirty log from the
iommu first, before falling back to that pessimistic choice.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 97 ++++++++++++++++++++++++++++++++-
 1 file changed, 94 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3b8522ebf955..1cd10f3e7ed4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -999,6 +999,25 @@ static bool vfio_group_supports_hwdbm(struct vfio_group *group)
 	return true;
 }
 
+static int vfio_iommu_dirty_log_clear(struct vfio_iommu *iommu,
+				      dma_addr_t start_iova, size_t size,
+				      unsigned long *bitmap_buffer,
+				      dma_addr_t base_iova, size_t pgsize)
+{
+	struct vfio_domain *d;
+	unsigned long pgshift = __ffs(pgsize);
+	int ret;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		ret = iommu_clear_dirty_log(d->domain, start_iova, size,
+					    bitmap_buffer, base_iova, pgshift);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 			      struct vfio_dma *dma, dma_addr_t base_iova,
 			      size_t pgsize)
@@ -1010,13 +1029,28 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 	unsigned long shift = bit_offset % BITS_PER_LONG;
 	unsigned long leftover;
 
+	if (iommu->pinned_page_dirty_scope || !dma->iommu_mapped)
+		goto bitmap_done;
+
+	/* try to get dirty log from IOMMU */
+	if (!iommu->num_non_hwdbm_groups) {
+		struct vfio_domain *d;
+
+		list_for_each_entry(d, &iommu->domain_list, next) {
+			if (iommu_sync_dirty_log(d->domain, dma->iova, dma->size,
+						dma->bitmap, dma->iova, pgshift))
+				return -EFAULT;
+		}
+		goto bitmap_done;
+	}
+
 	/*
 	 * mark all pages dirty if any IOMMU capable device is not able
 	 * to report dirty pages and all pages are pinned and mapped.
 	 */
-	if (!iommu->pinned_page_dirty_scope && dma->iommu_mapped)
-		bitmap_set(dma->bitmap, 0, nbits);
+	bitmap_set(dma->bitmap, 0, nbits);
 
+bitmap_done:
 	if (shift) {
 		bitmap_shift_left(dma->bitmap, dma->bitmap, shift,
 				  nbits + shift);
@@ -1078,6 +1112,18 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 		 */
 		bitmap_clear(dma->bitmap, 0, dma->size >> pgshift);
 		vfio_dma_populate_bitmap(dma, pgsize);
+
+		/* Clear iommu dirty log to re-enable dirty log tracking */
+		if (!iommu->pinned_page_dirty_scope &&
+		    dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+			ret = vfio_iommu_dirty_log_clear(iommu,	dma->iova,
+					dma->size, dma->bitmap, dma->iova,
+					pgsize);
+			if (ret) {
+				pr_warn("dma dirty log clear failed!\n");
+				return ret;
+			}
+		}
 	}
 	return 0;
 }
@@ -2780,6 +2826,48 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 			-EFAULT : 0;
 }
 
+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+				     struct vfio_dma *dma)
+{
+	struct vfio_domain *d;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		/* Go through all domains anyway even if we fail */
+		iommu_split_block(d->domain, dma->iova, dma->size);
+	}
+}
+
+static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
+				    struct vfio_dma *dma)
+{
+	struct vfio_domain *d;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		/* Go through all domains anyway even if we fail */
+		iommu_merge_page(d->domain, dma->iova, dma->size,
+				 d->prot | dma->prot);
+	}
+}
+
+static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
+{
+	struct rb_node *n;
+
+	/* Split and merge even if not all iommus support HWDBM now */
+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		if (!dma->iommu_mapped)
+			continue;
+
+		/* Go through all dma ranges anyway even if we fail */
+		if (start)
+			vfio_dma_dirty_log_start(iommu, dma);
+		else
+			vfio_dma_dirty_log_stop(iommu, dma);
+	}
+}
+
 static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 					unsigned long arg)
 {
@@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 		pgsize = 1 << __ffs(iommu->pgsize_bitmap);
 		if (!iommu->dirty_page_tracking) {
 			ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
-			if (!ret)
+			if (!ret) {
 				iommu->dirty_page_tracking = true;
+				vfio_iommu_dirty_log_switch(iommu, true);
+			}
 		}
 		mutex_unlock(&iommu->lock);
 		return ret;
@@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 		if (iommu->dirty_page_tracking) {
 			iommu->dirty_page_tracking = false;
 			vfio_dma_bitmap_free_all(iommu);
+			vfio_iommu_dirty_log_switch(iommu, false);
 		}
 		mutex_unlock(&iommu->lock);
 		return 0;
-- 
2.19.1


* [RFC PATCH 11/11] vfio/iommu_type1: Add support for manual dirty log clear
  2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
                   ` (9 preceding siblings ...)
  2021-01-28 15:17 ` [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM Keqian Zhu
@ 2021-01-28 15:17 ` Keqian Zhu
  10 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-01-28 15:17 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu, Will Deacon,
	Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang,
	Robin Murphy

From: jiangkunkun <jiangkunkun@huawei.com>

In the past, we cleared the dirty log immediately after syncing it
to userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, new dirty state starts to
accumulate. This new dirty state will be reported to userspace in
the next iteration, even if it was generated before userspace had
handled the same dirty page in the current one.

That is to say, we should minimize the time gap between clearing
the dirty log and handling it. We can give userspace an interface
to clear the dirty log manually.
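
Once the extension is enabled (querying VFIO_DIRTY_LOG_MANUAL_CLEAR
via VFIO_CHECK_EXTENSION turns it on, per the uapi comment below),
one iteration becomes a GET of the bitmap followed by a CLEAR over
the handled range. A minimal userspace sketch, not part of this
patch, with error handling and bitmap allocation omitted
('container' is an open VFIO container fd):

  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static void dirty_iteration(int container, __u64 iova, __u64 size,
                              __u64 pgsize, __u64 *bitmap, __u64 bitmap_bytes)
  {
          struct vfio_iommu_type1_dirty_bitmap *dbitmap;
          struct vfio_iommu_type1_dirty_bitmap_get *range;
          size_t argsz = sizeof(*dbitmap) + sizeof(*range);

          dbitmap = calloc(1, argsz);
          range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;

          dbitmap->argsz = argsz;
          range->iova = iova;
          range->size = size;
          range->bitmap.pgsize = pgsize;
          range->bitmap.size = bitmap_bytes;
          range->bitmap.data = bitmap;

          /* 1. Fetch the dirty log; with manual clear it stays set in the kernel */
          dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
          ioctl(container, VFIO_IOMMU_DIRTY_PAGES, dbitmap);

          /* 2. ... transmit the pages marked dirty in 'bitmap' ... */

          /* 3. Clear exactly what was handled, re-arming tracking for those pages */
          dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
          ioctl(container, VFIO_IOMMU_DIRTY_PAGES, dbitmap);

          free(dbitmap);
  }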

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 103 ++++++++++++++++++++++++++++++--
 include/uapi/linux/vfio.h       |  28 ++++++++-
 2 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cd10f3e7ed4..a32dc684b86e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -73,6 +73,7 @@ struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			dirty_log_manual_clear;
 	bool			pinned_page_dirty_scope;
 	uint64_t		num_non_hwdbm_groups;
 };
@@ -1018,6 +1019,78 @@ static int vfio_iommu_dirty_log_clear(struct vfio_iommu *iommu,
 	return 0;
 }
 
+static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
+				     struct vfio_iommu *iommu,
+				     dma_addr_t iova, size_t size,
+				     size_t pgsize)
+{
+	struct vfio_dma *dma;
+	struct rb_node *n;
+	dma_addr_t start_iova, end_iova, riova;
+	unsigned long pgshift = __ffs(pgsize);
+	unsigned long bitmap_size;
+	unsigned long *bitmap_buffer = NULL;
+	bool clear_valid;
+	int rs, re, start, end, dma_offset;
+	int ret = 0;
+
+	bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
+	bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
+	if (!bitmap_buffer) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+		dma = rb_entry(n, struct vfio_dma, node);
+		if (!dma->iommu_mapped)
+			continue;
+		if ((dma->iova + dma->size - 1) < iova)
+			continue;
+		if (dma->iova > iova + size - 1)
+			break;
+
+		start_iova = max(iova, dma->iova);
+		end_iova = min(iova + size, dma->iova + dma->size);
+
+		/* Similar logic as the tail of vfio_iova_dirty_bitmap */
+
+		clear_valid = false;
+		start = (start_iova - iova) >> pgshift;
+		end = (end_iova - iova) >> pgshift;
+		bitmap_for_each_set_region(bitmap_buffer, rs, re, start, end) {
+			clear_valid = true;
+			riova = iova + (rs << pgshift);
+			dma_offset = (riova - dma->iova) >> pgshift;
+			bitmap_clear(dma->bitmap, dma_offset, re - rs);
+		}
+
+		if (clear_valid)
+			vfio_dma_populate_bitmap(dma, pgsize);
+
+		if (clear_valid && !iommu->pinned_page_dirty_scope &&
+		    dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+			ret = vfio_iommu_dirty_log_clear(iommu, start_iova,
+					end_iova - start_iova,	bitmap_buffer,
+					iova, pgsize);
+			if (ret) {
+				pr_warn("dma dirty log clear failed!\n");
+				goto out;
+			}
+		}
+
+	}
+
+out:
+	kfree(bitmap_buffer);
+	return ret;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 			      struct vfio_dma *dma, dma_addr_t base_iova,
 			      size_t pgsize)
@@ -1067,6 +1140,10 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 			 DIRTY_BITMAP_BYTES(nbits + shift)))
 		return -EFAULT;
 
+	if (shift && iommu->dirty_log_manual_clear)
+		bitmap_shift_right(dma->bitmap, dma->bitmap, shift,
+				   nbits + shift);
+
 	return 0;
 }
 
@@ -1105,6 +1182,9 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 		if (ret)
 			return ret;
 
+		if (iommu->dirty_log_manual_clear)
+			continue;
+
 		/*
 		 * Re-populate bitmap to include all pinned pages which are
 		 * considered as dirty but exclude pages which are unpinned and
@@ -2601,6 +2681,11 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 		if (!iommu)
 			return 0;
 		return vfio_domains_have_iommu_cache(iommu);
+	case VFIO_DIRTY_LOG_MANUAL_CLEAR:
+		if (!iommu)
+			return 0;
+		iommu->dirty_log_manual_clear = true;
+		return 1;
 	default:
 		return 0;
 	}
@@ -2874,7 +2959,8 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	struct vfio_iommu_type1_dirty_bitmap dirty;
 	uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
 			VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
-			VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP |
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
 	unsigned long minsz;
 	int ret = 0;
 
@@ -2916,7 +3002,8 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 		}
 		mutex_unlock(&iommu->lock);
 		return 0;
-	} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+	} else if (dirty.flags & (VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP)) {
 		struct vfio_iommu_type1_dirty_bitmap_get range;
 		unsigned long pgshift;
 		size_t data_size = dirty.argsz - minsz;
@@ -2959,13 +3046,21 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 			goto out_unlock;
 		}
 
-		if (iommu->dirty_page_tracking)
+		if (!iommu->dirty_page_tracking) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP)
 			ret = vfio_iova_dirty_bitmap(range.bitmap.data,
 						     iommu, range.iova,
 						     range.size,
 						     range.bitmap.pgsize);
 		else
-			ret = -EINVAL;
+			ret = vfio_iova_dirty_log_clear(range.bitmap.data,
+							iommu, range.iova,
+							range.size,
+							range.bitmap.pgsize);
 out_unlock:
 		mutex_unlock(&iommu->lock);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d1812777139f..77a64ff38f64 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -46,6 +46,14 @@
  */
 #define VFIO_NOIOMMU_IOMMU		8
 
+/*
+ * The vfio_iommu driver may support manual clearing of the dirty log, which
+ * means the dirty log is not cleared automatically after it is copied to
+ * userspace; it is the user's responsibility to clear it. Note: once the user
+ * queries this extension and the vfio_iommu driver supports it, it is enabled.
+ */
+#define VFIO_DIRTY_LOG_MANUAL_CLEAR	9
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1161,7 +1169,24 @@ struct vfio_iommu_type1_dma_unmap {
  * actual bitmap. If dirty pages logging is not enabled, an error will be
  * returned.
  *
- * Only one of the flags _START, _STOP and _GET may be specified at a time.
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set,
+ * instructs the IOMMU driver to clear the dirty status of pages in a bitmap
+ * for IOMMU container for a given IOVA range. The user must specify the IOVA
+ * range, the bitmap and the pgsize through the structure
+ * vfio_iommu_type1_dirty_bitmap_get in the data[] portion. This interface
+ * supports clearing a bitmap of the smallest supported pgsize only and can be
+ * modified in future to clear a bitmap of any specified supported pgsize. The
+ * user must provide a memory area for the bitmap memory and specify its size
+ * in bitmap.size. One bit is used to represent one page consecutively starting
+ * from iova offset. The user should provide page size in bitmap.pgsize field.
+ * A bit set in the bitmap indicates that the dirty status of the page at that
+ * offset from iova is cleared, and dirty tracking is re-enabled for that page. The
+ * caller must set argsz to a value including the size of structure
+ * vfio_iommu_type1_dirty_bitmap_get, but excluding the size of the actual bitmap. If
+ * dirty pages logging is not enabled, an error will be returned.
+ *
+ * Only one of the flags _START, _STOP, _GET and _CLEAR may be specified at a
+ * time.
  *
  */
 struct vfio_iommu_type1_dirty_bitmap {
@@ -1170,6 +1195,7 @@ struct vfio_iommu_type1_dirty_bitmap {
 #define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
 #define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
 #define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP (1 << 3)
 	__u8         data[];
 };
 
-- 
2.19.1


* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-01-28 15:17 ` [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU Keqian Zhu
@ 2021-02-04 19:50   ` Robin Murphy
  2021-02-05  9:13     ` Keqian Zhu
  2021-03-02  7:42     ` Keqian Zhu
  0 siblings, 2 replies; 36+ messages in thread
From: Robin Murphy @ 2021-02-04 19:50 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

On 2021-01-28 15:17, Keqian Zhu wrote:
> From: jiangkunkun <jiangkunkun@huawei.com>
> 
> The SMMU which supports HTTU (Hardware Translation Table Update) can
> update the access flag and the dirty state of TTD by hardware. It is
> essential to track dirty pages of DMA.
> 
> This adds feature detection, none functional change.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>   include/linux/io-pgtable.h                  |  1 +
>   3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 8ca7415d785d..0f0fe71cc10d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   		.pgsize_bitmap	= smmu->pgsize_bitmap,
>   		.ias		= ias,
>   		.oas		= oas,
> +		.httu_hd	= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>   		.coherent_walk	= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>   		.tlb		= &arm_smmu_flush_ops,
>   		.iommu_dev	= smmu->dev,
> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>   	if (reg & IDR0_HYP)
>   		smmu->features |= ARM_SMMU_FEAT_HYP;
>   
> +	switch (FIELD_GET(IDR0_HTTU, reg)) {

We need to accommodate the firmware override as well if we need this to 
be meaningful. Jean-Philippe is already carrying a suitable patch in the 
SVA stack[1].

> +	case IDR0_HTTU_NONE:
> +		break;
> +	case IDR0_HTTU_HA:
> +		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
> +		break;
> +	case IDR0_HTTU_HAD:
> +		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
> +		smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
> +		break;
> +	default:
> +		dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
> +		return -ENXIO;
> +	}
> +
>   	/*
>   	 * The coherency feature as set by FW is used in preference to the ID
>   	 * register, but warn on mismatch.
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 96c2e9565e00..e91bea44519e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -33,6 +33,10 @@
>   #define IDR0_ASID16			(1 << 12)
>   #define IDR0_ATS			(1 << 10)
>   #define IDR0_HYP			(1 << 9)
> +#define IDR0_HTTU			GENMASK(7, 6)
> +#define IDR0_HTTU_NONE			0
> +#define IDR0_HTTU_HA			1
> +#define IDR0_HTTU_HAD			2
>   #define IDR0_COHACC			(1 << 4)
>   #define IDR0_TTF			GENMASK(3, 2)
>   #define IDR0_TTF_AARCH64		2
> @@ -286,6 +290,8 @@
>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>   
>   #define CTXDESC_CD_0_AA64		(1UL << 41)
> +#define CTXDESC_CD_0_HD			(1UL << 42)
> +#define CTXDESC_CD_0_HA			(1UL << 43)
>   #define CTXDESC_CD_0_S			(1UL << 44)
>   #define CTXDESC_CD_0_R			(1UL << 45)
>   #define CTXDESC_CD_0_A			(1UL << 46)
> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>   #define ARM_SMMU_FEAT_RANGE_INV		(1 << 15)
>   #define ARM_SMMU_FEAT_BTM		(1 << 16)
>   #define ARM_SMMU_FEAT_SVA		(1 << 17)
> +#define ARM_SMMU_FEAT_HTTU_HA		(1 << 18)
> +#define ARM_SMMU_FEAT_HTTU_HD		(1 << 19)
>   	u32				features;
>   
>   #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index ea727eb1a1a9..1a00ea8562c7 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>   	unsigned long			pgsize_bitmap;
>   	unsigned int			ias;
>   	unsigned int			oas;
> +	bool				httu_hd;

This is very specific to the AArch64 stage 1 format, not a generic 
capability - I think it should be a quirk flag rather than a common field.

Robin.

[1] 
https://jpbrucker.net/git/linux/commit/?h=sva/current&id=1ef7d512fb9082450dfe0d22ca4f7e35625a097b

>   	bool				coherent_walk;
>   	const struct iommu_flush_ops	*tlb;
>   	struct device			*iommu_dev;
> 

* Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page
  2021-01-28 15:17 ` [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page Keqian Zhu
@ 2021-02-04 19:51   ` Robin Murphy
  2021-02-07  8:18     ` Keqian Zhu
  0 siblings, 1 reply; 36+ messages in thread
From: Robin Murphy @ 2021-02-04 19:51 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

On 2021-01-28 15:17, Keqian Zhu wrote:
> From: jiangkunkun <jiangkunkun@huawei.com>
> 
> Block descriptor is not a proper granule for dirty log tracking. This
> adds a new interface named split_block in iommu layer and arm smmuv3
> implements it, which splits block descriptor to an equivalent span of
> page descriptors.
> 
> During spliting block, other interfaces are not expected to be working,
> so race condition does not exist. And we flush all iotlbs after the split
> procedure is completed to ease the pressure of iommu, as we will split a
> huge range of block mappings in general.

"Not expected to be" is not the same thing as "can not". Presumably the 
whole point of dirty log tracking is that it can be run speculatively in 
the background, so is there any actual guarantee that the guest can't, 
say, issue a hotplug event that would cause some memory to be released 
back to the host and unmapped while a scan might be in progress? Saying 
effectively "there is no race condition as long as you assume there is 
no race condition" isn't all that reassuring...

That said, it's not very clear why patches #4 and #5 are here at all, 
given that patches #6 and #7 appear quite happy to handle block entries.

> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 ++++
>   drivers/iommu/io-pgtable-arm.c              | 122 ++++++++++++++++++++
>   drivers/iommu/iommu.c                       |  40 +++++++
>   include/linux/io-pgtable.h                  |   2 +
>   include/linux/iommu.h                       |  10 ++
>   5 files changed, 194 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 9208881a571c..5469f4fca820 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
>   	return ret;
>   }
>   
> +static size_t arm_smmu_split_block(struct iommu_domain *domain,
> +				   unsigned long iova, size_t size)
> +{
> +	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
> +	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> +
> +	if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
> +		dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
> +		return 0;
> +	}
> +
> +	if (!ops || !ops->split_block) {
> +		pr_err("don't support split block\n");
> +		return 0;
> +	}
> +
> +	return ops->split_block(ops, iova, size);
> +}
> +
>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
>   {
>   	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
>   	.device_group		= arm_smmu_device_group,
>   	.domain_get_attr	= arm_smmu_domain_get_attr,
>   	.domain_set_attr	= arm_smmu_domain_set_attr,
> +	.split_block		= arm_smmu_split_block,
>   	.of_xlate		= arm_smmu_of_xlate,
>   	.get_resv_regions	= arm_smmu_get_resv_regions,
>   	.put_resv_regions	= generic_iommu_put_resv_regions,
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index e299a44808ae..f3b7f7115e38 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -79,6 +79,8 @@
>   #define ARM_LPAE_PTE_SH_IS		(((arm_lpae_iopte)3) << 8)
>   #define ARM_LPAE_PTE_NS			(((arm_lpae_iopte)1) << 5)
>   #define ARM_LPAE_PTE_VALID		(((arm_lpae_iopte)1) << 0)
> +/* Block descriptor bits */
> +#define ARM_LPAE_PTE_NT			(((arm_lpae_iopte)1) << 16)
>   
>   #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
>   /* Ignore the contiguous bit for block splitting */
> @@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
>   	return iopte_to_paddr(pte, data) | iova;
>   }
>   
> +static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
> +				     unsigned long iova, size_t size, int lvl,
> +				     arm_lpae_iopte *ptep);
> +
> +static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
> +				    unsigned long iova, size_t size,
> +				    arm_lpae_iopte blk_pte, int lvl,
> +				    arm_lpae_iopte *ptep)
> +{
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte pte, *tablep;
> +	phys_addr_t blk_paddr;
> +	size_t tablesz = ARM_LPAE_GRANULE(data);
> +	size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +	int i;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return 0;
> +
> +	tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
> +	if (!tablep)
> +		return 0;
> +
> +	blk_paddr = iopte_to_paddr(blk_pte, data);
> +	pte = iopte_prot(blk_pte);
> +	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
> +		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
> +
> +	if (cfg->bbml == 1) {
> +		/* Race does not exist */
> +		blk_pte |= ARM_LPAE_PTE_NT;
> +		__arm_lpae_set_pte(ptep, blk_pte, cfg);
> +		io_pgtable_tlb_flush_walk(&data->iop, iova, size, size);
> +	}
> +	/* Race does not exist */
> +	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
> +
> +	/* Have splited it into page? */
> +	if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
> +		return size;
> +
> +	/* Go back to lvl - 1 */
> +	ptep -= ARM_LPAE_LVL_IDX(iova, lvl - 1, data);
> +	return __arm_lpae_split_block(data, iova, size, lvl - 1, ptep);

If there is a good enough justification for actually using this, 
recursive splitting is a horrible way to do it. The theoretical 
split_blk_unmap case does it for the sake of simplicity and mitigating 
races wherein multiple parts of the same block may be unmapped, but if 
were were using this in anger then we'd need it to be fast - and a race 
does not exist, right? - so building the entire sub-table first then 
swizzling a single PTE would make far more sense.

> +}
> +
> +static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
> +				     unsigned long iova, size_t size, int lvl,
> +				     arm_lpae_iopte *ptep)
> +{
> +	arm_lpae_iopte pte;
> +	struct io_pgtable *iop = &data->iop;
> +	size_t base, next_size, total_size;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return 0;
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return 0;
> +
> +	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> +		if (iopte_leaf(pte, lvl, iop->fmt)) {
> +			if (lvl == (ARM_LPAE_MAX_LEVELS - 1) ||
> +			    (pte & ARM_LPAE_PTE_AP_RDONLY))
> +				return size;
> +
> +			/* We find a writable block, split it. */
> +			return arm_lpae_do_split_blk(data, iova, size, pte,
> +					lvl + 1, ptep);
> +		} else {
> +			/* If it is the last table level, then nothing to do */
> +			if (lvl == (ARM_LPAE_MAX_LEVELS - 2))
> +				return size;
> +
> +			total_size = 0;
> +			next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
> +			ptep = iopte_deref(pte, data);
> +			for (base = 0; base < size; base += next_size)
> +				total_size += __arm_lpae_split_block(data,
> +						iova + base, next_size, lvl + 1,
> +						ptep);
> +			return total_size;
> +		}
> +	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
> +		WARN(1, "Can't split behind a block.\n");
> +		return 0;
> +	}
> +
> +	/* Keep on walkin */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_split_block(data, iova, size, lvl + 1, ptep);
> +}
> +
> +static size_t arm_lpae_split_block(struct io_pgtable_ops *ops,
> +				   unsigned long iova, size_t size)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte *ptep = data->pgd;
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return 0;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return 0;
> +
> +	/* If it is smallest granule, then nothing to do */
> +	if (size == ARM_LPAE_BLOCK_SIZE(ARM_LPAE_MAX_LEVELS - 1, data))
> +		return size;
> +
> +	return __arm_lpae_split_block(data, iova, size, lvl, ptep);
> +}
> +
>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>   {
>   	unsigned long granule, page_sizes;
> @@ -757,6 +878,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>   		.map		= arm_lpae_map,
>   		.unmap		= arm_lpae_unmap,
>   		.iova_to_phys	= arm_lpae_iova_to_phys,
> +		.split_block	= arm_lpae_split_block,
>   	};
>   
>   	return data;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index ffeebda8d6de..7dc0850448c3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2707,6 +2707,46 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
>   }
>   EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
>   
> +size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
> +			 size_t size)
> +{
> +	const struct iommu_ops *ops = domain->ops;
> +	unsigned int min_pagesz;
> +	size_t pgsize, splited_size;
> +	size_t splited = 0;
> +
> +	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> +
> +	if (!IS_ALIGNED(iova | size, min_pagesz)) {
> +		pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
> +		       iova, size, min_pagesz);
> +		return 0;
> +	}
> +
> +	if (!ops || !ops->split_block) {
> +		pr_err("don't support split block\n");
> +		return 0;
> +	}
> +
> +	while (size) {
> +		pgsize = iommu_pgsize(domain, iova, size);

If the whole point of this operation is to split a mapping down to a 
specific granularity, why bother recalculating that granularity over and 
over again?

> +
> +		splited_size = ops->split_block(domain, iova, pgsize);
> +
> +		pr_debug("splited: iova 0x%lx size 0x%zx\n", iova, splited_size);
> +		iova += splited_size;
> +		size -= splited_size;
> +		splited += splited_size;
> +
> +		if (splited_size != pgsize)
> +			break;
> +	}
> +	iommu_flush_iotlb_all(domain);
> +
> +	return splited;

Language tangent: note that "split" is one of those delightful irregular 
verbs, in that its past tense is also "split". This isn't the best 
operation for clear, unambiguous naming :)

Don't let these idle nitpicks distract from the bigger concerns above, 
though. FWIW if the caller knows from the start that they want to keep 
track of things at page granularity, they always have the option of just 
stripping the larger sizes out of domain->pgsize_bitmap before mapping 
anything.
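
For instance, something like the sketch below (illustrative only; the
helper name is made up, but the field is the existing iommu_domain
member) run before any iommu_map() call would keep every mapping at
the base granule:

  static void restrict_domain_to_base_granule(struct iommu_domain *domain)
  {
          /* Keep only the smallest supported page size */
          domain->pgsize_bitmap = 1UL << __ffs(domain->pgsize_bitmap);
  }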

Robin.

> +}
> +EXPORT_SYMBOL_GPL(iommu_split_block);
> +
>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>   {
>   	const struct iommu_ops *ops = dev->bus->iommu_ops;
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 26583beeb5d9..b87c6f4ecaa2 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -162,6 +162,8 @@ struct io_pgtable_ops {
>   			size_t size, struct iommu_iotlb_gather *gather);
>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>   				    unsigned long iova);
> +	size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
> +			      size_t size);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b3f0e2018c62..abeb811098a5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -258,6 +258,8 @@ struct iommu_ops {
>   			       enum iommu_attr attr, void *data);
>   	int (*domain_set_attr)(struct iommu_domain *domain,
>   			       enum iommu_attr attr, void *data);
> +	size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
> +			      size_t size);
>   
>   	/* Request/Free a list of reserved regions for a device */
>   	void (*get_resv_regions)(struct device *dev, struct list_head *list);
> @@ -509,6 +511,8 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
>   				 void *data);
>   extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
>   				 void *data);
> +extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
> +				size_t size);
>   
>   /* Window handling function prototypes */
>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> @@ -903,6 +907,12 @@ static inline int iommu_domain_set_attr(struct iommu_domain *domain,
>   	return -EINVAL;
>   }
>   
> +static inline size_t iommu_split_block(struct iommu_domain *domain,
> +				       unsigned long iova, size_t size)
> +{
> +	return 0;
> +}
> +
>   static inline int  iommu_device_register(struct iommu_device *iommu)
>   {
>   	return -ENODEV;
> 

* Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor
  2021-01-28 15:17 ` [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor Keqian Zhu
@ 2021-02-04 19:52   ` Robin Murphy
  2021-02-07 12:13     ` Keqian Zhu
  0 siblings, 1 reply; 36+ messages in thread
From: Robin Murphy @ 2021-02-04 19:52 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

On 2021-01-28 15:17, Keqian Zhu wrote:
> From: jiangkunkun <jiangkunkun@huawei.com>
> 
> When stop dirty log tracking, we need to recover all block descriptors
> which are splited when start dirty log tracking. This adds a new
> interface named merge_page in iommu layer and arm smmuv3 implements it,
> which reinstall block mappings and unmap the span of page mappings.
> 
> It's caller's duty to find contiuous physical memory.
> 
> During merging page, other interfaces are not expected to be working,
> so race condition does not exist. And we flush all iotlbs after the merge
> procedure is completed to ease the pressure of iommu, as we will merge a
> huge range of page mappings in general.

Again, I think we need better reasoning than "race conditions don't 
exist because we don't expect them to exist".

> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++++++
>   drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++++++
>   drivers/iommu/iommu.c                       | 75 ++++++++++++++++++++
>   include/linux/io-pgtable.h                  |  2 +
>   include/linux/iommu.h                       | 10 +++
>   5 files changed, 185 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5469f4fca820..2434519e4bb6 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain *domain,
>   	return ops->split_block(ops, iova, size);
>   }
>   
> +static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iova,
> +				  phys_addr_t paddr, size_t size, int prot)
> +{
> +	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> +	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
> +
> +	if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
> +		dev_err(smmu->dev, "don't support BBML1/2 and merge page\n");
> +		return 0;
> +	}
> +
> +	if (!ops || !ops->merge_page) {
> +		pr_err("don't support merge page\n");
> +		return 0;
> +	}
> +
> +	return ops->merge_page(ops, iova, paddr, size, prot);
> +}
> +
>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
>   {
>   	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2629,6 +2648,7 @@ static struct iommu_ops arm_smmu_ops = {
>   	.domain_get_attr	= arm_smmu_domain_get_attr,
>   	.domain_set_attr	= arm_smmu_domain_set_attr,
>   	.split_block		= arm_smmu_split_block,
> +	.merge_page		= arm_smmu_merge_page,
>   	.of_xlate		= arm_smmu_of_xlate,
>   	.get_resv_regions	= arm_smmu_get_resv_regions,
>   	.put_resv_regions	= generic_iommu_put_resv_regions,
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f3b7f7115e38..17390f258eb1 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops *ops,
>   	return __arm_lpae_split_block(data, iova, size, lvl, ptep);
>   }
>   
> +static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,
> +				    unsigned long iova, phys_addr_t paddr,
> +				    size_t size, int lvl, arm_lpae_iopte *ptep,
> +				    arm_lpae_iopte prot)
> +{
> +	arm_lpae_iopte pte, *tablep;
> +	struct io_pgtable *iop = &data->iop;
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return 0;
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return 0;
> +
> +	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> +		if (iopte_leaf(pte, lvl, iop->fmt))
> +			return size;
> +
> +		/* Race does not exist */
> +		if (cfg->bbml == 1) {
> +			prot |= ARM_LPAE_PTE_NT;
> +			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
> +			io_pgtable_tlb_flush_walk(iop, iova, size,
> +						  ARM_LPAE_GRANULE(data));
> +
> +			prot &= ~(ARM_LPAE_PTE_NT);
> +			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
> +		} else {
> +			__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
> +		}
> +
> +		tablep = iopte_deref(pte, data);
> +		__arm_lpae_free_pgtable(data, lvl + 1, tablep);
> +		return size;
> +	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
> +		/* The size is too small, already merged */
> +		return size;
> +	}
> +
> +	/* Keep on walkin */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_merge_page(data, iova, paddr, size, lvl + 1, ptep, prot);
> +}
> +
> +static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long iova,
> +				  phys_addr_t paddr, size_t size, int iommu_prot)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	arm_lpae_iopte prot;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	/* If no access, then nothing to do */
> +	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
> +		return size;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return 0;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext || paddr >> cfg->oas))
> +		return 0;
> +
> +	/* If it is smallest granule, then nothing to do */
> +	if (size == ARM_LPAE_BLOCK_SIZE(ARM_LPAE_MAX_LEVELS - 1, data))
> +		return size;
> +
> +	prot = arm_lpae_prot_to_pte(data, iommu_prot);
> +	return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
> +}
> +
>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>   {
>   	unsigned long granule, page_sizes;
> @@ -879,6 +956,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>   		.unmap		= arm_lpae_unmap,
>   		.iova_to_phys	= arm_lpae_iova_to_phys,
>   		.split_block	= arm_lpae_split_block,
> +		.merge_page	= arm_lpae_merge_page,
>   	};
>   
>   	return data;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 7dc0850448c3..f1261da11ea8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2747,6 +2747,81 @@ size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>   }
>   EXPORT_SYMBOL_GPL(iommu_split_block);
>   
> +static size_t __iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
> +				 phys_addr_t paddr, size_t size, int prot)
> +{
> +	const struct iommu_ops *ops = domain->ops;
> +	unsigned int min_pagesz;
> +	size_t pgsize, merged_size;
> +	size_t merged = 0;
> +
> +	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> +
> +	if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
> +		pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
> +			iova, &paddr, size, min_pagesz);
> +		return 0;
> +	}
> +
> +	if (!ops || !ops->merge_page) {
> +		pr_err("don't support merge page\n");
> +		return 0;
> +	}
> +
> +	while (size) {
> +		pgsize = iommu_pgsize(domain, iova | paddr, size);
> +
> +		merged_size = ops->merge_page(domain, iova, paddr, pgsize, prot);
> +
> +		pr_debug("merged: iova 0x%lx pa %pa size 0x%zx\n", iova, &paddr,
> +			 merged_size);
> +		iova += merged_size;
> +		paddr += merged_size;
> +		size -= merged_size;
> +		merged += merged_size;
> +
> +		if (merged_size != pgsize)
> +			break;
> +	}
> +
> +	return merged;
> +}
> +
> +size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
> +			size_t size, int prot)
> +{
> +	phys_addr_t phys;
> +	dma_addr_t p, i;
> +	size_t cont_size, merged_size;
> +	size_t merged = 0;
> +
> +	while (size) {
> +		phys = iommu_iova_to_phys(domain, iova);
> +		cont_size = PAGE_SIZE;
> +		p = phys + cont_size;
> +		i = iova + cont_size;
> +
> +		while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
> +			p += PAGE_SIZE;
> +			i += PAGE_SIZE;
> +			cont_size += PAGE_SIZE;
> +		}
> +
> +		merged_size = __iommu_merge_page(domain, iova, phys, cont_size,
> +				prot);

This is incredibly silly. The amount of time you'll spend just on 
walking the tables in all those iova_to_phys() calls is probably 
significantly more than it would take the low-level pagetable code to do 
the entire operation for itself. At this level, any knowledge of how 
mappings are actually constructed is lost once __iommu_map() returns, so 
we just don't know, and for this operation in particular there seems 
little point in trying to guess - the driver backend still has to figure 
out whether something we *think* might be mergeable actually is, so it's 
better off doing the entire operation in a single pass by itself.

There's especially little point in starting all this work *before* 
checking that it's even possible...

Robin.

> +		iova += merged_size;
> +		size -= merged_size;
> +		merged += merged_size;
> +
> +		if (merged_size != cont_size)
> +			break;
> +	}
> +	iommu_flush_iotlb_all(domain);
> +
> +	return merged;
> +}
> +EXPORT_SYMBOL_GPL(iommu_merge_page);
> +
>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>   {
>   	const struct iommu_ops *ops = dev->bus->iommu_ops;
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index b87c6f4ecaa2..754b62a1bbaf 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -164,6 +164,8 @@ struct io_pgtable_ops {
>   				    unsigned long iova);
>   	size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
>   			      size_t size);
> +	size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
> +			     phys_addr_t phys, size_t size, int prot);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index abeb811098a5..ac2b0b1bce0f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -260,6 +260,8 @@ struct iommu_ops {
>   			       enum iommu_attr attr, void *data);
>   	size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
>   			      size_t size);
> +	size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
> +			     phys_addr_t phys, size_t size, int prot);
>   
>   	/* Request/Free a list of reserved regions for a device */
>   	void (*get_resv_regions)(struct device *dev, struct list_head *list);
> @@ -513,6 +515,8 @@ extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
>   				 void *data);
>   extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>   				size_t size);
> +extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
> +			       size_t size, int prot);
>   
>   /* Window handling function prototypes */
>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> @@ -913,6 +917,12 @@ static inline size_t iommu_split_block(struct iommu_domain *domain,
>   	return 0;
>   }
>   
> +static inline size_t iommu_merge_page(struct iommu_domain *domain,
> +				      unsigned long iova, size_t size, int prot)
> +{
> +	return -EINVAL;
> +}
> +
>   static inline int  iommu_device_register(struct iommu_device *iommu)
>   {
>   	return -ENODEV;
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  2021-01-28 15:17 ` [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log Keqian Zhu
@ 2021-02-04 19:52   ` Robin Murphy
  2021-02-07 12:41     ` Keqian Zhu
  2021-02-08  1:17     ` Keqian Zhu
  0 siblings, 2 replies; 36+ messages in thread
From: Robin Murphy @ 2021-02-04 19:52 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

On 2021-01-28 15:17, Keqian Zhu wrote:
> From: jiangkunkun <jiangkunkun@huawei.com>
> 
> During dirty log tracking, user will try to retrieve dirty log from
> iommu if it supports hardware dirty log. This adds a new interface
> named sync_dirty_log in iommu layer and arm smmuv3 implements it,
> which scans leaf TTDs and treats them as dirty if they are writable (as
> we only enable HTTU for stage 1, this checks that AP[2] is not set).
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++++++
>   drivers/iommu/io-pgtable-arm.c              | 90 +++++++++++++++++++++
>   drivers/iommu/iommu.c                       | 41 ++++++++++
>   include/linux/io-pgtable.h                  |  4 +
>   include/linux/iommu.h                       | 17 ++++
>   5 files changed, 179 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2434519e4bb6..43d0536b429a 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iov
>   	return ops->merge_page(ops, iova, paddr, size, prot);
>   }
>   
> +static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
> +				   unsigned long iova, size_t size,
> +				   unsigned long *bitmap,
> +				   unsigned long base_iova,
> +				   unsigned long bitmap_pgshift)
> +{
> +	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> +	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
> +		dev_err(smmu->dev, "don't support HTTU_HD and sync dirty log\n");
> +		return -EPERM;
> +	}
> +
> +	if (!ops || !ops->sync_dirty_log) {
> +		pr_err("don't support sync dirty log\n");
> +		return -ENODEV;
> +	}
> +
> +	/* To ensure all inflight transactions are completed */
> +	arm_smmu_flush_iotlb_all(domain);

What about transactions that arrive between the point that this 
completes, and the point - potentially much later - that we actually 
access any given PTE during the walk? I don't see what this is supposed 
to be synchronising against, even if it were just a CMD_SYNC (I 
especially don't see why we'd want to knock out the TLBs).

> +
> +	return ops->sync_dirty_log(ops, iova, size, bitmap,
> +			base_iova, bitmap_pgshift);
> +}
> +
>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
>   {
>   	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
>   	.domain_set_attr	= arm_smmu_domain_set_attr,
>   	.split_block		= arm_smmu_split_block,
>   	.merge_page		= arm_smmu_merge_page,
> +	.sync_dirty_log		= arm_smmu_sync_dirty_log,
>   	.of_xlate		= arm_smmu_of_xlate,
>   	.get_resv_regions	= arm_smmu_get_resv_regions,
>   	.put_resv_regions	= generic_iommu_put_resv_regions,
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 17390f258eb1..6cfe1ef3fedd 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long iova
>   	return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
>   }
>   
> +static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
> +				     unsigned long iova, size_t size,
> +				     int lvl, arm_lpae_iopte *ptep,
> +				     unsigned long *bitmap,
> +				     unsigned long base_iova,
> +				     unsigned long bitmap_pgshift)
> +{
> +	arm_lpae_iopte pte;
> +	struct io_pgtable *iop = &data->iop;
> +	size_t base, next_size;
> +	unsigned long offset;
> +	int nbits, ret;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return -EINVAL;
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> +		if (iopte_leaf(pte, lvl, iop->fmt)) {
> +			if (pte & ARM_LPAE_PTE_AP_RDONLY)
> +				return 0;
> +
> +			/* It is writable, set the bitmap */
> +			nbits = size >> bitmap_pgshift;
> +			offset = (iova - base_iova) >> bitmap_pgshift;
> +			bitmap_set(bitmap, offset, nbits);
> +			return 0;
> +		} else {
> +			/* To traverse next level */
> +			next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
> +			ptep = iopte_deref(pte, data);
> +			for (base = 0; base < size; base += next_size) {
> +				ret = __arm_lpae_sync_dirty_log(data,
> +						iova + base, next_size, lvl + 1,
> +						ptep, bitmap, base_iova, bitmap_pgshift);
> +				if (ret)
> +					return ret;
> +			}
> +			return 0;
> +		}
> +	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
> +		if (pte & ARM_LPAE_PTE_AP_RDONLY)
> +			return 0;
> +
> +		/* Size is smaller than a full block, but still set the bitmap */
> +		nbits = size >> bitmap_pgshift;
> +		offset = (iova - base_iova) >> bitmap_pgshift;
> +		bitmap_set(bitmap, offset, nbits);
> +		return 0;
> +	}
> +
> +	/* Keep on walking */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_sync_dirty_log(data, iova, size, lvl + 1, ptep,
> +			bitmap, base_iova, bitmap_pgshift);
> +}
> +
> +static int arm_lpae_sync_dirty_log(struct io_pgtable_ops *ops,
> +				   unsigned long iova, size_t size,
> +				   unsigned long *bitmap,
> +				   unsigned long base_iova,
> +				   unsigned long bitmap_pgshift)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return -EINVAL;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return -EINVAL;
> +
> +	if (data->iop.fmt != ARM_64_LPAE_S1 &&
> +	    data->iop.fmt != ARM_32_LPAE_S1)
> +		return -EINVAL;
> +
> +	return __arm_lpae_sync_dirty_log(data, iova, size, lvl, ptep,
> +					 bitmap, base_iova, bitmap_pgshift);
> +}
> +
>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>   {
>   	unsigned long granule, page_sizes;
> @@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>   		.iova_to_phys	= arm_lpae_iova_to_phys,
>   		.split_block	= arm_lpae_split_block,
>   		.merge_page	= arm_lpae_merge_page,
> +		.sync_dirty_log	= arm_lpae_sync_dirty_log,
>   	};
>   
>   	return data;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index f1261da11ea8..69f268069282 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>   }
>   EXPORT_SYMBOL_GPL(iommu_merge_page);
>   
> +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
> +			 size_t size, unsigned long *bitmap,
> +			 unsigned long base_iova, unsigned long bitmap_pgshift)
> +{
> +	const struct iommu_ops *ops = domain->ops;
> +	unsigned int min_pagesz;
> +	size_t pgsize;
> +	int ret;
> +
> +	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> +
> +	if (!IS_ALIGNED(iova | size, min_pagesz)) {
> +		pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
> +		       iova, size, min_pagesz);
> +		return -EINVAL;
> +	}
> +
> +	if (!ops || !ops->sync_dirty_log) {
> +		pr_err("don't support sync dirty log\n");
> +		return -ENODEV;
> +	}
> +
> +	while (size) {
> +		pgsize = iommu_pgsize(domain, iova, size);
> +
> +		ret = ops->sync_dirty_log(domain, iova, pgsize,
> +					  bitmap, base_iova, bitmap_pgshift);

Once again, we have a worst-of-both-worlds iteration that doesn't make 
much sense. iommu_pgsize() essentially tells you the best supported size 
that an IOVA range *can* be mapped with, but we're iterating a range 
that's already mapped, so we don't know if it's relevant, and either way 
it may not bear any relation to the granularity of the bitmap, which is 
presumably what actually matters.

Logically, either we should iterate at the bitmap granularity here, and 
the driver just says whether the given iova chunk contains any dirty 
pages or not, or we just pass everything through to the driver and let 
it do the whole job itself. Doing a little bit of both is just an 
overcomplicated mess.
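
As a rough illustration of the first option (the range_is_dirty() callback here is hypothetical, not something this series defines):

	int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
				 size_t size, unsigned long *bitmap,
				 unsigned long base_iova, unsigned long bitmap_pgshift)
	{
		size_t chunk = 1UL << bitmap_pgshift;

		/* Iterate at the granularity the bitmap actually tracks */
		while (size) {
			/* Hypothetical op: "does this IOVA chunk contain anything dirty?" */
			if (domain->ops->range_is_dirty(domain, iova, chunk))
				bitmap_set(bitmap, (iova - base_iova) >> bitmap_pgshift, 1);

			iova += chunk;
			size -= chunk;
		}

		return 0;
	}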

I'm skimming patch #7 and pretty much the same comments apply, so I 
can't be bothered to repeat them there...

Robin.

> +		if (ret)
> +			break;
> +
> +		pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
> +			 pgsize);
> +
> +		iova += pgsize;
> +		size -= pgsize;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
> +
>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>   {
>   	const struct iommu_ops *ops = dev->bus->iommu_ops;
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 754b62a1bbaf..f44551e4a454 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>   			      size_t size);
>   	size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>   			     phys_addr_t phys, size_t size, int prot);
> +	int (*sync_dirty_log)(struct io_pgtable_ops *ops,
> +			      unsigned long iova, size_t size,
> +			      unsigned long *bitmap, unsigned long base_iova,
> +			      unsigned long bitmap_pgshift);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index ac2b0b1bce0f..8069c8375e63 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -262,6 +262,10 @@ struct iommu_ops {
>   			      size_t size);
>   	size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
>   			     phys_addr_t phys, size_t size, int prot);
> +	int (*sync_dirty_log)(struct iommu_domain *domain,
> +			      unsigned long iova, size_t size,
> +			      unsigned long *bitmap, unsigned long base_iova,
> +			      unsigned long bitmap_pgshift);
>   
>   	/* Request/Free a list of reserved regions for a device */
>   	void (*get_resv_regions)(struct device *dev, struct list_head *list);
> @@ -517,6 +521,10 @@ extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>   				size_t size);
>   extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>   			       size_t size, int prot);
> +extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
> +				size_t size, unsigned long *bitmap,
> +				unsigned long base_iova,
> +				unsigned long bitmap_pgshift);
>   
>   /* Window handling function prototypes */
>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> @@ -923,6 +931,15 @@ static inline size_t iommu_merge_page(struct iommu_domain *domain,
>   	return -EINVAL;
>   }
>   
> +static inline int iommu_sync_dirty_log(struct iommu_domain *domain,
> +				       unsigned long iova, size_t size,
> +				       unsigned long *bitmap,
> +				       unsigned long base_iova,
> +				       unsigned long pgshift)
> +{
> +	return -EINVAL;
> +}
> +
>   static inline int  iommu_device_register(struct iommu_device *iommu)
>   {
>   	return -ENODEV;
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-04 19:50   ` Robin Murphy
@ 2021-02-05  9:13     ` Keqian Zhu
  2021-02-05  9:51       ` Jean-Philippe Brucker
  2021-02-05 11:48       ` Robin Murphy
  2021-03-02  7:42     ` Keqian Zhu
  1 sibling, 2 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-05  9:13 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, Cornelia Huck, Suzuki K Poulose,
	Marc Zyngier, jiangkunkun, Kirti Wankhede, lushenming,
	Alex Williamson, James Morse, Catalin Marinas, wanghaibin.wang,
	Will Deacon

Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>           .pgsize_bitmap    = smmu->pgsize_bitmap,
>>           .ias        = ias,
>>           .oas        = oas,
>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>           .coherent_walk    = smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>           .tlb        = &arm_smmu_flush_ops,
>>           .iommu_dev    = smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>       if (reg & IDR0_HYP)
>>           smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +    switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concerns about that:

If the override flag advertises HTTU but the hardware doesn't support it, the driver will try to use this feature and receive unexpected access or permission faults from the SMMU.
1) If IOPF is not supported, the kernel cannot work normally.
2) If IOPF is supported, the kernel will perform useless actions, such as HTTU-based DMA dirty tracking (this series).

As the IORT spec doesn't give an explicit explanation of the HTTU override, can we interpret it as a mask over the corresponding hardware register field?
So the logic becomes: smmu->features = HTTU override & IDR0_HTTU;

> 
>> +    case IDR0_HTTU_NONE:
>> +        break;
>> +    case IDR0_HTTU_HA:
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +        break;
>> +    case IDR0_HTTU_HAD:
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +        break;
>> +    default:
>> +        dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +        return -ENXIO;
>> +    }
>> +
>>       /*
>>        * The coherency feature as set by FW is used in preference to the ID
>>        * register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16            (1 << 12)
>>   #define IDR0_ATS            (1 << 10)
>>   #define IDR0_HYP            (1 << 9)
>> +#define IDR0_HTTU            GENMASK(7, 6)
>> +#define IDR0_HTTU_NONE            0
>> +#define IDR0_HTTU_HA            1
>> +#define IDR0_HTTU_HAD            2
>>   #define IDR0_COHACC            (1 << 4)
>>   #define IDR0_TTF            GENMASK(3, 2)
>>   #define IDR0_TTF_AARCH64        2
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0        (1ULL << 38)
>>     #define CTXDESC_CD_0_AA64        (1UL << 41)
>> +#define CTXDESC_CD_0_HD            (1UL << 42)
>> +#define CTXDESC_CD_0_HA            (1UL << 43)
>>   #define CTXDESC_CD_0_S            (1UL << 44)
>>   #define CTXDESC_CD_0_R            (1UL << 45)
>>   #define CTXDESC_CD_0_A            (1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV        (1 << 15)
>>   #define ARM_SMMU_FEAT_BTM        (1 << 16)
>>   #define ARM_SMMU_FEAT_SVA        (1 << 17)
>> +#define ARM_SMMU_FEAT_HTTU_HA        (1 << 18)
>> +#define ARM_SMMU_FEAT_HTTU_HD        (1 << 19)
>>       u32                features;
>>     #define ARM_SMMU_OPT_SKIP_PREFETCH    (1 << 0)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index ea727eb1a1a9..1a00ea8562c7 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>       unsigned long            pgsize_bitmap;
>>       unsigned int            ias;
>>       unsigned int            oas;
>> +    bool                httu_hd;
> 
> This is very specific to the AArch64 stage 1 format, not a generic capability - I think it should be a quirk flag rather than a common field.
OK, so BBML should be a quirk flag too?

Though the word "quirk" doesn't feel quite right for HTTU and BBML, we have no other place to convey SMMU features to io-pgtable.

Maybe we could add another field named "iommu_features"? Anyway, I am OK with putting them into quirk flags. :-)

> 
> Robin.
> 
> [1] https://jpbrucker.net/git/linux/commit/?h=sva/current&id=1ef7d512fb9082450dfe0d22ca4f7e35625a097b
> 
>>       bool                coherent_walk;
>>       const struct iommu_flush_ops    *tlb;
>>       struct device            *iommu_dev;
>>
> .
> 
Thanks,
Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05  9:13     ` Keqian Zhu
@ 2021-02-05  9:51       ` Jean-Philippe Brucker
  2021-02-07  1:42         ` Keqian Zhu
  2021-02-05 11:48       ` Robin Murphy
  1 sibling, 1 reply; 36+ messages in thread
From: Jean-Philippe Brucker @ 2021-02-05  9:51 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: Mark Rutland, Tian, Kevin, Cornelia Huck, Alex Williamson, kvm,
	Suzuki K Poulose, Will Deacon, jiangkunkun, lushenming,
	linux-kernel, Kirti Wankhede, Catalin Marinas, iommu,
	James Morse, Marc Zyngier, wanghaibin.wang, Robin Murphy, kvmarm,
	linux-arm-kernel

Hi Keqian,

On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
> > We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
> Robin, Thanks for pointing it out.
> 
> Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concern about it:
> 
> If the override flag has HTTU but hardware doesn't support it, then driver will use this feature but receive access fault or permission fault from SMMU unexpectedly.
> 1) If IOPF is not supported, then kernel can not work normally.
> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU based dma dirty tracking (this series).
> 
> As the IORT spec doesn't give an explicit explanation for HTTU override, can we comprehend it as a mask for HTTU related hardware register?

To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
that's both ways, since there is no validity mask for the IORT value: if
there is an IORT table, always ignore SMMU_IDR0.HTTU.

That's how the SMMU driver implements the COHACC bit, which has the same
wording in IORT. So I think we should implement HTTU the same way.
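
In other words, something along these lines (a rough sketch with made-up variable names; the actual patch has the details):

	if (fw_httu_valid)				/* booted via IORT */
		httu = fw_httu;				/* value from the IORT node */
	else						/* DT, or no override available */
		httu = FIELD_GET(IDR0_HTTU, reg);	/* fall back to SMMU_IDR0 */

	if (httu >= IDR0_HTTU_HA)
		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
	if (httu == IDR0_HTTU_HAD)
		smmu->features |= ARM_SMMU_FEAT_HTTU_HD;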

One complication is that there is no equivalent override for device tree.
I think it can be added later if necessary, because unlike IORT it can be
tri-state (property not present, overridden positive, overridden negative).

Thanks,
Jean


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05  9:13     ` Keqian Zhu
  2021-02-05  9:51       ` Jean-Philippe Brucker
@ 2021-02-05 11:48       ` Robin Murphy
  2021-02-05 16:11         ` Robin Murphy
  2021-02-07  2:19         ` Keqian Zhu
  1 sibling, 2 replies; 36+ messages in thread
From: Robin Murphy @ 2021-02-05 11:48 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, Cornelia Huck, Suzuki K Poulose,
	Marc Zyngier, jiangkunkun, Kirti Wankhede, lushenming,
	Alex Williamson, James Morse, Catalin Marinas, wanghaibin.wang,
	Will Deacon

On 2021-02-05 09:13, Keqian Zhu wrote:
> Hi Robin and Jean,
> 
> On 2021/2/5 3:50, Robin Murphy wrote:
>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>> From: jiangkunkun <jiangkunkun@huawei.com>
>>>
>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>> update the access flag and the dirty state of TTD by hardware. It is
>>> essential to track dirty pages of DMA.
>>>
>>> This adds feature detection, none functional change.
>>>
>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>> ---
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>>    include/linux/io-pgtable.h                  |  1 +
>>>    3 files changed, 25 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>>            .pgsize_bitmap    = smmu->pgsize_bitmap,
>>>            .ias        = ias,
>>>            .oas        = oas,
>>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>            .coherent_walk    = smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>            .tlb        = &arm_smmu_flush_ops,
>>>            .iommu_dev    = smmu->dev,
>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>>        if (reg & IDR0_HYP)
>>>            smmu->features |= ARM_SMMU_FEAT_HYP;
>>>    +    switch (FIELD_GET(IDR0_HTTU, reg)) {
>>
>> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
> Robin, Thanks for pointing it out.
> 
> Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concern about it:
> 
> If the override flag has HTTU but hardware doesn't support it, then driver will use this feature but receive access fault or permission fault from SMMU unexpectedly.
> 1) If IOPF is not supported, then kernel can not work normally.
> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU based dma dirty tracking (this series).

Yes, if the IORT describes the SMMU incorrectly, things will not work 
well. Just like if it describes the wrong base address or the wrong 
interrupt numbers, things will also not work well. The point is that 
incorrect firmware can be updated in the field fairly easily; incorrect 
hardware can not.

Say the SMMU designer hard-codes the ID register field to 0x2 because 
the SMMU itself is capable of HTTU, and they assume it's always going to 
be wired up coherently, but then a customer integrates it to a 
non-coherent interconnect. Firmware needs to override that value to 
prevent an OS thinking that the claimed HTTU capability is ever going to 
work.

Or say the SMMU *is* integrated correctly, but due to an erratum 
discovered later in the interconnect or SMMU itself, it turns out DBM 
doesn't always work reliably, but AF is still OK. Firmware needs to 
downgrade the indicated level of support from that which was intended to 
that which works reliably.

Or say someone forgets to set an integration tieoff so their SMMU 
reports 0x0 even though it and the interconnect *can* happily support 
HTTU. In that case, firmware may want to upgrade the value to *allow* an 
OS to use HTTU despite the ID register being wrong.

> As the IORT spec doesn't give an explicit explanation for HTTU override, can we comprehend it as a mask for HTTU related hardware register?
> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;

No, it literally states that the OS must use the value of the firmware 
field *instead* of the value from the hardware field.

>>> +    case IDR0_HTTU_NONE:
>>> +        break;
>>> +    case IDR0_HTTU_HA:
>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>> +        break;
>>> +    case IDR0_HTTU_HAD:
>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>>> +        break;
>>> +    default:
>>> +        dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>>> +        return -ENXIO;
>>> +    }
>>> +
>>>        /*
>>>         * The coherency feature as set by FW is used in preference to the ID
>>>         * register, but warn on mismatch.
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> index 96c2e9565e00..e91bea44519e 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> @@ -33,6 +33,10 @@
>>>    #define IDR0_ASID16            (1 << 12)
>>>    #define IDR0_ATS            (1 << 10)
>>>    #define IDR0_HYP            (1 << 9)
>>> +#define IDR0_HTTU            GENMASK(7, 6)
>>> +#define IDR0_HTTU_NONE            0
>>> +#define IDR0_HTTU_HA            1
>>> +#define IDR0_HTTU_HAD            2
>>>    #define IDR0_COHACC            (1 << 4)
>>>    #define IDR0_TTF            GENMASK(3, 2)
>>>    #define IDR0_TTF_AARCH64        2
>>> @@ -286,6 +290,8 @@
>>>    #define CTXDESC_CD_0_TCR_TBI0        (1ULL << 38)
>>>      #define CTXDESC_CD_0_AA64        (1UL << 41)
>>> +#define CTXDESC_CD_0_HD            (1UL << 42)
>>> +#define CTXDESC_CD_0_HA            (1UL << 43)
>>>    #define CTXDESC_CD_0_S            (1UL << 44)
>>>    #define CTXDESC_CD_0_R            (1UL << 45)
>>>    #define CTXDESC_CD_0_A            (1UL << 46)
>>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>>    #define ARM_SMMU_FEAT_RANGE_INV        (1 << 15)
>>>    #define ARM_SMMU_FEAT_BTM        (1 << 16)
>>>    #define ARM_SMMU_FEAT_SVA        (1 << 17)
>>> +#define ARM_SMMU_FEAT_HTTU_HA        (1 << 18)
>>> +#define ARM_SMMU_FEAT_HTTU_HD        (1 << 19)
>>>        u32                features;
>>>      #define ARM_SMMU_OPT_SKIP_PREFETCH    (1 << 0)
>>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>>> index ea727eb1a1a9..1a00ea8562c7 100644
>>> --- a/include/linux/io-pgtable.h
>>> +++ b/include/linux/io-pgtable.h
>>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>>        unsigned long            pgsize_bitmap;
>>>        unsigned int            ias;
>>>        unsigned int            oas;
>>> +    bool                httu_hd;
>>
>> This is very specific to the AArch64 stage 1 format, not a generic capability - I think it should be a quirk flag rather than a common field.
> OK, so BBML should be a quirk flag too?
> 
> Though the word "quirk" is not suitable for HTTU and BBML, we have no other place to convey smmu feature to io-pgtable.

Indeed these features aren't decorative grooves on a piece of furniture, 
but in the case of io-pgtable we're merely using "quirk" in its broadest 
sense to imply something that differs from the baseline default 
behaviour - ARM_MTK_EXT, ARM_TTBR1 and ARM_OUTER_WBWA (or whatever it's 
called this week) are all just indicating extra hardware features 
entirely comparable to HTTU; NON_STRICT is describing a similarly 
intentional and desired software behaviour. In fact only ARM_NS 
represents something that could be considered a "workaround".
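
For example (the quirk name and bit position below are purely illustrative), the existing quirks field already gives you somewhere to put it:

	/* include/linux/io-pgtable.h */
	#define IO_PGTABLE_QUIRK_ARM_HD		BIT(7)

	/* arm_smmu_domain_finalise(), instead of a new io_pgtable_cfg field */
	if (smmu->features & ARM_SMMU_FEAT_HTTU_HD)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;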

Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05 11:48       ` Robin Murphy
@ 2021-02-05 16:11         ` Robin Murphy
  2021-02-07  1:56           ` Keqian Zhu
  2021-02-07  2:19         ` Keqian Zhu
  1 sibling, 1 reply; 36+ messages in thread
From: Robin Murphy @ 2021-02-05 16:11 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, jiangkunkun, Suzuki K Poulose,
	Marc Zyngier, Cornelia Huck, Alex Williamson, lushenming,
	Kirti Wankhede, James Morse, Catalin Marinas, wanghaibin.wang,
	Will Deacon

On 2021-02-05 11:48, Robin Murphy wrote:
> On 2021-02-05 09:13, Keqian Zhu wrote:
>> Hi Robin and Jean,
>>
>> On 2021/2/5 3:50, Robin Murphy wrote:
>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>> From: jiangkunkun <jiangkunkun@huawei.com>
>>>>
>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>> essential to track dirty pages of DMA.
>>>>
>>>> This adds feature detection, none functional change.
>>>>
>>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>>> ---
>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>>>    include/linux/io-pgtable.h                  |  1 +
>>>>    3 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>>>> iommu_domain *domain,
>>>>            .pgsize_bitmap    = smmu->pgsize_bitmap,
>>>>            .ias        = ias,
>>>>            .oas        = oas,
>>>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>            .coherent_walk    = smmu->features & 
>>>> ARM_SMMU_FEAT_COHERENCY,
>>>>            .tlb        = &arm_smmu_flush_ops,
>>>>            .iommu_dev    = smmu->dev,
>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>>>> arm_smmu_device *smmu)
>>>>        if (reg & IDR0_HYP)
>>>>            smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>    +    switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>
>>> We need to accommodate the firmware override as well if we need this 
>>> to be meaningful. Jean-Philippe is already carrying a suitable patch 
>>> in the SVA stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register 
>> info unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then 
>> driver will use this feature but receive access fault or permission 
>> fault from SMMU unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as 
>> HTTU based dma dirty tracking (this series).
> 
> Yes, if the IORT describes the SMMU incorrectly, things will not work 
> well. Just like if it describes the wrong base address or the wrong 
> interrupt numbers, things will also not work well. The point is that 
> incorrect firmware can be updated in the field fairly easily; incorrect 
> hardware can not.
> 
> Say the SMMU designer hard-codes the ID register field to 0x2 because 
> the SMMU itself is capable of HTTU, and they assume it's always going to 
> be wired up coherently, but then a customer integrates it to a 
> non-coherent interconnect. Firmware needs to override that value to 
> prevent an OS thinking that the claimed HTTU capability is ever going to 
> work.
> 
> Or say the SMMU *is* integrated correctly, but due to an erratum 
> discovered later in the interconnect or SMMU itself, it turns out DBM 
> doesn't always work reliably, but AF is still OK. Firmware needs to 
> downgrade the indicated level of support from that which was intended to 
> that which works reliably.
> 
> Or say someone forgets to set an integration tieoff so their SMMU 
> reports 0x0 even though it and the interconnect *can* happily support 
> HTTU. In that case, firmware may want to upgrade the value to *allow* an 
> OS to use HTTU despite the ID register being wrong.
> 
>> As the IORT spec doesn't give an explicit explanation for HTTU 
>> override, can we comprehend it as a mask for HTTU related hardware 
>> register?
>> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
> 
> No, it literally states that the OS must use the value of the firmware 
> field *instead* of the value from the hardware field.

Oops, apologies for an oversight there - I've been reviewing IORT spec 
updates lately so naturally had the newest version open already. Turns 
out these descriptions were only clarified in the most recent release, 
so if you were looking at an older document they *were* horribly vague.

Robin.

>>>> +    case IDR0_HTTU_NONE:
>>>> +        break;
>>>> +    case IDR0_HTTU_HA:
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>>> +        break;
>>>> +    case IDR0_HTTU_HAD:
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>>>> +        break;
>>>> +    default:
>>>> +        dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>>>> +        return -ENXIO;
>>>> +    }
>>>> +
>>>>        /*
>>>>         * The coherency feature as set by FW is used in preference 
>>>> to the ID
>>>>         * register, but warn on mismatch.
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> index 96c2e9565e00..e91bea44519e 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> @@ -33,6 +33,10 @@
>>>>    #define IDR0_ASID16            (1 << 12)
>>>>    #define IDR0_ATS            (1 << 10)
>>>>    #define IDR0_HYP            (1 << 9)
>>>> +#define IDR0_HTTU            GENMASK(7, 6)
>>>> +#define IDR0_HTTU_NONE            0
>>>> +#define IDR0_HTTU_HA            1
>>>> +#define IDR0_HTTU_HAD            2
>>>>    #define IDR0_COHACC            (1 << 4)
>>>>    #define IDR0_TTF            GENMASK(3, 2)
>>>>    #define IDR0_TTF_AARCH64        2
>>>> @@ -286,6 +290,8 @@
>>>>    #define CTXDESC_CD_0_TCR_TBI0        (1ULL << 38)
>>>>      #define CTXDESC_CD_0_AA64        (1UL << 41)
>>>> +#define CTXDESC_CD_0_HD            (1UL << 42)
>>>> +#define CTXDESC_CD_0_HA            (1UL << 43)
>>>>    #define CTXDESC_CD_0_S            (1UL << 44)
>>>>    #define CTXDESC_CD_0_R            (1UL << 45)
>>>>    #define CTXDESC_CD_0_A            (1UL << 46)
>>>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>>>    #define ARM_SMMU_FEAT_RANGE_INV        (1 << 15)
>>>>    #define ARM_SMMU_FEAT_BTM        (1 << 16)
>>>>    #define ARM_SMMU_FEAT_SVA        (1 << 17)
>>>> +#define ARM_SMMU_FEAT_HTTU_HA        (1 << 18)
>>>> +#define ARM_SMMU_FEAT_HTTU_HD        (1 << 19)
>>>>        u32                features;
>>>>      #define ARM_SMMU_OPT_SKIP_PREFETCH    (1 << 0)
>>>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>>>> index ea727eb1a1a9..1a00ea8562c7 100644
>>>> --- a/include/linux/io-pgtable.h
>>>> +++ b/include/linux/io-pgtable.h
>>>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>>>        unsigned long            pgsize_bitmap;
>>>>        unsigned int            ias;
>>>>        unsigned int            oas;
>>>> +    bool                httu_hd;
>>>
>>> This is very specific to the AArch64 stage 1 format, not a generic 
>>> capability - I think it should be a quirk flag rather than a common 
>>> field.
>> OK, so BBML should be a quirk flag too?
>>
>> Though the word "quirk" is not suitable for HTTU and BBML, we have no 
>> other place to convey smmu feature to io-pgtable.
> 
> Indeed these features aren't decorative grooves on a piece of furniture, 
> but in the case of io-pgtable we're merely using "quirk" in its broadest 
> sense to imply something that differs from the baseline default 
> behaviour - ARM_MTK_EXT, ARM_TTBR1 and ARM_OUTER_WBWA (or whatever it's 
> called this week) are all just indicating extra hardware features 
> entirely comparable to HTTU; NON_STRICT is describing a similarly 
> intentional and desired software behaviour. In fact only ARM_NS 
> represents something that could be considered a "workaround".
> 
> Robin.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05  9:51       ` Jean-Philippe Brucker
@ 2021-02-07  1:42         ` Keqian Zhu
  0 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07  1:42 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, Cornelia Huck, Alex Williamson, kvm,
	Suzuki K Poulose, Will Deacon, jiangkunkun, lushenming,
	linux-kernel, Kirti Wankhede, Catalin Marinas, iommu,
	James Morse, Marc Zyngier, wanghaibin.wang, Robin Murphy, kvmarm,
	linux-arm-kernel

Hi Jean,

On 2021/2/5 17:51, Jean-Philippe Brucker wrote:
> Hi Keqian,
> 
> On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
>>> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver will use this feature but receive access fault or permission fault from SMMU unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU based dma dirty tracking (this series).
>>
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can we comprehend it as a mask for HTTU related hardware register?
> 
> To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
> the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
> that's both ways, since there is no validity mask for the IORT value: if
> there is an IORT table, always ignore SMMU_IDR0.HTTU.
> 
> That's how the SMMU driver implements the COHACC bit, which has the same
> wording in IORT. So I think we should implement HTTU the same way.
OK, and Robin said that the latest IORT spec literally states it.

> 
> One complication is that there is no equivalent override for device tree.
> I think it can be added later if necessary, because unlike IORT it can be
> tri state (property not present, overriden positive, overridden negative).
Yeah, that would be more flexible. ;-)

> 
> Thanks,
> Jean
> 
> .
> 
Thanks,
Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05 16:11         ` Robin Murphy
@ 2021-02-07  1:56           ` Keqian Zhu
  0 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07  1:56 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, jiangkunkun, Suzuki K Poulose,
	Marc Zyngier, Cornelia Huck, Alex Williamson, lushenming,
	Kirti Wankhede, James Morse, Catalin Marinas, wanghaibin.wang,
	Will Deacon

Hi Robin,

On 2021/2/6 0:11, Robin Murphy wrote:
> On 2021-02-05 11:48, Robin Murphy wrote:
>> On 2021-02-05 09:13, Keqian Zhu wrote:
>>> Hi Robin and Jean,
>>>
>>> On 2021/2/5 3:50, Robin Murphy wrote:
>>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>>> From: jiangkunkun <jiangkunkun@huawei.com>
>>>>>
>>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>>> essential to track dirty pages of DMA.
>>>>>
>>>>> This adds feature detection, none functional change.
>>>>>
>>>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>>>> ---
>>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>>>>    include/linux/io-pgtable.h                  |  1 +
>>>>>    3 files changed, 25 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>>>>            .pgsize_bitmap    = smmu->pgsize_bitmap,
>>>>>            .ias        = ias,
>>>>>            .oas        = oas,
>>>>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>>            .coherent_walk    = smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>>            .tlb        = &arm_smmu_flush_ops,
>>>>>            .iommu_dev    = smmu->dev,
>>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>>>>        if (reg & IDR0_HYP)
>>>>>            smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>>    +    switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>>
>>>> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
>>> Robin, Thanks for pointing it out.
>>>
>>> Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concern about it:
>>>
>>> If the override flag has HTTU but hardware doesn't support it, then driver will use this feature but receive access fault or permission fault from SMMU unexpectedly.
>>> 1) If IOPF is not supported, then kernel can not work normally.
>>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU based dma dirty tracking (this series).
>>
>> Yes, if the IORT describes the SMMU incorrectly, things will not work well. Just like if it describes the wrong base address or the wrong interrupt numbers, things will also not work well. The point is that incorrect firmware can be updated in the field fairly easily; incorrect hardware can not.
>>
>> Say the SMMU designer hard-codes the ID register field to 0x2 because the SMMU itself is capable of HTTU, and they assume it's always going to be wired up coherently, but then a customer integrates it to a non-coherent interconnect. Firmware needs to override that value to prevent an OS thinking that the claimed HTTU capability is ever going to work.
>>
>> Or say the SMMU *is* integrated correctly, but due to an erratum discovered later in the interconnect or SMMU itself, it turns out DBM doesn't always work reliably, but AF is still OK. Firmware needs to downgrade the indicated level of support from that which was intended to that which works reliably.
>>
>> Or say someone forgets to set an integration tieoff so their SMMU reports 0x0 even though it and the interconnect *can* happily support HTTU. In that case, firmware may want to upgrade the value to *allow* an OS to use HTTU despite the ID register being wrong.
>>
>>> As the IORT spec doesn't give an explicit explanation for HTTU override, can we comprehend it as a mask for HTTU related hardware register?
>>> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
>>
>> No, it literally states that the OS must use the value of the firmware field *instead* of the value from the hardware field.
> 
> Oops, apologies for an oversight there - I've been reviewing IORT spec updates lately so naturally had the newest version open already. Turns out these descriptions were only clarified in the most recent release, so if you were looking at an older document they *were* horribly vague.
Yep, my local version is E, which was released in July 2020. I downloaded version E.a just now, thanks. ;-)

Thanks,
Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-05 11:48       ` Robin Murphy
  2021-02-05 16:11         ` Robin Murphy
@ 2021-02-07  2:19         ` Keqian Zhu
  1 sibling, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07  2:19 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Jean-Philippe Brucker
  Cc: Mark Rutland, Tian, Kevin, Cornelia Huck, Suzuki K Poulose,
	Marc Zyngier, jiangkunkun, Kirti Wankhede, lushenming,
	Alex Williamson, James Morse, Catalin Marinas, wanghaibin.wang,
	Will Deacon

Hi Robin,

On 2021/2/5 19:48, Robin Murphy wrote:
> On 2021-02-05 09:13, Keqian Zhu wrote:
>> Hi Robin and Jean,
>>
>> On 2021/2/5 3:50, Robin Murphy wrote:
>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>> From: jiangkunkun <jiangkunkun@huawei.com>
>>>>
>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>> essential to track dirty pages of DMA.
>>>>
>>>> This adds feature detection, none functional change.
>>>>
>>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>>> ---
>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>>>    include/linux/io-pgtable.h                  |  1 +
>>>>    3 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>>>            .pgsize_bitmap    = smmu->pgsize_bitmap,
>>>>            .ias        = ias,
>>>>            .oas        = oas,
>>>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>            .coherent_walk    = smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>            .tlb        = &arm_smmu_flush_ops,
>>>>            .iommu_dev    = smmu->dev,
>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>>>        if (reg & IDR0_HYP)
>>>>            smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>    +    switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>
>>> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver will use this feature but receive access fault or permission fault from SMMU unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU based dma dirty tracking (this series).
> 
> Yes, if the IORT describes the SMMU incorrectly, things will not work well. Just like if it describes the wrong base address or the wrong interrupt numbers, things will also not work well. The point is that incorrect firmware can be updated in the field fairly easily; incorrect hardware can not.
Agree.

> 
> Say the SMMU designer hard-codes the ID register field to 0x2 because the SMMU itself is capable of HTTU, and they assume it's always going to be wired up coherently, but then a customer integrates it to a non-coherent interconnect. Firmware needs to override that value to prevent an OS thinking that the claimed HTTU capability is ever going to work.
> 
> Or say the SMMU *is* integrated correctly, but due to an erratum discovered later in the interconnect or SMMU itself, it turns out DBM doesn't always work reliably, but AF is still OK. Firmware needs to downgrade the indicated level of support from that which was intended to that which works reliably.
> 
> Or say someone forgets to set an integration tieoff so their SMMU reports 0x0 even though it and the interconnect *can* happily support HTTU. In that case, firmware may want to upgrade the value to *allow* an OS to use HTTU despite the ID register being wrong.
Fair enough. A mask can express "downgrade", but not "upgrade"; you make a reasonable point about the upgrade case.

BTW, my original intention was that a mask could be convenient for BIOS makers, as the override flag could stay the same for all SMMUs regardless of whether they support HTTU. But as you show, a mask cannot cover every scenario.
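
To make the difference concrete (illustrative pseudo-logic only):

	httu = FIELD_GET(IDR0_HTTU, reg) & fw_httu_mask;	/* mask: can only downgrade */
	httu = fw_httu_override;				/* "use instead": can also upgrade */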

> 
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can we comprehend it as a mask for HTTU related hardware register?
>> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
> 
> No, it literally states that the OS must use the value of the firmware field *instead* of the value from the hardware field.
Yep, I just got the latest version and can see it.

> 
>>>> +    case IDR0_HTTU_NONE:
>>>> +        break;
>>>> +    case IDR0_HTTU_HA:
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>>> +        break;
>>>> +    case IDR0_HTTU_HAD:
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>>>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>>>> +        break;
>>>> +    default:
>>>> +        dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>>>> +        return -ENXIO;
>>>> +    }
>>>> +
>>>>        /*
>>>>         * The coherency feature as set by FW is used in preference to the ID
>>>>         * register, but warn on mismatch.
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> index 96c2e9565e00..e91bea44519e 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> @@ -33,6 +33,10 @@
>>>>    #define IDR0_ASID16            (1 << 12)
>>>>    #define IDR0_ATS            (1 << 10)
>>>>    #define IDR0_HYP            (1 << 9)
>>>> +#define IDR0_HTTU            GENMASK(7, 6)
>>>> +#define IDR0_HTTU_NONE            0
>>>> +#define IDR0_HTTU_HA            1
>>>> +#define IDR0_HTTU_HAD            2
>>>>    #define IDR0_COHACC            (1 << 4)
>>>>    #define IDR0_TTF            GENMASK(3, 2)
>>>>    #define IDR0_TTF_AARCH64        2
>>>> @@ -286,6 +290,8 @@
>>>>    #define CTXDESC_CD_0_TCR_TBI0        (1ULL << 38)
>>>>      #define CTXDESC_CD_0_AA64        (1UL << 41)
>>>> +#define CTXDESC_CD_0_HD            (1UL << 42)
>>>> +#define CTXDESC_CD_0_HA            (1UL << 43)
>>>>    #define CTXDESC_CD_0_S            (1UL << 44)
>>>>    #define CTXDESC_CD_0_R            (1UL << 45)
>>>>    #define CTXDESC_CD_0_A            (1UL << 46)
>>>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>>>    #define ARM_SMMU_FEAT_RANGE_INV        (1 << 15)
>>>>    #define ARM_SMMU_FEAT_BTM        (1 << 16)
>>>>    #define ARM_SMMU_FEAT_SVA        (1 << 17)
>>>> +#define ARM_SMMU_FEAT_HTTU_HA        (1 << 18)
>>>> +#define ARM_SMMU_FEAT_HTTU_HD        (1 << 19)
>>>>        u32                features;
>>>>      #define ARM_SMMU_OPT_SKIP_PREFETCH    (1 << 0)
>>>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>>>> index ea727eb1a1a9..1a00ea8562c7 100644
>>>> --- a/include/linux/io-pgtable.h
>>>> +++ b/include/linux/io-pgtable.h
>>>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>>>        unsigned long            pgsize_bitmap;
>>>>        unsigned int            ias;
>>>>        unsigned int            oas;
>>>> +    bool                httu_hd;
>>>
>>> This is very specific to the AArch64 stage 1 format, not a generic capability - I think it should be a quirk flag rather than a common field.
>> OK, so BBML should be a quirk flag too?
>>
>> Though the word "quirk" is not suitable for HTTU and BBML, we have no other place to convey smmu feature to io-pgtable.
> 
> Indeed these features aren't decorative grooves on a piece of furniture, but in the case of io-pgtable we're merely using "quirk" in its broadest sense to imply something that differs from the baseline default behaviour - ARM_MTK_EXT, ARM_TTBR1 and ARM_OUTER_WBWA (or whatever it's called this week) are all just indicating extra hardware features entirely comparable to HTTU; NON_STRICT is describing a similarly intentional and desired software behaviour. In fact only ARM_NS represents something that could be considered a "workaround".
OK, I will update it in v2.

> 
> Robin.
> .
> 
Thanks,
Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page
  2021-02-04 19:51   ` Robin Murphy
@ 2021-02-07  8:18     ` Keqian Zhu
  0 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07  8:18 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

Hi Robin,

On 2021/2/5 3:51, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> Block descriptor is not a proper granule for dirty log tracking. This
>> adds a new interface named split_block in iommu layer and arm smmuv3
>> implements it, which splits block descriptor to an equivalent span of
>> page descriptors.
>>
>> During spliting block, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the split
>> procedure is completed to ease the pressure of iommu, as we will split a
>> huge range of block mappings in general.
> 
> "Not expected to be" is not the same thing as "can not". Presumably the whole point of dirty log tracking is that it can be run speculatively in the background, so is there any actual guarantee that the guest can't, say, issue a hotplug event that would cause some memory to be released back to the host and unmapped while a scan might be in progress? Saying effectively "there is no race condition as long as you assume there is no race condition" isn't all that reassuring...
Sorry for the inaccurate wording. "Not expected to be" is the wrong phrase here; the actual meaning is "can not".

The only user of these newly added interfaces for now is vfio_iommu_type1, and it always acquires "iommu->lock" before invoking them.
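
A minimal sketch of the caller-side serialization I have in mind (simplified from the vfio_iommu_type1 code in patch #10, not verbatim):

static void start_dirty_log_example(struct vfio_iommu *iommu)
{
        mutex_lock(&iommu->lock);       /* same lock taken around the map/unmap paths */
        vfio_iommu_dirty_log_switch(iommu, true);       /* eventually calls iommu_split_block() */
        mutex_unlock(&iommu->lock);
}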

> 
> That said, it's not very clear why patches #4 and #5 are here at all, given that patches #6 and #7 appear quite happy to handle block entries.
Splitting blocks into pages is very important for dirty page tracking. Page mappings greatly reduce the amount of dirty memory that has to be handled. The KVM stage2 MMU side has the same logic.

Yes, #6 (log_sync) and #7 (log_clear) are designed to work on both block and page mappings. The "split" operation may fail (e.g. without BBML1/2, or on ENOMEM), but we can still track dirty state at block granularity, which is still a much better choice than the full-dirty policy.
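
To make the granularity difference concrete, a toy illustration (my own numbers, assuming a 4KiB dirty-bitmap granule; not part of this series):

/* bitmap bits marked dirty by a single device write, given the size of
 * the mapping that covers the written address */
static inline unsigned long dirty_bits_per_write(unsigned long mapping_size)
{
        return mapping_size >> 12;      /* 4KiB page: 1 bit, 2MiB block: 512, 1GiB block: 262144 */
}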

> 
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 ++++
>>   drivers/iommu/io-pgtable-arm.c              | 122 ++++++++++++++++++++
>>   drivers/iommu/iommu.c                       |  40 +++++++
>>   include/linux/io-pgtable.h                  |   2 +
>>   include/linux/iommu.h                       |  10 ++
>>   5 files changed, 194 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 9208881a571c..5469f4fca820 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
>>       return ret;
>>   }
>>   +static size_t arm_smmu_split_block(struct iommu_domain *domain,
>> +                   unsigned long iova, size_t size)
>> +{
>> +    struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +    struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +
>> +    if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
>> +        dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
>> +        return 0;
>> +    }
>> +
>> +    if (!ops || !ops->split_block) {
>> +        pr_err("don't support split block\n");
>> +        return 0;
>> +    }
>> +
>> +    return ops->split_block(ops, iova, size);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
>>   {
>>       return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
>>       .device_group        = arm_smmu_device_group,
>>       .domain_get_attr    = arm_smmu_domain_get_attr,
>>       .domain_set_attr    = arm_smmu_domain_set_attr,
>> +    .split_block        = arm_smmu_split_block,
>>       .of_xlate        = arm_smmu_of_xlate,
>>       .get_resv_regions    = arm_smmu_get_resv_regions,
>>       .put_resv_regions    = generic_iommu_put_resv_regions,
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index e299a44808ae..f3b7f7115e38 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -79,6 +79,8 @@
>>   #define ARM_LPAE_PTE_SH_IS        (((arm_lpae_iopte)3) << 8)
>>   #define ARM_LPAE_PTE_NS            (((arm_lpae_iopte)1) << 5)
>>   #define ARM_LPAE_PTE_VALID        (((arm_lpae_iopte)1) << 0)
>> +/* Block descriptor bits */
>> +#define ARM_LPAE_PTE_NT            (((arm_lpae_iopte)1) << 16)
>>     #define ARM_LPAE_PTE_ATTR_LO_MASK    (((arm_lpae_iopte)0x3ff) << 2)
>>   /* Ignore the contiguous bit for block splitting */
>> @@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
>>       return iopte_to_paddr(pte, data) | iova;
>>   }
>>   +static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
>> +                     unsigned long iova, size_t size, int lvl,
>> +                     arm_lpae_iopte *ptep);
>> +
>> +static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
>> +                    unsigned long iova, size_t size,
>> +                    arm_lpae_iopte blk_pte, int lvl,
>> +                    arm_lpae_iopte *ptep)
>> +{
>> +    struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> +    arm_lpae_iopte pte, *tablep;
>> +    phys_addr_t blk_paddr;
>> +    size_t tablesz = ARM_LPAE_GRANULE(data);
>> +    size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>> +    int i;
>> +
>> +    if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> +        return 0;
>> +
>> +    tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
>> +    if (!tablep)
>> +        return 0;
>> +
>> +    blk_paddr = iopte_to_paddr(blk_pte, data);
>> +    pte = iopte_prot(blk_pte);
>> +    for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
>> +        __arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
>> +
>> +    if (cfg->bbml == 1) {
>> +        /* Race does not exist */
>> +        blk_pte |= ARM_LPAE_PTE_NT;
>> +        __arm_lpae_set_pte(ptep, blk_pte, cfg);
>> +        io_pgtable_tlb_flush_walk(&data->iop, iova, size, size);
>> +    }
>> +    /* Race does not exist */
>> +    pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
>> +
>> +    /* Have splited it into page? */
>> +    if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
>> +        return size;
>> +
>> +    /* Go back to lvl - 1 */
>> +    ptep -= ARM_LPAE_LVL_IDX(iova, lvl - 1, data);
>> +    return __arm_lpae_split_block(data, iova, size, lvl - 1, ptep);
> 
> If there is a good enough justification for actually using this, recursive splitting is a horrible way to do it. The theoretical split_blk_unmap case does it for the sake of simplicity and mitigating races wherein multiple parts of the same block may be unmapped, but if were were using this in anger then we'd need it to be fast - and a race does not exist, right? - so building the entire sub-table first then swizzling a single PTE would make far more sense.
The main reason for recursive splitting is simplicity too, and simpler code is more reliable.

I am skeptical that recursive splitting adds much overhead compared to straight splitting.
Recursive splitting takes two steps: it first fills the next-level table with block TTDs, and then *replaces* each of those block TTDs with the address of a next-next-level table.
Straight splitting can fill the next-level table with next-next-level table addresses directly, and thus omits the *replace* step.

Taking 1G block splitting as an example, the extra cost is 512 replacements, which should have little impact on the total procedure.
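
For reference, a rough sketch of the "straight" single-pass flow (the helper build_page_subtree() is hypothetical; this only illustrates the difference and is not the patch code):

static size_t straight_split_sketch(struct arm_lpae_io_pgtable *data,
                                    unsigned long iova, arm_lpae_iopte blk_pte,
                                    int lvl, arm_lpae_iopte *ptep)
{
        /* hypothetical: allocate and fill all intermediate and leaf tables
         * for this block offline, down to page level */
        arm_lpae_iopte *tablep = build_page_subtree(data, blk_pte, lvl);

        if (!tablep)
                return 0;

        /* a single replacement of the live block descriptor */
        arm_lpae_install_table(tablep, ptep, blk_pte, &data->iop.cfg);
        return ARM_LPAE_BLOCK_SIZE(lvl, data);
}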


> 
>> +}
>> +
>> +static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
>> +                     unsigned long iova, size_t size, int lvl,
>> +                     arm_lpae_iopte *ptep)
>> +{
>> +    arm_lpae_iopte pte;
>> +    struct io_pgtable *iop = &data->iop;
>> +    size_t base, next_size, total_size;
>> +
>> +    if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> +        return 0;
>> +
>> +    ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
>> +    pte = READ_ONCE(*ptep);
>> +    if (WARN_ON(!pte))
>> +        return 0;
>> +
>> +    if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>> +        if (iopte_leaf(pte, lvl, iop->fmt)) {
>> +            if (lvl == (ARM_LPAE_MAX_LEVELS - 1) ||
>> +                (pte & ARM_LPAE_PTE_AP_RDONLY))
>> +                return size;
>> +
>> +            /* We find a writable block, split it. */
>> +            return arm_lpae_do_split_blk(data, iova, size, pte,
>> +                    lvl + 1, ptep);
>> +        } else {
>> +            /* If it is the last table level, then nothing to do */
>> +            if (lvl == (ARM_LPAE_MAX_LEVELS - 2))
>> +                return size;
>> +
>> +            total_size = 0;
>> +            next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
>> +            ptep = iopte_deref(pte, data);
>> +            for (base = 0; base < size; base += next_size)
>> +                total_size += __arm_lpae_split_block(data,
>> +                        iova + base, next_size, lvl + 1,
>> +                        ptep);
>> +            return total_size;
>> +        }
>> +    } else if (iopte_leaf(pte, lvl, iop->fmt)) {
>> +        WARN(1, "Can't split behind a block.\n");
>> +        return 0;
>> +    }
>> +
>> +    /* Keep on walkin */
>> +    ptep = iopte_deref(pte, data);
>> +    return __arm_lpae_split_block(data, iova, size, lvl + 1, ptep);
>> +}
>> +
>> +static size_t arm_lpae_split_block(struct io_pgtable_ops *ops,
>> +                   unsigned long iova, size_t size)
>> +{
>> +    struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>> +    arm_lpae_iopte *ptep = data->pgd;
>> +    struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> +    int lvl = data->start_level;
>> +    long iaext = (s64)iova >> cfg->ias;
>> +
>> +    if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
>> +        return 0;
>> +
>> +    if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
>> +        iaext = ~iaext;
>> +    if (WARN_ON(iaext))
>> +        return 0;
>> +
>> +    /* If it is smallest granule, then nothing to do */
>> +    if (size == ARM_LPAE_BLOCK_SIZE(ARM_LPAE_MAX_LEVELS - 1, data))
>> +        return size;
>> +
>> +    return __arm_lpae_split_block(data, iova, size, lvl, ptep);
>> +}
>> +
>>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>>   {
>>       unsigned long granule, page_sizes;
>> @@ -757,6 +878,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>>           .map        = arm_lpae_map,
>>           .unmap        = arm_lpae_unmap,
>>           .iova_to_phys    = arm_lpae_iova_to_phys,
>> +        .split_block    = arm_lpae_split_block,
>>       };
>>         return data;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index ffeebda8d6de..7dc0850448c3 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2707,6 +2707,46 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
>>   +size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>> +             size_t size)
>> +{
>> +    const struct iommu_ops *ops = domain->ops;
>> +    unsigned int min_pagesz;
>> +    size_t pgsize, splited_size;
>> +    size_t splited = 0;
>> +
>> +    min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +
>> +    if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +        pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +               iova, size, min_pagesz);
>> +        return 0;
>> +    }
>> +
>> +    if (!ops || !ops->split_block) {
>> +        pr_err("don't support split block\n");
>> +        return 0;
>> +    }
>> +
>> +    while (size) {
>> +        pgsize = iommu_pgsize(domain, iova, size);
> 
> If the whole point of this operation is to split a mapping down to a specific granularity, why bother recalculating that granularity over and over again?
Sorry for the unclear wording again. The computed pgsize is not our target granularity; it is the maximum splitting size that meets the alignment requirement and fits into the pgsize_bitmap.

The "split" interface assumes @size fits into the pgsize_bitmap to simplify its implementation. The same assumption is made for "map" and "unmap", and the logic of "split" is very similar to that of "unmap".
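
To spell out what that computation does, a simplified sketch of the selection (the real iommu_pgsize() in drivers/iommu/iommu.c is the authority; this is only the idea):

/* largest supported page size that both fits in @size and respects the
 * alignment of @iova */
static size_t pgsize_sketch(unsigned long pgsize_bitmap,
                            unsigned long iova, size_t size)
{
        unsigned int idx = __fls(size);         /* biggest size that fits */

        if (iova)
                idx = min(idx, (unsigned int)__ffs(iova));      /* biggest size iova is aligned to */

        /* keep only hardware-supported sizes and pick the biggest of them */
        return 1UL << __fls(pgsize_bitmap & ((1UL << (idx + 1)) - 1));
}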

> 
>> +
>> +        splited_size = ops->split_block(domain, iova, pgsize);
>> +
>> +        pr_debug("splited: iova 0x%lx size 0x%zx\n", iova, splited_size);
>> +        iova += splited_size;
>> +        size -= splited_size;
>> +        splited += splited_size;
>> +
>> +        if (splited_size != pgsize)
>> +            break;
>> +    }
>> +    iommu_flush_iotlb_all(domain);
>> +
>> +    return splited;
> 
> Language tangent: note that "split" is one of those delightful irregular verbs, in that its past tense is also "split". This isn't the best operation for clear, unambigous naming :)
Ouch, we ought to use another word here. ;-)

> 
> Don't let these idle nitpicks distract from the bigger concerns above, though. FWIW if the caller knows from the start that they want to keep track of things at page granularity, they always have the option of just stripping the larger sizes out of domain->pgsize_bitmap before mapping anything.
Before dirty tracking, we use block mappings for best DMA performance; when dirty tracking starts, we split the block mappings into page mappings to minimize dirty handling.
I think we should not sacrifice DMA performance just to avoid splitting; in our tests, splitting a 1G block mapping down to 4K mappings takes only about 2ms.
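
A purely illustrative calculation (an assumed workload, not measured data): if the guest dirties 1000 scattered 4KiB pages during one pre-copy iteration,

        tracking at 2MiB granularity reports 1000 * 2MiB ~= 2000MiB to re-send,
        tracking at 4KiB granularity reports 1000 * 4KiB ~= 4MiB to re-send,

so the one-off splitting cost is paid back very quickly.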

Thanks,
Keqian

> 
> Robin.
> 
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_split_block);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>       const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 26583beeb5d9..b87c6f4ecaa2 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -162,6 +162,8 @@ struct io_pgtable_ops {
>>               size_t size, struct iommu_iotlb_gather *gather);
>>       phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>                       unsigned long iova);
>> +    size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
>> +                  size_t size);
>>   };
>>     /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index b3f0e2018c62..abeb811098a5 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -258,6 +258,8 @@ struct iommu_ops {
>>                      enum iommu_attr attr, void *data);
>>       int (*domain_set_attr)(struct iommu_domain *domain,
>>                      enum iommu_attr attr, void *data);
>> +    size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
>> +                  size_t size);
>>         /* Request/Free a list of reserved regions for a device */
>>       void (*get_resv_regions)(struct device *dev, struct list_head *list);
>> @@ -509,6 +511,8 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
>>                    void *data);
>>   extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
>>                    void *data);
>> +extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>> +                size_t size);
>>     /* Window handling function prototypes */
>>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
>> @@ -903,6 +907,12 @@ static inline int iommu_domain_set_attr(struct iommu_domain *domain,
>>       return -EINVAL;
>>   }
>>   +static inline size_t iommu_split_block(struct iommu_domain *domain,
>> +                       unsigned long iova, size_t size)
>> +{
>> +    return 0;
>> +}
>> +
>>   static inline int  iommu_device_register(struct iommu_device *iommu)
>>   {
>>       return -ENODEV;
>>
> .
> 

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-01-28 15:17 ` [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM Keqian Zhu
@ 2021-02-07  9:56   ` Yi Sun
  2021-02-07 10:40     ` Keqian Zhu
  2021-02-09 11:16     ` Robin Murphy
  0 siblings, 2 replies; 36+ messages in thread
From: Yi Sun @ 2021-02-07  9:56 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: Mark Rutland, kvm, Catalin Marinas, Kirti Wankhede, Will Deacon,
	kvmarm, Marc Zyngier, jiangkunkun, wanghaibin.wang, kevin.tian,
	yan.y.zhao, Suzuki K Poulose, Alex Williamson, linux-arm-kernel,
	Cornelia Huck, linux-kernel, lushenming, iommu, James Morse,
	Robin Murphy

Hi,

On 21-01-28 23:17:41, Keqian Zhu wrote:

[...]

> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
> +				     struct vfio_dma *dma)
> +{
> +	struct vfio_domain *d;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		/* Go through all domain anyway even if we fail */
> +		iommu_split_block(d->domain, dma->iova, dma->size);
> +	}
> +}

This should be a generic switch that prepares for dirty log start. Per the
Intel VT-d spec, the Scalable-Mode PASID Table Entry defines SLADE, which
enables the Accessed/Dirty flags in second-level paging entries.
So a generic iommu interface is better here: for Intel iommu it would
enable SLADE, and for ARM it would split blocks.
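
Something like the following shape is what I have in mind (illustration
only; the struct and callback names are made up, this is not an existing
interface):

/* hypothetical vendor hooks: Intel iommu would enable/disable SLADE here,
 * ARM smmu would split/merge its mappings */
struct iommu_dirty_log_ops_example {
        int (*dirty_log_start)(struct iommu_domain *domain,
                               unsigned long iova, size_t size);
        int (*dirty_log_stop)(struct iommu_domain *domain,
                              unsigned long iova, size_t size, int prot);
};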

> +
> +static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
> +				    struct vfio_dma *dma)
> +{
> +	struct vfio_domain *d;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		/* Go through all domain anyway even if we fail */
> +		iommu_merge_page(d->domain, dma->iova, dma->size,
> +				 d->prot | dma->prot);
> +	}
> +}

Same as the comment above: a generic interface is required here.

> +
> +static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
> +{
> +	struct rb_node *n;
> +
> +	/* Split and merge even if all iommu don't support HWDBM now */
> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		if (!dma->iommu_mapped)
> +			continue;
> +
> +		/* Go through all dma range anyway even if we fail */
> +		if (start)
> +			vfio_dma_dirty_log_start(iommu, dma);
> +		else
> +			vfio_dma_dirty_log_stop(iommu, dma);
> +	}
> +}
> +
>  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  					unsigned long arg)
>  {
> @@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  		pgsize = 1 << __ffs(iommu->pgsize_bitmap);
>  		if (!iommu->dirty_page_tracking) {
>  			ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
> -			if (!ret)
> +			if (!ret) {
>  				iommu->dirty_page_tracking = true;
> +				vfio_iommu_dirty_log_switch(iommu, true);
> +			}
>  		}
>  		mutex_unlock(&iommu->lock);
>  		return ret;
> @@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  		if (iommu->dirty_page_tracking) {
>  			iommu->dirty_page_tracking = false;
>  			vfio_dma_bitmap_free_all(iommu);
> +			vfio_iommu_dirty_log_switch(iommu, false);
>  		}
>  		mutex_unlock(&iommu->lock);
>  		return 0;
> -- 
> 2.19.1

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-07  9:56   ` Yi Sun
@ 2021-02-07 10:40     ` Keqian Zhu
  2021-02-09 11:57       ` Yi Sun
  2021-02-09 11:16     ` Robin Murphy
  1 sibling, 1 reply; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07 10:40 UTC (permalink / raw)
  To: Yi Sun
  Cc: Mark Rutland, kvm, Catalin Marinas, Kirti Wankhede, Will Deacon,
	kvmarm, Marc Zyngier, jiangkunkun, wanghaibin.wang, kevin.tian,
	yan.y.zhao, Suzuki K Poulose, Alex Williamson, linux-arm-kernel,
	Cornelia Huck, linux-kernel, lushenming, iommu, James Morse,
	Robin Murphy

Hi Yi,

On 2021/2/7 17:56, Yi Sun wrote:
> Hi,
> 
> On 21-01-28 23:17:41, Keqian Zhu wrote:
> 
> [...]
> 
>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>> +				     struct vfio_dma *dma)
>> +{
>> +	struct vfio_domain *d;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		/* Go through all domain anyway even if we fail */
>> +		iommu_split_block(d->domain, dma->iova, dma->size);
>> +	}
>> +}
> 
> This should be a switch to prepare for dirty log start. Per Intel
> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> It enables Accessed/Dirty Flags in second-level paging entries.
> So, a generic iommu interface here is better. For Intel iommu, it
> enables SLADE. For ARM, it splits block.
Indeed, a generic interface name is better.

The vendor iommu driver performs its vendor-specific actions to start dirty log tracking, and Intel iommu and ARM smmu may differ. Besides, we may add more actions to the ARM smmu driver in the future.

One question: though I am not familiar with Intel iommu, I think it should also split block mappings besides enabling SLADE, right?

Thanks,
Keqian

* Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor
  2021-02-04 19:52   ` Robin Murphy
@ 2021-02-07 12:13     ` Keqian Zhu
  0 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07 12:13 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> When stop dirty log tracking, we need to recover all block descriptors
>> which are splited when start dirty log tracking. This adds a new
>> interface named merge_page in iommu layer and arm smmuv3 implements it,
>> which reinstall block mappings and unmap the span of page mappings.
>>
>> It's caller's duty to find contiuous physical memory.
>>
>> During merging page, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the merge
>> procedure is completed to ease the pressure of iommu, as we will merge a
>> huge range of page mappings in general.
> 
> Again, I think we need better reasoning than "race conditions don't exist because we don't expect them to exist".
Sure, because they can't. ;-)

> 
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++++++
>>   drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++++++
>>   drivers/iommu/iommu.c                       | 75 ++++++++++++++++++++
>>   include/linux/io-pgtable.h                  |  2 +
>>   include/linux/iommu.h                       | 10 +++
>>   5 files changed, 185 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 5469f4fca820..2434519e4bb6 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain *domain,
>>       return ops->split_block(ops, iova, size);
>>   }
[...]

>> +
>> +size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>> +            size_t size, int prot)
>> +{
>> +    phys_addr_t phys;
>> +    dma_addr_t p, i;
>> +    size_t cont_size, merged_size;
>> +    size_t merged = 0;
>> +
>> +    while (size) {
>> +        phys = iommu_iova_to_phys(domain, iova);
>> +        cont_size = PAGE_SIZE;
>> +        p = phys + cont_size;
>> +        i = iova + cont_size;
>> +
>> +        while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
>> +            p += PAGE_SIZE;
>> +            i += PAGE_SIZE;
>> +            cont_size += PAGE_SIZE;
>> +        }
>> +
>> +        merged_size = __iommu_merge_page(domain, iova, phys, cont_size,
>> +                prot);
> 
> This is incredibly silly. The amount of time you'll spend just on walking the tables in all those iova_to_phys() calls is probably significantly more than it would take the low-level pagetable code to do the entire operation for itself. At this level, any knowledge of how mappings are actually constructed is lost once __iommu_map() returns, so we just don't know, and for this operation in particular there seems little point in trying to guess - the driver backend still has to figure out whether something we *think* might me mergeable actually is, so it's better off doing the entire operation in a single pass by itself.
>
> There's especially little point in starting all this work *before* checking that it's even possible...
>
> Robin.

Well, this indeed looks silly. But the iova->phys info is only stored in the pagetable, and there seems to be no other way to find contiguous physical addresses :-( (actually, vfio_iommu_replay() has similar logic).

We put the procedure for finding contiguous physical addresses in the common iommu layer because this logic is common to all types of iommu driver.

If a vendor iommu driver decides that (iova, phys, cont_size) is not mergeable, it can make its own decision about how to map the range. This is the same as iommu_map(), which provides (iova, paddr, pgsize) to the vendor driver and lets it decide how to map.

Do I understand your idea correctly?

Thanks,
Keqian
> 
>> +        iova += merged_size;
>> +        size -= merged_size;
>> +        merged += merged_size;
>> +
>> +        if (merged_size != cont_size)
>> +            break;
>> +    }
>> +    iommu_flush_iotlb_all(domain);
>> +
>> +    return merged;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_merge_page);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>       const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index b87c6f4ecaa2..754b62a1bbaf 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -164,6 +164,8 @@ struct io_pgtable_ops {
>>                       unsigned long iova);
>>       size_t (*split_block)(struct io_pgtable_ops *ops, unsigned long iova,
>>                     size_t size);
>> +    size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>> +                 phys_addr_t phys, size_t size, int prot);
>>   };
>>     /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index abeb811098a5..ac2b0b1bce0f 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -260,6 +260,8 @@ struct iommu_ops {
>>                      enum iommu_attr attr, void *data);
>>       size_t (*split_block)(struct iommu_domain *domain, unsigned long iova,
>>                     size_t size);
>> +    size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
>> +                 phys_addr_t phys, size_t size, int prot);
>>         /* Request/Free a list of reserved regions for a device */
>>       void (*get_resv_regions)(struct device *dev, struct list_head *list);
>> @@ -513,6 +515,8 @@ extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
>>                    void *data);
>>   extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>>                   size_t size);
>> +extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>> +                   size_t size, int prot);
>>     /* Window handling function prototypes */
>>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
>> @@ -913,6 +917,12 @@ static inline size_t iommu_split_block(struct iommu_domain *domain,
>>       return 0;
>>   }
>>   +static inline size_t iommu_merge_page(struct iommu_domain *domain,
>> +                      unsigned long iova, size_t size, int prot)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>>   static inline int  iommu_device_register(struct iommu_device *iommu)
>>   {
>>       return -ENODEV;
>>
> .
> 

* Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  2021-02-04 19:52   ` Robin Murphy
@ 2021-02-07 12:41     ` Keqian Zhu
  2021-02-08  1:17     ` Keqian Zhu
  1 sibling, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-07 12:41 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
>> named sync_dirty_log in iommu layer and arm smmuv3 implements it,
>> which scans leaf TTD and treats it's dirty if it's writable (As we
>> just enable HTTU for stage1, so check AP[2] is not set).
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++++++
>>   drivers/iommu/io-pgtable-arm.c              | 90 +++++++++++++++++++++
>>   drivers/iommu/iommu.c                       | 41 ++++++++++
>>   include/linux/io-pgtable.h                  |  4 +
>>   include/linux/iommu.h                       | 17 ++++
>>   5 files changed, 179 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 2434519e4bb6..43d0536b429a 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iov
>>       return ops->merge_page(ops, iova, paddr, size, prot);
>>   }
>>   +static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
>> +                   unsigned long iova, size_t size,
>> +                   unsigned long *bitmap,
>> +                   unsigned long base_iova,
>> +                   unsigned long bitmap_pgshift)
>> +{
>> +    struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +    struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +
>> +    if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
>> +        dev_err(smmu->dev, "don't support HTTU_HD and sync dirty log\n");
>> +        return -EPERM;
>> +    }
>> +
>> +    if (!ops || !ops->sync_dirty_log) {
>> +        pr_err("don't support sync dirty log\n");
>> +        return -ENODEV;
>> +    }
>> +
>> +    /* To ensure all inflight transactions are completed */
>> +    arm_smmu_flush_iotlb_all(domain);
> 
> What about transactions that arrive between the point that this completes, and the point - potentially much later - that we actually access any given PTE during the walk? I don't see what this is supposed to be synchronising against, even if it were just a CMD_SYNC (I especially don't see why we'd want to knock out the TLBs).
The idea is that the pagetable may be updated by HTTU either *before* or *after* the actual DMA access.

1) For PCI ATS. The SMMU spec (3.13.6.1 Hardware flag update for ATS & PRI) states:

"In addition to the behavior that is described earlier in this section, if hardware-management of Dirty state is enabled
and an ATS request for write access (with NW == 0) is made to a page that is marked Writable Clean, the SMMU
assumes a write will be made to that page and marks the page as Writable Dirty before returning the ATS response
that grants write access. When this happens, the modification to the page data by a device is not visible before
the page state is visible as Writable Dirty."

The problem is that guest memory may be dirtied *after* we actually handle it.

2) For inflight DMA. The SMMU spec (3.13.4 HTTU behavior summary) states:

"In addition, the completion of a TLB invalidation operation makes TTD updates that were caused by
transactions that are themselves completed by the completion of the TLB invalidation visible. Both
broadcast and explicit CMD_TLBI_* invalidations have this property."

The problem is that we must flush all inflight DMA transactions after the guest stops.



The key to solving both problems is to invalidate the related TLB entries (a small sketch of the intended ordering follows below):
1) A TLBI completes in-flight DMA translations (done before dirty_log_sync()).
2) If a DMA translation uses the ATC and happens after we have handled the dirty memory, the ATC has already been invalidated, so the write will re-mark the page as dirty (picked up in dirty_log_clear()).
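
A minimal sketch of the ordering I am describing (an illustrative wrapper built on the interfaces added in this series; not verbatim driver code):

static int sync_dirty_log_sketch(struct iommu_domain *domain,
                                 unsigned long iova, size_t size,
                                 unsigned long *bitmap,
                                 unsigned long base_iova,
                                 unsigned long pgshift)
{
        /* 1) complete in-flight translations and ATS-granted writes so
         *    that their TTD dirty-state updates become visible */
        iommu_flush_iotlb_all(domain);

        /* 2) now scan the TTDs into the dirty bitmap */
        return iommu_sync_dirty_log(domain, iova, size,
                                    bitmap, base_iova, pgshift);
}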

Thanks,
Keqian

> 
>> +
>> +    return ops->sync_dirty_log(ops, iova, size, bitmap,
>> +            base_iova, bitmap_pgshift);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
>>   {
>>       return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
>>       .domain_set_attr    = arm_smmu_domain_set_attr,
>>       .split_block        = arm_smmu_split_block,
>>       .merge_page        = arm_smmu_merge_page,
>> +    .sync_dirty_log        = arm_smmu_sync_dirty_log,
>>       .of_xlate        = arm_smmu_of_xlate,
>>       .get_resv_regions    = arm_smmu_get_resv_regions,
>>       .put_resv_regions    = generic_iommu_put_resv_regions,
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index 17390f258eb1..6cfe1ef3fedd 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long iova
>>       return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
>>   }
>>   +static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
>> +                     unsigned long iova, size_t size,
>> +                     int lvl, arm_lpae_iopte *ptep,
>> +                     unsigned long *bitmap,
>> +                     unsigned long base_iova,
>> +                     unsigned long bitmap_pgshift)
>> +{
>> +    arm_lpae_iopte pte;
>> +    struct io_pgtable *iop = &data->iop;
>> +    size_t base, next_size;
>> +    unsigned long offset;
>> +    int nbits, ret;
>> +
>> +    if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> +        return -EINVAL;
>> +
>> +    ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
>> +    pte = READ_ONCE(*ptep);
>> +    if (WARN_ON(!pte))
>> +        return -EINVAL;
>> +
>> +    if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>> +        if (iopte_leaf(pte, lvl, iop->fmt)) {
>> +            if (pte & ARM_LPAE_PTE_AP_RDONLY)
>> +                return 0;
>> +
>> +            /* It is writable, set the bitmap */
>> +            nbits = size >> bitmap_pgshift;
>> +            offset = (iova - base_iova) >> bitmap_pgshift;
>> +            bitmap_set(bitmap, offset, nbits);
>> +            return 0;
>> +        } else {
>> +            /* To traverse next level */
>> +            next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
>> +            ptep = iopte_deref(pte, data);
>> +            for (base = 0; base < size; base += next_size) {
>> +                ret = __arm_lpae_sync_dirty_log(data,
>> +                        iova + base, next_size, lvl + 1,
>> +                        ptep, bitmap, base_iova, bitmap_pgshift);
>> +                if (ret)
>> +                    return ret;
>> +            }
>> +            return 0;
>> +        }
>> +    } else if (iopte_leaf(pte, lvl, iop->fmt)) {
>> +        if (pte & ARM_LPAE_PTE_AP_RDONLY)
>> +            return 0;
>> +
>> +        /* Though the size is too small, also set bitmap */
>> +        nbits = size >> bitmap_pgshift;
>> +        offset = (iova - base_iova) >> bitmap_pgshift;
>> +        bitmap_set(bitmap, offset, nbits);
>> +        return 0;
>> +    }
>> +
>> +    /* Keep on walkin */
>> +    ptep = iopte_deref(pte, data);
>> +    return __arm_lpae_sync_dirty_log(data, iova, size, lvl + 1, ptep,
>> +            bitmap, base_iova, bitmap_pgshift);
>> +}
>> +
>> +static int arm_lpae_sync_dirty_log(struct io_pgtable_ops *ops,
>> +                   unsigned long iova, size_t size,
>> +                   unsigned long *bitmap,
>> +                   unsigned long base_iova,
>> +                   unsigned long bitmap_pgshift)
>> +{
>> +    struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>> +    arm_lpae_iopte *ptep = data->pgd;
>> +    int lvl = data->start_level;
>> +    struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> +    long iaext = (s64)iova >> cfg->ias;
>> +
>> +    if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
>> +        return -EINVAL;
>> +
>> +    if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
>> +        iaext = ~iaext;
>> +    if (WARN_ON(iaext))
>> +        return -EINVAL;
>> +
>> +    if (data->iop.fmt != ARM_64_LPAE_S1 &&
>> +        data->iop.fmt != ARM_32_LPAE_S1)
>> +        return -EINVAL;
>> +
>> +    return __arm_lpae_sync_dirty_log(data, iova, size, lvl, ptep,
>> +                     bitmap, base_iova, bitmap_pgshift);
>> +}
>> +
>>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>>   {
>>       unsigned long granule, page_sizes;
>> @@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>>           .iova_to_phys    = arm_lpae_iova_to_phys,
>>           .split_block    = arm_lpae_split_block,
>>           .merge_page    = arm_lpae_merge_page,
>> +        .sync_dirty_log    = arm_lpae_sync_dirty_log,
>>       };
>>         return data;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index f1261da11ea8..69f268069282 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_merge_page);
>>   +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> +             size_t size, unsigned long *bitmap,
>> +             unsigned long base_iova, unsigned long bitmap_pgshift)
>> +{
>> +    const struct iommu_ops *ops = domain->ops;
>> +    unsigned int min_pagesz;
>> +    size_t pgsize;
>> +    int ret;
>> +
>> +    min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +
>> +    if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +        pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +               iova, size, min_pagesz);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (!ops || !ops->sync_dirty_log) {
>> +        pr_err("don't support sync dirty log\n");
>> +        return -ENODEV;
>> +    }
>> +
>> +    while (size) {
>> +        pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +        ret = ops->sync_dirty_log(domain, iova, pgsize,
>> +                      bitmap, base_iova, bitmap_pgshift);
> 
> Once again, we have a worst-of-both-worlds iteration that doesn't make much sense. iommu_pgsize() essentially tells you the best supported size that an IOVA range *can* be mapped with, but we're iterating a range that's already mapped, so we don't know if it's relevant, and either way it may not bear any relation to the granularity of the bitmap, which is presumably what actually matters.
> 
> Logically, either we should iterate at the bitmap granularity here, and the driver just says whether the given iova chunk contains any dirty pages or not, or we just pass everything through to the driver and let it do the whole job itself. Doing a little bit of both is just an overcomplicated mess.
> 
> I'm skimming patch #7 and pretty much the same comments apply, so I can't be bothered to repeat them there...
> 
> Robin.
> 
>> +        if (ret)
>> +            break;
>> +
>> +        pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
>> +             pgsize);
>> +
>> +        iova += pgsize;
>> +        size -= pgsize;
>> +    }
>> +
>> +    return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>       const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 754b62a1bbaf..f44551e4a454 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>>                     size_t size);
>>       size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>>                    phys_addr_t phys, size_t size, int prot);
>> +    int (*sync_dirty_log)(struct io_pgtable_ops *ops,
>> +                  unsigned long iova, size_t size,
>> +                  unsigned long *bitmap, unsigned long base_iova,
>> +                  unsigned long bitmap_pgshift);
>>   };
>>     /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index ac2b0b1bce0f..8069c8375e63 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -262,6 +262,10 @@ struct iommu_ops {
>>                     size_t size);
>>       size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
>>                    phys_addr_t phys, size_t size, int prot);
>> +    int (*sync_dirty_log)(struct iommu_domain *domain,
>> +                  unsigned long iova, size_t size,
>> +                  unsigned long *bitmap, unsigned long base_iova,
>> +                  unsigned long bitmap_pgshift);
>>         /* Request/Free a list of reserved regions for a device */
>>       void (*get_resv_regions)(struct device *dev, struct list_head *list);
>> @@ -517,6 +521,10 @@ extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>>                   size_t size);
>>   extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>>                      size_t size, int prot);
>> +extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> +                size_t size, unsigned long *bitmap,
>> +                unsigned long base_iova,
>> +                unsigned long bitmap_pgshift);
>>     /* Window handling function prototypes */
>>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
>> @@ -923,6 +931,15 @@ static inline size_t iommu_merge_page(struct iommu_domain *domain,
>>       return -EINVAL;
>>   }
>>   +static inline int iommu_sync_dirty_log(struct iommu_domain *domain,
>> +                       unsigned long iova, size_t size,
>> +                       unsigned long *bitmap,
>> +                       unsigned long base_iova,
>> +                       unsigned long pgshift)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>>   static inline int  iommu_device_register(struct iommu_device *iommu)
>>   {
>>       return -ENODEV;
>>
> .
> 

* Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  2021-02-04 19:52   ` Robin Murphy
  2021-02-07 12:41     ` Keqian Zhu
@ 2021-02-08  1:17     ` Keqian Zhu
  1 sibling, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-08  1:17 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Marc Zyngier,
	Cornelia Huck, Kirti Wankhede, lushenming, Alex Williamson,
	James Morse, Catalin Marinas, wanghaibin.wang, Will Deacon



On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
[...]

>>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>>   {
>>       unsigned long granule, page_sizes;
>> @@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>>           .iova_to_phys    = arm_lpae_iova_to_phys,
>>           .split_block    = arm_lpae_split_block,
>>           .merge_page    = arm_lpae_merge_page,
>> +        .sync_dirty_log    = arm_lpae_sync_dirty_log,
>>       };
>>         return data;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index f1261da11ea8..69f268069282 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_merge_page);
>>   +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> +             size_t size, unsigned long *bitmap,
>> +             unsigned long base_iova, unsigned long bitmap_pgshift)
>> +{
>> +    const struct iommu_ops *ops = domain->ops;
>> +    unsigned int min_pagesz;
>> +    size_t pgsize;
>> +    int ret;
>> +
>> +    min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +
>> +    if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +        pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +               iova, size, min_pagesz);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (!ops || !ops->sync_dirty_log) {
>> +        pr_err("don't support sync dirty log\n");
>> +        return -ENODEV;
>> +    }
>> +
>> +    while (size) {
>> +        pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +        ret = ops->sync_dirty_log(domain, iova, pgsize,
>> +                      bitmap, base_iova, bitmap_pgshift);
> 
> Once again, we have a worst-of-both-worlds iteration that doesn't make much sense. iommu_pgsize() essentially tells you the best supported size that an IOVA range *can* be mapped with, but we're iterating a range that's already mapped, so we don't know if it's relevant, and either way it may not bear any relation to the granularity of the bitmap, which is presumably what actually matters.
> 
> Logically, either we should iterate at the bitmap granularity here, and the driver just says whether the given iova chunk contains any dirty pages or not, or we just pass everything through to the driver and let it do the whole job itself. Doing a little bit of both is just an overcomplicated mess.
> 
> I'm skimming patch #7 and pretty much the same comments apply, so I can't be bothered to repeat them there...
> 
> Robin.
Sorry that I missed these comments...

As I clarified in #4, due to an unsuitable variable name, @pgsize is actually the maximum size that meets the alignment requirement and fits into the pgsize_bitmap.

All iommu interfaces require that @size fits into the pgsize_bitmap, to simplify their implementation. And the logic here is very similar to "unmap".

Thanks,
Keqian

> 
>> +        if (ret)
>> +            break;
>> +
>> +        pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
>> +             pgsize);
>> +
>> +        iova += pgsize;
>> +        size -= pgsize;
>> +    }
>> +
>> +    return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>       const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 754b62a1bbaf..f44551e4a454 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>>                     size_t size);
>>       size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>>                    phys_addr_t phys, size_t size, int prot);
>> +    int (*sync_dirty_log)(struct io_pgtable_ops *ops,
>> +                  unsigned long iova, size_t size,
>> +                  unsigned long *bitmap, unsigned long base_iova,
>> +                  unsigned long bitmap_pgshift);
>>   };
>>     /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index ac2b0b1bce0f..8069c8375e63 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -262,6 +262,10 @@ struct iommu_ops {
>>                     size_t size);
>>       size_t (*merge_page)(struct iommu_domain *domain, unsigned long iova,
>>                    phys_addr_t phys, size_t size, int prot);
>> +    int (*sync_dirty_log)(struct iommu_domain *domain,
>> +                  unsigned long iova, size_t size,
>> +                  unsigned long *bitmap, unsigned long base_iova,
>> +                  unsigned long bitmap_pgshift);
>>         /* Request/Free a list of reserved regions for a device */
>>       void (*get_resv_regions)(struct device *dev, struct list_head *list);
>> @@ -517,6 +521,10 @@ extern size_t iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>>                   size_t size);
>>   extern size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>>                      size_t size, int prot);
>> +extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> +                size_t size, unsigned long *bitmap,
>> +                unsigned long base_iova,
>> +                unsigned long bitmap_pgshift);
>>     /* Window handling function prototypes */
>>   extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
>> @@ -923,6 +931,15 @@ static inline size_t iommu_merge_page(struct iommu_domain *domain,
>>       return -EINVAL;
>>   }
>>   +static inline int iommu_sync_dirty_log(struct iommu_domain *domain,
>> +                       unsigned long iova, size_t size,
>> +                       unsigned long *bitmap,
>> +                       unsigned long base_iova,
>> +                       unsigned long pgshift)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>>   static inline int  iommu_device_register(struct iommu_device *iommu)
>>   {
>>       return -ENODEV;
>>
> .
> 

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-07  9:56   ` Yi Sun
  2021-02-07 10:40     ` Keqian Zhu
@ 2021-02-09 11:16     ` Robin Murphy
  2021-02-09 12:02       ` Yi Sun
  1 sibling, 1 reply; 36+ messages in thread
From: Robin Murphy @ 2021-02-09 11:16 UTC (permalink / raw)
  To: Yi Sun, Keqian Zhu
  Cc: Mark Rutland, kevin.tian, Cornelia Huck, yan.y.zhao, kvm,
	Suzuki K Poulose, Catalin Marinas, jiangkunkun, Alex Williamson,
	iommu, linux-kernel, lushenming, Kirti Wankhede, James Morse,
	Marc Zyngier, wanghaibin.wang, Will Deacon, kvmarm,
	linux-arm-kernel

On 2021-02-07 09:56, Yi Sun wrote:
> Hi,
> 
> On 21-01-28 23:17:41, Keqian Zhu wrote:
> 
> [...]
> 
>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>> +				     struct vfio_dma *dma)
>> +{
>> +	struct vfio_domain *d;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		/* Go through all domain anyway even if we fail */
>> +		iommu_split_block(d->domain, dma->iova, dma->size);
>> +	}
>> +}
> 
> This should be a switch to prepare for dirty log start. Per Intel
> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> It enables Accessed/Dirty Flags in second-level paging entries.
> So, a generic iommu interface here is better. For Intel iommu, it
> enables SLADE. For ARM, it splits block.

From a quick look, VT-D's SLADE and SMMU's HTTU appear to be the exact 
same thing. This step isn't about enabling or disabling that feature 
itself (the proposal for SMMU is to simply leave HTTU enabled all the 
time), it's about controlling the granularity at which the dirty status 
can be detected/reported at all, since that's tied to the pagetable 
structure.

However, if an IOMMU were to come along with some other way of reporting 
dirty status that didn't depend on the granularity of individual 
mappings, then indeed it wouldn't need this operation.

Robin.

>> +
>> +static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
>> +				    struct vfio_dma *dma)
>> +{
>> +	struct vfio_domain *d;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		/* Go through all domain anyway even if we fail */
>> +		iommu_merge_page(d->domain, dma->iova, dma->size,
>> +				 d->prot | dma->prot);
>> +	}
>> +}
> 
> Same as above comment, a generic interface is required here.
> 
>> +
>> +static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
>> +{
>> +	struct rb_node *n;
>> +
>> +	/* Split and merge even if all iommu don't support HWDBM now */
>> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		if (!dma->iommu_mapped)
>> +			continue;
>> +
>> +		/* Go through all dma range anyway even if we fail */
>> +		if (start)
>> +			vfio_dma_dirty_log_start(iommu, dma);
>> +		else
>> +			vfio_dma_dirty_log_stop(iommu, dma);
>> +	}
>> +}
>> +
>>   static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>>   					unsigned long arg)
>>   {
>> @@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>>   		pgsize = 1 << __ffs(iommu->pgsize_bitmap);
>>   		if (!iommu->dirty_page_tracking) {
>>   			ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
>> -			if (!ret)
>> +			if (!ret) {
>>   				iommu->dirty_page_tracking = true;
>> +				vfio_iommu_dirty_log_switch(iommu, true);
>> +			}
>>   		}
>>   		mutex_unlock(&iommu->lock);
>>   		return ret;
>> @@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>>   		if (iommu->dirty_page_tracking) {
>>   			iommu->dirty_page_tracking = false;
>>   			vfio_dma_bitmap_free_all(iommu);
>> +			vfio_iommu_dirty_log_switch(iommu, false);
>>   		}
>>   		mutex_unlock(&iommu->lock);
>>   		return 0;
>> -- 
>> 2.19.1

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-07 10:40     ` Keqian Zhu
@ 2021-02-09 11:57       ` Yi Sun
  2021-02-09 12:08         ` Robin Murphy
                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Yi Sun @ 2021-02-09 11:57 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: Mark Rutland, kvm, Catalin Marinas, Kirti Wankhede, Will Deacon,
	kvmarm, Marc Zyngier, jiangkunkun, wanghaibin.wang, kevin.tian,
	yan.y.zhao, Suzuki K Poulose, Alex Williamson, linux-arm-kernel,
	Cornelia Huck, linux-kernel, lushenming, iommu, James Morse,
	Robin Murphy

On 21-02-07 18:40:36, Keqian Zhu wrote:
> Hi Yi,
> 
> On 2021/2/7 17:56, Yi Sun wrote:
> > Hi,
> > 
> > On 21-01-28 23:17:41, Keqian Zhu wrote:
> > 
> > [...]
> > 
> >> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
> >> +				     struct vfio_dma *dma)
> >> +{
> >> +	struct vfio_domain *d;
> >> +
> >> +	list_for_each_entry(d, &iommu->domain_list, next) {
> >> +		/* Go through all domain anyway even if we fail */
> >> +		iommu_split_block(d->domain, dma->iova, dma->size);
> >> +	}
> >> +}
> > 
> > This should be a switch to prepare for dirty log start. Per Intel
> > Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> > It enables Accessed/Dirty Flags in second-level paging entries.
> > So, a generic iommu interface here is better. For Intel iommu, it
> > enables SLADE. For ARM, it splits block.
> Indeed, a generic interface name is better.
> 
> The vendor iommu driver plays vendor's specific actions to start dirty log, and Intel iommu and ARM smmu may differ. Besides, we may add more actions in ARM smmu driver in future.
> 
> One question: Though I am not familiar with Intel iommu, I think it also should split block mapping besides enable SLADE. Right?
> 
I am not familiar with the ARM SMMU. :) So I want to clarify whether the
block in the SMMU is a big page, e.g. a 2MB page. Intel VT-d manages memory
per page, at 4KB/2MB/1GB granularity. There are two ways to manage dirty pages:
1. Keep the default granularity. Just set SLADE to enable dirty tracking.
2. Split big pages to 4KB to get finer granularity.

But the question about the second solution is whether it can benefit user
space, e.g. live migration. If my understanding of the SMMU block (i.e.
the big page) is correct, have you collected any performance data to
prove that the split can improve performance? Thanks!

> Thanks,
> Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-09 11:16     ` Robin Murphy
@ 2021-02-09 12:02       ` Yi Sun
  0 siblings, 0 replies; 36+ messages in thread
From: Yi Sun @ 2021-02-09 12:02 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Mark Rutland, kevin.tian, Cornelia Huck, yan.y.zhao, kvm,
	Will Deacon, Suzuki K Poulose, Catalin Marinas, jiangkunkun,
	Alex Williamson, iommu, linux-kernel, lushenming, Kirti Wankhede,
	James Morse, Marc Zyngier, wanghaibin.wang, kvmarm,
	linux-arm-kernel

On 21-02-09 11:16:08, Robin Murphy wrote:
> On 2021-02-07 09:56, Yi Sun wrote:
> >Hi,
> >
> >On 21-01-28 23:17:41, Keqian Zhu wrote:
> >
> >[...]
> >
> >>+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
> >>+				     struct vfio_dma *dma)
> >>+{
> >>+	struct vfio_domain *d;
> >>+
> >>+	list_for_each_entry(d, &iommu->domain_list, next) {
> >>+		/* Go through all domain anyway even if we fail */
> >>+		iommu_split_block(d->domain, dma->iova, dma->size);
> >>+	}
> >>+}
> >
> >This should be a switch to prepare for dirty log start. Per Intel
> >Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> >It enables Accessed/Dirty Flags in second-level paging entries.
> >So, a generic iommu interface here is better. For Intel iommu, it
> >enables SLADE. For ARM, it splits block.
> 
> From a quick look, VT-D's SLADE and SMMU's HTTU appear to be the
> exact same thing. This step isn't about enabling or disabling that
> feature itself (the proposal for SMMU is to simply leave HTTU
> enabled all the time), it's about controlling the granularity at
> which the dirty status can be detected/reported at all, since that's
> tied to the pagetable structure.
> 
> However, if an IOMMU were to come along with some other way of
> reporting dirty status that didn't depend on the granularity of
> individual mappings, then indeed it wouldn't need this operation.
> 
In my view, we can use these two start/stop interfaces to let user
space decide when to start/stop dirty tracking. For Intel SLADE, I
think we can enable this bit when the start interface is called by
user space. I don't think leaving SLADE enabled all the time is
necessary for Intel VT-d. So I suggest a generic interface here.
Thanks!
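
To make the shape concrete, here is a minimal sketch of such a generic
pair of hooks, assuming hypothetical dirty_log_start/dirty_log_stop
callbacks were added to struct iommu_ops (the names, signatures and
wrappers below are illustrative only, not an existing API):

/*
 * Hypothetical additions to struct iommu_ops (illustrative only):
 *	int (*dirty_log_start)(struct iommu_domain *domain,
 *			       unsigned long iova, size_t size, int prot);
 *	int (*dirty_log_stop)(struct iommu_domain *domain,
 *			      unsigned long iova, size_t size, int prot);
 */

int iommu_dirty_log_start(struct iommu_domain *domain,
			  unsigned long iova, size_t size, int prot)
{
	if (!domain->ops->dirty_log_start)
		return -EOPNOTSUPP;

	/*
	 * Intel: set SLADE in the scalable-mode PASID table entry.
	 * ARM SMMUv3: split block mappings so dirty state is tracked
	 * at page granularity. Other IOMMUs: whatever they need, or
	 * nothing at all.
	 */
	return domain->ops->dirty_log_start(domain, iova, size, prot);
}

int iommu_dirty_log_stop(struct iommu_domain *domain,
			 unsigned long iova, size_t size, int prot)
{
	if (!domain->ops->dirty_log_stop)
		return -EOPNOTSUPP;

	/* e.g. clear SLADE, or merge pages back into block mappings */
	return domain->ops->dirty_log_stop(domain, iova, size, prot);
}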

> Robin.
> 
> >>+
> >>+static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
> >>+				    struct vfio_dma *dma)
> >>+{
> >>+	struct vfio_domain *d;
> >>+
> >>+	list_for_each_entry(d, &iommu->domain_list, next) {
> >>+		/* Go through all domain anyway even if we fail */
> >>+		iommu_merge_page(d->domain, dma->iova, dma->size,
> >>+				 d->prot | dma->prot);
> >>+	}
> >>+}
> >
> >Same as above comment, a generic interface is required here.
> >
> >>+
> >>+static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
> >>+{
> >>+	struct rb_node *n;
> >>+
> >>+	/* Split and merge even if all iommu don't support HWDBM now */
> >>+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> >>+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >>+
> >>+		if (!dma->iommu_mapped)
> >>+			continue;
> >>+
> >>+		/* Go through all dma range anyway even if we fail */
> >>+		if (start)
> >>+			vfio_dma_dirty_log_start(iommu, dma);
> >>+		else
> >>+			vfio_dma_dirty_log_stop(iommu, dma);
> >>+	}
> >>+}
> >>+
> >>  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> >>  					unsigned long arg)
> >>  {
> >>@@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> >>  		pgsize = 1 << __ffs(iommu->pgsize_bitmap);
> >>  		if (!iommu->dirty_page_tracking) {
> >>  			ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
> >>-			if (!ret)
> >>+			if (!ret) {
> >>  				iommu->dirty_page_tracking = true;
> >>+				vfio_iommu_dirty_log_switch(iommu, true);
> >>+			}
> >>  		}
> >>  		mutex_unlock(&iommu->lock);
> >>  		return ret;
> >>@@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> >>  		if (iommu->dirty_page_tracking) {
> >>  			iommu->dirty_page_tracking = false;
> >>  			vfio_dma_bitmap_free_all(iommu);
> >>+			vfio_iommu_dirty_log_switch(iommu, false);
> >>  		}
> >>  		mutex_unlock(&iommu->lock);
> >>  		return 0;
> >>-- 
> >>2.19.1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-09 11:57       ` Yi Sun
@ 2021-02-09 12:08         ` Robin Murphy
  2021-02-10  6:15         ` zhukeqian
  2021-02-18  1:17         ` Keqian Zhu
  2 siblings, 0 replies; 36+ messages in thread
From: Robin Murphy @ 2021-02-09 12:08 UTC (permalink / raw)
  To: Yi Sun, Keqian Zhu
  Cc: Mark Rutland, kevin.tian, Cornelia Huck, yan.y.zhao, kvm,
	Suzuki K Poulose, Catalin Marinas, jiangkunkun, Alex Williamson,
	iommu, linux-kernel, lushenming, Kirti Wankhede, James Morse,
	Marc Zyngier, wanghaibin.wang, Will Deacon, kvmarm,
	linux-arm-kernel

On 2021-02-09 11:57, Yi Sun wrote:
> On 21-02-07 18:40:36, Keqian Zhu wrote:
>> Hi Yi,
>>
>> On 2021/2/7 17:56, Yi Sun wrote:
>>> Hi,
>>>
>>> On 21-01-28 23:17:41, Keqian Zhu wrote:
>>>
>>> [...]
>>>
>>>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>>>> +				     struct vfio_dma *dma)
>>>> +{
>>>> +	struct vfio_domain *d;
>>>> +
>>>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>>>> +		/* Go through all domain anyway even if we fail */
>>>> +		iommu_split_block(d->domain, dma->iova, dma->size);
>>>> +	}
>>>> +}
>>>
>>> This should be a switch to prepare for dirty log start. Per Intel
>>> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
>>> It enables Accessed/Dirty Flags in second-level paging entries.
>>> So, a generic iommu interface here is better. For Intel iommu, it
>>> enables SLADE. For ARM, it splits block.
>> Indeed, a generic interface name is better.
>>
>> The vendor iommu driver plays vendor's specific actions to start dirty log, and Intel iommu and ARM smmu may differ. Besides, we may add more actions in ARM smmu driver in future.
>>
>> One question: Though I am not familiar with Intel iommu, I think it also should split block mapping besides enable SLADE. Right?
>>
> I am not familiar with ARM smmu. :) So I want to clarify if the block
> in smmu is big page, e.g. 2M page? Intel Vtd manages the memory per
> page, 4KB/2MB/1GB.

Indeed, what you call large pages, we call blocks :)

Robin.

> There are two ways to manage dirty pages.
> 1. Keep default granularity. Just set SLADE to enable the dirty track.
> 2. Split big page to 4KB to get finer granularity.
> 
> But question about the second solution is if it can benefit the user
> space, e.g. live migration. If my understanding about smmu block (i.e.
> the big page) is correct, have you collected some performance data to
> prove that the split can improve performance? Thanks!
> 
>> Thanks,
>> Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-09 11:57       ` Yi Sun
  2021-02-09 12:08         ` Robin Murphy
@ 2021-02-10  6:15         ` zhukeqian
  2021-02-18  1:17         ` Keqian Zhu
  2 siblings, 0 replies; 36+ messages in thread
From: zhukeqian @ 2021-02-10  6:15 UTC (permalink / raw)
  To: Yi Sun
  Cc: Mark Rutland, kvm, Catalin Marinas, Kirti Wankhede, Will Deacon,
	kvmarm, Marc Zyngier, jiangkunkun, Wanghaibin (D),
	Tian, Kevin, yan.y.zhao, Suzuki K Poulose, Alex Williamson,
	linux-arm-kernel, Cornelia Huck, linux-kernel, lushenming, iommu,
	James Morse, Robin Murphy


Hi,

Cheers for this discussion. :)

I am on vacation and will come back on 2.18.

Thanks,
Keqian

On 21-02-07 18:40:36, Keqian Zhu wrote:
> Hi Yi,
>
> On 2021/2/7 17:56, Yi Sun wrote:
> > Hi,
> >
> > On 21-01-28 23:17:41, Keqian Zhu wrote:
> >
> > [...]
> >
> >> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
> >> +				     struct vfio_dma *dma)
> >> +{
> >> +	struct vfio_domain *d;
> >> +
> >> +	list_for_each_entry(d, &iommu->domain_list, next) {
> >> +		/* Go through all domain anyway even if we fail */
> >> +		iommu_split_block(d->domain, dma->iova, dma->size);
> >> +	}
> >> +}
> >
> > This should be a switch to prepare for dirty log start. Per Intel
> > Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> > It enables Accessed/Dirty Flags in second-level paging entries.
> > So, a generic iommu interface here is better. For Intel iommu, it
> > enables SLADE. For ARM, it splits block.
> Indeed, a generic interface name is better.
>
> The vendor iommu driver plays vendor's specific actions to start dirty log, and Intel iommu and ARM smmu may differ. Besides, we may add more actions in ARM smmu driver in future.
>
> One question: Though I am not familiar with Intel iommu, I think it also should split block mapping besides enable SLADE. Right?
>
I am not familiar with ARM smmu. :) So I want to clarify if the block
in smmu is big page, e.g. 2M page? Intel Vtd manages the memory per
page, 4KB/2MB/1GB. There are two ways to manage dirty pages.
1. Keep default granularity. Just set SLADE to enable the dirty track.
2. Split big page to 4KB to get finer granularity.

But question about the second solution is if it can benefit the user
space, e.g. live migration. If my understanding about smmu block (i.e.
the big page) is correct, have you collected some performance data to
prove that the split can improve performance? Thanks!

> Thanks,
> Keqian


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM
  2021-02-09 11:57       ` Yi Sun
  2021-02-09 12:08         ` Robin Murphy
  2021-02-10  6:15         ` zhukeqian
@ 2021-02-18  1:17         ` Keqian Zhu
  2 siblings, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-02-18  1:17 UTC (permalink / raw)
  To: Yi Sun
  Cc: Mark Rutland, kvm, Catalin Marinas, Kirti Wankhede, Will Deacon,
	kvmarm, Marc Zyngier, jiangkunkun, wanghaibin.wang, kevin.tian,
	yan.y.zhao, Suzuki K Poulose, Alex Williamson, linux-arm-kernel,
	Cornelia Huck, linux-kernel, lushenming, iommu, James Morse,
	Robin Murphy

Hi Yi,

On 2021/2/9 19:57, Yi Sun wrote:
> On 21-02-07 18:40:36, Keqian Zhu wrote:
>> Hi Yi,
>>
>> On 2021/2/7 17:56, Yi Sun wrote:
>>> Hi,
>>>
>>> On 21-01-28 23:17:41, Keqian Zhu wrote:
>>>
>>> [...]
>>>
>>>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>>>> +				     struct vfio_dma *dma)
>>>> +{
>>>> +	struct vfio_domain *d;
>>>> +
>>>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>>>> +		/* Go through all domain anyway even if we fail */
>>>> +		iommu_split_block(d->domain, dma->iova, dma->size);
>>>> +	}
>>>> +}
>>>
>>> This should be a switch to prepare for dirty log start. Per Intel
>>> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
>>> It enables Accessed/Dirty Flags in second-level paging entries.
>>> So, a generic iommu interface here is better. For Intel iommu, it
>>> enables SLADE. For ARM, it splits block.
>> Indeed, a generic interface name is better.
>>
>> The vendor iommu driver plays vendor's specific actions to start dirty log, and Intel iommu and ARM smmu may differ. Besides, we may add more actions in ARM smmu driver in future.
>>
>> One question: Though I am not familiar with Intel iommu, I think it also should split block mapping besides enable SLADE. Right?
>>
> I am not familiar with ARM smmu. :) So I want to clarify if the block
> in smmu is big page, e.g. 2M page? Intel Vtd manages the memory per
Yes, for ARM, the "block" is a big page :).

> page, 4KB/2MB/1GB. There are two ways to manage dirty pages.
> 1. Keep default granularity. Just set SLADE to enable the dirty track.
> 2. Split big page to 4KB to get finer granularity.
According to your statement, I see that VT-d's SLADE behaves like SMMU HTTU. They are both based on the page table.

Right, we should give the vendor IOMMU driver more freedom, so a generic interface is better (a rough sketch follows below):
1) As you said, set SLADE when enabling the dirty log.
2) IOMMUs of other architectures may have completely different dirty tracking mechanisms.
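
A rough sketch of how the vfio side could then look, assuming the
hypothetical iommu_dirty_log_start()/iommu_dirty_log_stop() wrappers
from the earlier discussion (illustrative only, not an existing API;
vfio would no longer care whether the vendor driver sets SLADE, splits
blocks, or does nothing):

static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
				     struct vfio_dma *dma)
{
	struct vfio_domain *d;

	list_for_each_entry(d, &iommu->domain_list, next) {
		/* Go through all domains anyway even if one fails */
		iommu_dirty_log_start(d->domain, dma->iova, dma->size,
				      d->prot | dma->prot);
	}
}

static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
				    struct vfio_dma *dma)
{
	struct vfio_domain *d;

	list_for_each_entry(d, &iommu->domain_list, next) {
		/* Likewise, keep going on failure */
		iommu_dirty_log_stop(d->domain, dma->iova, dma->size,
				     d->prot | dma->prot);
	}
}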

> 
> But question about the second solution is if it can benefit the user
> space, e.g. live migration. If my understanding about smmu block (i.e.
> the big page) is correct, have you collected some performance data to
> prove that the split can improve performance? Thanks!
The purpose of splitting block mappings is to reduce the amount of dirty data reported, which depends on the actual DMA transactions.
Take an extreme example: if DMA writes one byte, then under a 1G mapping the dirty amount reported to userspace is 1G, but under a 4K mapping the dirty amount is just 4K.
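
To make the arithmetic concrete, a tiny self-contained sketch (the helper
name and the example IOVA are just for illustration):

#include <stdio.h>

/*
 * Dirty bytes reported for a single DMA write of 'len' bytes at 'addr',
 * when dirty state can only be tracked per mapping of size 'granule'.
 */
static unsigned long long dirty_reported(unsigned long long addr,
					 unsigned long long len,
					 unsigned long long granule)
{
	unsigned long long start = addr & ~(granule - 1);
	unsigned long long end = (addr + len + granule - 1) & ~(granule - 1);

	return end - start;
}

int main(void)
{
	/* DMA writes one byte at an arbitrary IOVA */
	printf("1G mapping: %llu bytes reported dirty\n",
	       dirty_reported(0x12345678ULL, 1, 1ULL << 30)); /* 1073741824 */
	printf("4K mapping: %llu bytes reported dirty\n",
	       dirty_reported(0x12345678ULL, 1, 4096));       /* 4096 */
	return 0;
}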

I will add more detail to the commit message in v2.

Thanks,
Keqian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU
  2021-02-04 19:50   ` Robin Murphy
  2021-02-05  9:13     ` Keqian Zhu
@ 2021-03-02  7:42     ` Keqian Zhu
  1 sibling, 0 replies; 36+ messages in thread
From: Keqian Zhu @ 2021-03-02  7:42 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, kvm, kvmarm, iommu,
	Will Deacon, Alex Williamson, Marc Zyngier, Catalin Marinas
  Cc: Mark Rutland, jiangkunkun, Suzuki K Poulose, Cornelia Huck,
	lushenming, Kirti Wankhede, James Morse, wanghaibin.wang

Hi Robin,

I am going to send v2 next week, addressing the issues you reported. Many thanks!
And do you have any further comments on patches #4, #5 and #6?

Thanks,
Keqian

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun <jiangkunkun@huawei.com>
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ++++++++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++++++++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>           .pgsize_bitmap    = smmu->pgsize_bitmap,
>>           .ias        = ias,
>>           .oas        = oas,
>> +        .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>           .coherent_walk    = smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>           .tlb        = &arm_smmu_flush_ops,
>>           .iommu_dev    = smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>>       if (reg & IDR0_HYP)
>>           smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +    switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be meaningful. Jean-Philippe is already carrying a suitable patch in the SVA stack[1].
> 
>> +    case IDR0_HTTU_NONE:
>> +        break;
>> +    case IDR0_HTTU_HA:
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +        break;
>> +    case IDR0_HTTU_HAD:
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +        smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +        break;
>> +    default:
>> +        dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +        return -ENXIO;
>> +    }
>> +
>>       /*
>>        * The coherency feature as set by FW is used in preference to the ID
>>        * register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16            (1 << 12)
>>   #define IDR0_ATS            (1 << 10)
>>   #define IDR0_HYP            (1 << 9)
>> +#define IDR0_HTTU            GENMASK(7, 6)
>> +#define IDR0_HTTU_NONE            0
>> +#define IDR0_HTTU_HA            1
>> +#define IDR0_HTTU_HAD            2
>>   #define IDR0_COHACC            (1 << 4)
>>   #define IDR0_TTF            GENMASK(3, 2)
>>   #define IDR0_TTF_AARCH64        2
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0        (1ULL << 38)
>>     #define CTXDESC_CD_0_AA64        (1UL << 41)
>> +#define CTXDESC_CD_0_HD            (1UL << 42)
>> +#define CTXDESC_CD_0_HA            (1UL << 43)
>>   #define CTXDESC_CD_0_S            (1UL << 44)
>>   #define CTXDESC_CD_0_R            (1UL << 45)
>>   #define CTXDESC_CD_0_A            (1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV        (1 << 15)
>>   #define ARM_SMMU_FEAT_BTM        (1 << 16)
>>   #define ARM_SMMU_FEAT_SVA        (1 << 17)
>> +#define ARM_SMMU_FEAT_HTTU_HA        (1 << 18)
>> +#define ARM_SMMU_FEAT_HTTU_HD        (1 << 19)
>>       u32                features;
>>     #define ARM_SMMU_OPT_SKIP_PREFETCH    (1 << 0)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index ea727eb1a1a9..1a00ea8562c7 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>       unsigned long            pgsize_bitmap;
>>       unsigned int            ias;
>>       unsigned int            oas;
>> +    bool                httu_hd;
> 
> This is very specific to the AArch64 stage 1 format, not a generic capability - I think it should be a quirk flag rather than a common field.
> 
> Robin.
> 
> [1] https://jpbrucker.net/git/linux/commit/?h=sva/current&id=1ef7d512fb9082450dfe0d22ca4f7e35625a097b
> 
>>       bool                coherent_walk;
>>       const struct iommu_flush_ops    *tlb;
>>       struct device            *iommu_dev;
>>
> .
> 
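
For reference, a rough sketch of the quirk-flag shape suggested above,
treating IO_PGTABLE_QUIRK_ARM_HD, its bit position and ARM_LPAE_PTE_DBM
as hypothetical names chosen for this sketch rather than definitions that
exist in the tree under discussion:

/* include/linux/io-pgtable.h: a quirk instead of a common bool field */
#define IO_PGTABLE_QUIRK_ARM_HD		BIT(7)	/* AArch64 stage-1 HTTU HD */

/* arm-smmu-v3.c, arm_smmu_domain_finalise(): request it per domain */
if (smmu->features & ARM_SMMU_FEAT_HTTU_HD)
	pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

/* io-pgtable-arm.c: only the AArch64 stage-1 format acts on it, e.g. by
 * setting DBM so writable mappings start out as writable-clean */
if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD)
	pte |= ARM_LPAE_PTE_DBM;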

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-03-02  7:42 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28 15:17 [RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU Keqian Zhu
2021-02-04 19:50   ` Robin Murphy
2021-02-05  9:13     ` Keqian Zhu
2021-02-05  9:51       ` Jean-Philippe Brucker
2021-02-07  1:42         ` Keqian Zhu
2021-02-05 11:48       ` Robin Murphy
2021-02-05 16:11         ` Robin Murphy
2021-02-07  1:56           ` Keqian Zhu
2021-02-07  2:19         ` Keqian Zhu
2021-03-02  7:42     ` Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 02/11] iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 03/11] iommu/arm-smmu-v3: Add feature detection for BBML Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page Keqian Zhu
2021-02-04 19:51   ` Robin Murphy
2021-02-07  8:18     ` Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor Keqian Zhu
2021-02-04 19:52   ` Robin Murphy
2021-02-07 12:13     ` Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log Keqian Zhu
2021-02-04 19:52   ` Robin Murphy
2021-02-07 12:41     ` Keqian Zhu
2021-02-08  1:17     ` Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 09/11] vfio/iommu_type1: Add HWDBM status maintanance Keqian Zhu
2021-01-28 15:17 ` [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM Keqian Zhu
2021-02-07  9:56   ` Yi Sun
2021-02-07 10:40     ` Keqian Zhu
2021-02-09 11:57       ` Yi Sun
2021-02-09 12:08         ` Robin Murphy
2021-02-10  6:15         ` zhukeqian
2021-02-18  1:17         ` Keqian Zhu
2021-02-09 11:16     ` Robin Murphy
2021-02-09 12:02       ` Yi Sun
2021-01-28 15:17 ` [RFC PATCH 11/11] vfio/iommu_type1: Add support for manual dirty log clear Keqian Zhu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).