iommu.lists.linux-foundation.org archive mirror
* [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
@ 2020-06-22 17:28 John Garry
  2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg() loop
used to get space on the queue takes approximately 25% of the cycles in this
function.

This series removes that cmpxchg().

For my NVMe test with 3x NVMe SSDs, I'm seeing a ~24% throughput
increase:
Before: 1310 IOPS
After: 1630 IOPS

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count	32	64	128
Before:		63187	19418	10169
After:		93287	44789	15862

(unit is map+unmaps per CPU per second)

[0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (4):
  iommu/arm-smmu-v3: Fix trivial typo
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
 1 file changed, 151 insertions(+), 82 deletions(-)

-- 
2.26.2

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
@ 2020-06-22 17:28 ` John Garry
  2020-06-22 17:28 ` [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner John Garry
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

Fix "cmq" -> "cmdq".

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..a8e814c652fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1479,7 +1479,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		}
 
 		/*
-		 * Try to unlock the cmq lock. This will fail if we're the last
+		 * Try to unlock the cmdq lock. This will fail if we're the last
 		 * reader, in which case we can safely update cmdq->q.llq.cons
 		 */
 		if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-- 
2.26.2



* [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
  2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
@ 2020-06-22 17:28 ` John Garry
  2020-06-22 17:28 ` [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch John Garry
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

Since arm_smmu_ll_queue.prod will be used both for counting the "owner"
value and for the HW prod pointer, calculate how many bits are available
for, and used by, each.

This is based on the number of possible CPUs in the system, and we require
that each CPU can issue a minimum of two commands per batch: one CMD_SYNC
and at least one other command.

Ignoring the limits of HW max_n_shift and HW cmdq memory allocation,
approximately 32K CPUs is the maximum supported. For this, max_n_shift
would be 15.
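As an illustration, the bit budget can be sketched in standalone C (this is
not driver code; ilog2_u32() is a stand-in for the kernel's ilog2(), and the
function name is invented for the sketch). For 32K CPUs it yields the 15 bits
mentioned above:

```c
#include <stdint.h>

/* Stand-in for the kernel's ilog2(): index of the highest set bit. */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Bits left in a 32-bit prod word for the HW prod pointer: owner
 * counting needs ilog2(cpus) + 1 bits, and one extra bit keeps
 * prod+wrap from overflowing into the owner region.
 */
static int bits_available_for_prod(uint32_t cpus)
{
	unsigned int bits_for_cmdq_owner = ilog2_u32(cpus) + 1;

	return 32 - 1 - (int)bits_for_cmdq_owner;
}
```

For example, 128 possible CPUs leaves 23 bits available for prod, which is
comfortably above any practical max_n_shift.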

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 64 ++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index a8e814c652fe..4e9677b066f1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -532,6 +532,8 @@ struct arm_smmu_ll_queue {
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
 	u32				max_n_shift;
+	u32				max_cmd_per_batch;
+	u32				owner_count_shift;
 };
 
 struct arm_smmu_queue {
@@ -1515,7 +1517,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 				    struct arm_smmu_cmdq_batch *cmds,
 				    struct arm_smmu_cmdq_ent *cmd)
 {
-	if (cmds->num == CMDQ_BATCH_ENTRIES) {
+	struct arm_smmu_cmdq *q = &smmu->cmdq;
+	struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+	if (cmds->num == llq->max_cmd_per_batch) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
 		cmds->num = 0;
 	}
@@ -3177,6 +3182,58 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	return 0;
 }
 
+static int arm_smmu_init_cmd_queue(struct arm_smmu_device *smmu,
+				   struct arm_smmu_queue *q,
+				   unsigned long prod_off,
+				   unsigned long cons_off,
+				   size_t dwords)
+{
+	u32 cpus = num_possible_cpus(), bits_for_cmdq_owner,
+		bits_available_for_prod, entries_for_prod;
+	int ret;
+
+	/*
+	 * We can get the number of bits required for owner counting by
+	 * log2(nr possible cpus) + 1
+	 */
+	bits_for_cmdq_owner = ilog2(cpus) + 1;
+
+	/*
+	 * Add an extra bit to ensure prod(+wrap) do not overflow into
+	 * owner count.
+	 */
+	bits_available_for_prod = 32 - 1 - bits_for_cmdq_owner;
+
+	if (bits_available_for_prod < 1) /* How many CPUs??? */
+		return -ENOMEM;
+
+	q->llq.max_n_shift = min(q->llq.max_n_shift, bits_available_for_prod);
+
+	ret = arm_smmu_init_one_queue(smmu, q, prod_off, cons_off, dwords,
+				      "cmdq");
+	if (ret)
+		return ret;
+
+	entries_for_prod = 1 << q->llq.max_n_shift;
+
+	/*
+	 * We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+	 * whatever else).
+	 */
+	if (entries_for_prod < 2 * cpus)
+		return -ENOMEM;
+
+	/*
+	 * When finding max_cmd_per_batch, deduct 1 entry per batch to take
+	 * account of CMD_SYNC
+	 */
+	q->llq.max_cmd_per_batch = min((entries_for_prod / cpus) - 1,
+				       (u32)CMDQ_BATCH_ENTRIES);
+	q->llq.owner_count_shift = q->llq.max_n_shift + 2;
+
+	return 0;
+}
+
 static void arm_smmu_cmdq_free_bitmap(void *data)
 {
 	unsigned long *bitmap = data;
@@ -3210,9 +3267,8 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	int ret;
 
 	/* cmdq */
-	ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
-				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS,
-				      "cmdq");
+	ret = arm_smmu_init_cmd_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
+				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS);
 	if (ret)
 		return ret;
 
-- 
2.26.2



* [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
  2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
  2020-06-22 17:28 ` [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner John Garry
@ 2020-06-22 17:28 ` John Garry
  2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

To ensure that a CPU does not send more than its permitted number of
commands to the cmdq, ensure that each batch includes a CMD_SYNC. When
issuing a CMD_SYNC, we always wait for the consumption of its batch of
commands; as such, we guarantee that no CPU will issue more than its
permitted number.
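The resulting invariant can be sanity-checked with a small C sketch
(hypothetical helper, not from the driver): a CPU's worst-case in-flight
work is one batch of at most max_cmd_per_batch commands plus its CMD_SYNC,
so the max_cmd_per_batch sizing from the previous patch guarantees that
every CPU's batch fits in the queue at once:

```c
#include <stdint.h>

/*
 * With max_cmd_per_batch = (entries / cpus) - 1 (capped by
 * CMDQ_BATCH_ENTRIES), cpus * (max_cmd_per_batch + 1) <= entries,
 * i.e. all CPUs' batches plus their CMD_SYNCs fit simultaneously.
 */
static int all_batches_fit(uint32_t entries, uint32_t cpus,
			   uint32_t max_cmd_per_batch)
{
	return (uint64_t)cpus * (max_cmd_per_batch + 1) <= entries;
}
```

For instance, a 4096-entry cmdq shared by 128 CPUs gives
max_cmd_per_batch = 31, and 128 * 32 = 4096 entries exactly fills the
queue; one more command per batch would not fit.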

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 86 +++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 47 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4e9677b066f1..45a39ccaf455 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1373,11 +1373,15 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  * - Command insertion is totally ordered, so if two CPUs each race to
  *   insert their own list of commands then all of the commands from one
  *   CPU will appear before any of the commands from the other CPU.
+ *
+ * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
+ *   more than the permitted amount commands at once.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
-				       u64 *cmds, int n, bool sync)
+				       u64 *cmds, int n)
 {
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
+	const int sync = 1;
 	u32 prod;
 	unsigned long flags;
 	bool owner;
@@ -1419,19 +1423,17 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	 * Dependency ordering from the cmpxchg() loop above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
-	if (sync) {
-		prod = queue_inc_prod_n(&llq, n);
-		arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
-		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
 
-		/*
-		 * In order to determine completion of our CMD_SYNC, we must
-		 * ensure that the queue can't wrap twice without us noticing.
-		 * We achieve that by taking the cmdq lock as shared before
-		 * marking our slot as valid.
-		 */
-		arm_smmu_cmdq_shared_lock(cmdq);
-	}
+	prod = queue_inc_prod_n(&llq, n);
+	arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
+	queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
+
+	/*
+	 * In order to determine completion of our CMD_SYNC, we must ensure
+	 * that the queue can't wrap twice without us noticing. We achieve that
+	 * by taking the cmdq lock as shared before marking our slot as valid.
+	 */
+	arm_smmu_cmdq_shared_lock(cmdq);
 
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
 	dma_wmb();
@@ -1468,26 +1470,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		atomic_set_release(&cmdq->owner_prod, prod);
 	}
 
-	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
-	if (sync) {
-		llq.prod = queue_inc_prod_n(&llq, n);
-		ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
-		if (ret) {
-			dev_err_ratelimited(smmu->dev,
-					    "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
-					    llq.prod,
-					    readl_relaxed(cmdq->q.prod_reg),
-					    readl_relaxed(cmdq->q.cons_reg));
-		}
+	/* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
+	llq.prod = queue_inc_prod_n(&llq, n);
+	ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
+	if (ret)
+		dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
+				    llq.prod, readl_relaxed(cmdq->q.prod_reg),
+				    readl_relaxed(cmdq->q.cons_reg));
 
-		/*
-		 * Try to unlock the cmdq lock. This will fail if we're the last
-		 * reader, in which case we can safely update cmdq->q.llq.cons
-		 */
-		if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-			WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
-			arm_smmu_cmdq_shared_unlock(cmdq);
-		}
+	/*
+	 * Try to unlock the cmdq lock. This will fail if we're the last reader,
+	 * in which case we can safely update cmdq->q.llq.cons
+	 */
+	if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
+		WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
+		arm_smmu_cmdq_shared_unlock(cmdq);
 	}
 
 	local_irq_restore(flags);
@@ -1505,12 +1502,7 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
 		return -EINVAL;
 	}
 
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
-}
-
-static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-	return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
+	return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1);
 }
 
 static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
@@ -1521,7 +1513,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 	struct arm_smmu_ll_queue *llq = &q->q.llq;
 
 	if (cmds->num == llq->max_cmd_per_batch) {
-		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
+		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
 		cmds->num = 0;
 	}
 	arm_smmu_cmdq_build_cmd(&cmds->cmds[cmds->num * CMDQ_ENT_DWORDS], cmd);
@@ -1531,7 +1523,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 				      struct arm_smmu_cmdq_batch *cmds)
 {
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, true);
+	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
 }
 
 /* Context descriptor manipulation functions */
@@ -1803,7 +1795,6 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	};
 
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -2197,17 +2188,21 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 
 static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
 {
-	int i;
+	int i, ret = 0;
 	struct arm_smmu_cmdq_ent cmd;
 
 	arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
 
 	for (i = 0; i < master->num_sids; i++) {
+		int rc;
+
 		cmd.atc.sid = master->sids[i];
-		arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+		rc = arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+		if (rc)
+			ret = rc;
 	}
 
-	return arm_smmu_cmdq_issue_sync(master->smmu);
+	return ret;
 }
 
 static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
@@ -2280,7 +2275,6 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	 * careful, 007.
 	 */
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
 }
 
@@ -3667,7 +3661,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	/* Invalidate any cached configuration */
 	cmd.opcode = CMDQ_OP_CFGI_ALL;
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 
 	/* Invalidate any stale TLB entries */
 	if (smmu->features & ARM_SMMU_FEAT_HYP) {
@@ -3677,7 +3670,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 	cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 
 	/* Event queue */
 	writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
-- 
2.26.2



* [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
                   ` (2 preceding siblings ...)
  2020-06-22 17:28 ` [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch John Garry
@ 2020-06-22 17:28 ` John Garry
  2020-06-23  1:07   ` kernel test robot
  2020-07-16 10:20   ` Will Deacon
  2020-07-08 13:00 ` [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
  2020-07-16 10:19 ` Will Deacon
  5 siblings, 2 replies; 25+ messages in thread
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- as more CPUs contend for the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of two related changes:

- Update prod and the cmdq owner in a single atomic add operation. For
  this, we count the prod and owner in separate regions of the prod memory.

  As with simple binary counting, once the prod+wrap fields overflow, they
  will wrap to zero. They should never overflow into the "owner" region,
  and we zero the non-owner (prod) region for each new owner. This
  maintains the prod pointer.

  As for the "owner", we now count this value instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself the "owner" when it reads zero
  for this region. This zeroing also clears any possible overflow in the
  wrap+prod region, above.

  The owner is now responsible for all cmdq locking, to avoid possible
  deadlock. The owner will lock the cmdq on behalf of all the non-owners
  it has gathered, once they have space in the queue and have written
  their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be racy.

  Since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that
  is, according to the software-maintained prod and cons pointers. So we
  must now ensure that the entries are not written until there is space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect
  this condition and report no space here. However, a prod pointer that
  has progressed twice past the cons pointer cannot be detected. But we
  can ensure that this scenario does not occur, as we limit the number of
  commands any CPU can issue at any given time, such that the prod pointer
  cannot progress that far.
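The field packing behind the single atomic add can be sketched in userspace
C (plain uint32_t in place of the kernel atomics, single-threaded, and a
max_n_shift of 12 assumed purely for illustration; function names are
invented for the sketch):

```c
#include <stdint.h>

/* Illustrative layout only: a queue with max_n_shift == 12. */
enum {
	MAX_N_SHIFT       = 12,
	OWNER_COUNT_SHIFT = MAX_N_SHIFT + 2,	/* prod + wrap + 1 spare bit */
	OWNER_VAL         = 1 << OWNER_COUNT_SHIFT,
};

#define PROD_MASK	((1u << (MAX_N_SHIFT + 1)) - 1)	/* prod + wrap bits */
#define OWNER_MASK	(((1u << 31) - 1) & ~((1u << OWNER_COUNT_SHIFT) - 1))

/*
 * Mimics the patch's atomic_fetch_add(): claim n+sync entries and bump
 * the owner count in one operation. The CPU that reads a zero owner
 * count in the old value becomes the owner.
 */
static uint32_t claim_space(uint32_t *llq_prod, uint32_t n_plus_sync,
			    int *owner, uint32_t *my_prod)
{
	uint32_t old = *llq_prod;	/* atomic_fetch_add() return value */

	*llq_prod = old + n_plus_sync + OWNER_VAL;
	*owner = !(old & OWNER_MASK);
	*my_prod = old & PROD_MASK;
	return old;
}

/*
 * Mimics atomic_fetch_andnot_relaxed(~prod_mask, ...): the owner stops
 * gathering by zeroing everything above prod+wrap, and learns how many
 * CPUs (itself included) it gathered, so it knows by how much to take
 * the shared cmdq lock.
 */
static uint32_t owner_stop_gathering(uint32_t *llq_prod, int *owner_count)
{
	uint32_t tmp = *llq_prod;

	*llq_prod = tmp & PROD_MASK;
	*owner_count = (tmp & OWNER_MASK) >> OWNER_COUNT_SHIFT;
	return tmp & PROD_MASK;
}
```

Walking two "CPUs" through this: the first claim sees a zero owner count
and becomes owner; the second sees a non-zero count and only learns its
prod position; the owner's clearing step then recovers a gathered count of
two and the combined prod.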

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 45a39ccaf455..54d056f10bea 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	prod = Q_IDX(q, q->prod);
 	cons = Q_IDX(q, q->cons);
 
-	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+		/* check if we have wrapped, meaning definitely no space */
+		if (cons > prod)
+			return false;
+
 		space = (1 << q->max_n_shift) - (prod - cons);
-	else
+	} else {
+		/* similar check to above */
+		if (prod > cons)
+			return false;
+
 		space = cons - prod;
+	}
 
 	return space >= n;
 }
@@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  *   fails if the caller appears to be the last lock holder (yes, this is
  *   racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
 	int val;
 
@@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 	 * to INT_MIN so these increments won't hurt as the value will remain
 	 * negative.
 	 */
-	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+	if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
 		return;
 
 	do {
 		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  *   insert their own list of commands then all of the commands from one
  *   CPU will appear before any of the commands from the other CPU.
  *
- * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
- *   more than the permitted amount commands at once.
+ * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
+ *   for when the cmdq is full, such that we don't wrap more than twice.
+ *   It also makes it easy for the owner to know by how many to increment the
+ *   cmdq lock.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 				       u64 *cmds, int n)
@@ -1388,39 +1399,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
-	}, head = llq;
+	}, head = llq, space = llq;
+	u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
+	u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
+	u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
 	int ret = 0;
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
-	do {
-		u64 old;
 
-		while (!queue_has_space(&llq, n + sync)) {
-			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
-			local_irq_save(flags);
-		}
+	prod = atomic_fetch_add(n + sync + owner_val,
+				&cmdq->q.llq.atomic.prod);
 
-		head.cons = llq.cons;
-		head.prod = queue_inc_prod_n(&llq, n + sync) |
-					     CMDQ_PROD_OWNED_FLAG;
+	owner = !(prod & owner_mask);
+	llq.prod = prod_mask & prod;
+	head.prod = queue_inc_prod_n(&llq, n + sync);
 
-		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
-			break;
+	/*
+	 * Ensure it's safe to write the entries. For this, we need to ensure
+	 * that there is space in the queue from our prod pointer.
+	 */
+	space.cons = READ_ONCE(cmdq->q.llq.cons);
+	space.prod = llq.prod;
+	while (!queue_has_space(&space, n + sync)) {
+		if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
+			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 
-		llq.val = old;
-	} while (1);
-	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
-	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
-	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+		space.prod = llq.prod;
+	}
 
 	/*
 	 * 2. Write our commands into the queue
-	 * Dependency ordering from the cmpxchg() loop above.
+	 * Dependency ordering from the space-checking while loop, above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
 
@@ -1428,26 +1438,24 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
 	queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
 
-	/*
-	 * In order to determine completion of our CMD_SYNC, we must ensure
-	 * that the queue can't wrap twice without us noticing. We achieve that
-	 * by taking the cmdq lock as shared before marking our slot as valid.
-	 */
-	arm_smmu_cmdq_shared_lock(cmdq);
-
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
 	dma_wmb();
 	arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
 
 	/* 4. If we are the owner, take control of the SMMU hardware */
 	if (owner) {
+		int owner_count;
+		u32 prod_tmp;
+
 		/* a. Wait for previous owner to finish */
 		atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
 
-		/* b. Stop gathering work by clearing the owned flag */
-		prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG,
-						   &cmdq->q.llq.atomic.prod);
-		prod &= ~CMDQ_PROD_OWNED_FLAG;
+		/* b. Stop gathering work by clearing the owned mask */
+		prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
+						       &cmdq->q.llq.atomic.prod);
+		prod = prod_tmp & prod_mask;
+		owner_count = prod_tmp & owner_mask;
+		owner_count >>= cmdq->q.llq.owner_count_shift;
 
 		/*
 		 * c. Wait for any gathered work to be written to the queue.
@@ -1456,6 +1464,19 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		 */
 		arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
 
+		/*
+		 * In order to determine completion of the CMD_SYNCs, we must
+		 * ensure that the queue can't wrap twice without us noticing.
+		 * We achieve that by taking the cmdq lock as shared before
+		 * progressing the prod pointer.
+		 * The owner does this for all the non-owners it has gathered.
+		 * Otherwise, some non-owner(s) may lock the cmdq, blocking
+		 * cons being updating. This could be when the cmdq has just
+		 * become full. In this case, other sibling non-owners could be
+		 * blocked from progressing, leading to deadlock.
+		 */
+		arm_smmu_cmdq_shared_lock(cmdq, owner_count);
+
 		/*
 		 * d. Advance the hardware prod pointer
 		 * Control dependency ordering from the entries becoming valid.
-- 
2.26.2



* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
@ 2020-06-23  1:07   ` kernel test robot
  2020-06-23  9:21     ` John Garry
  2020-07-16 10:20   ` Will Deacon
  1 sibling, 1 reply; 25+ messages in thread
From: kernel test robot @ 2020-06-23  1:07 UTC (permalink / raw)
  To: John Garry, will, robin.murphy
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, kbuild-all,
	linux-arm-kernel

Hi John,

I love your patch! Perhaps something to improve:

[auto build test WARNING on iommu/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm64-randconfig-c024-20200622 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from include/linux/bits.h:23,
from include/linux/ioport.h:15,
from include/linux/acpi.h:12,
from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
|                            ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
|                                                              ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
|   ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
|                  ^~~~~~~
include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
|                                        ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
|                                                              ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
|   ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
|                  ^~~~~~~

vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c

  1369	
  1370	/*
  1371	 * This is the actual insertion function, and provides the following
  1372	 * ordering guarantees to callers:
  1373	 *
  1374	 * - There is a dma_wmb() before publishing any commands to the queue.
  1375	 *   This can be relied upon to order prior writes to data structures
  1376	 *   in memory (such as a CD or an STE) before the command.
  1377	 *
  1378	 * - On completion of a CMD_SYNC, there is a control dependency.
  1379	 *   This can be relied upon to order subsequent writes to memory (e.g.
  1380	 *   freeing an IOVA) after completion of the CMD_SYNC.
  1381	 *
  1382	 * - Command insertion is totally ordered, so if two CPUs each race to
  1383	 *   insert their own list of commands then all of the commands from one
  1384	 *   CPU will appear before any of the commands from the other CPU.
  1385	 *
  1386	 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
  1387	 *   for when the cmdq is full, such that we don't wrap more than twice.
  1388	 *   It also makes it easy for the owner to know by how many to increment the
  1389	 *   cmdq lock.
  1390	 */
  1391	static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
  1392					       u64 *cmds, int n)
  1393	{
  1394		u64 cmd_sync[CMDQ_ENT_DWORDS];
  1395		const int sync = 1;
  1396		u32 prod;
  1397		unsigned long flags;
  1398		bool owner;
  1399		struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
  1400		struct arm_smmu_ll_queue llq = {
  1401			.max_n_shift = cmdq->q.llq.max_n_shift,
  1402		}, head = llq, space = llq;
  1403		u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
> 1404		u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
  1405		u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
  1406		int ret = 0;
  1407	
  1408		/* 1. Allocate some space in the queue */
  1409		local_irq_save(flags);
  1410	
  1411		prod = atomic_fetch_add(n + sync + owner_val,
  1412					&cmdq->q.llq.atomic.prod);
  1413	
  1414		owner = !(prod & owner_mask);
  1415		llq.prod = prod_mask & prod;
  1416		head.prod = queue_inc_prod_n(&llq, n + sync);
  1417	
  1418		/*
  1419		 * Ensure it's safe to write the entries. For this, we need to ensure
  1420		 * that there is space in the queue from our prod pointer.
  1421		 */
  1422		space.cons = READ_ONCE(cmdq->q.llq.cons);
  1423		space.prod = llq.prod;
  1424		while (!queue_has_space(&space, n + sync)) {
  1425			if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
  1426				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
  1427	
  1428			space.prod = llq.prod;
  1429		}
  1430	
  1431		/*
  1432		 * 2. Write our commands into the queue
  1433		 * Dependency ordering from the space-checking while loop, above.
  1434		 */
  1435		arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
  1436	
  1437		prod = queue_inc_prod_n(&llq, n);
  1438		arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
  1439		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
  1440	
  1441		/* 3. Mark our slots as valid, ensuring commands are visible first */
  1442		dma_wmb();
  1443		arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
  1444	
  1445		/* 4. If we are the owner, take control of the SMMU hardware */
  1446		if (owner) {
  1447			int owner_count;
  1448			u32 prod_tmp;
  1449	
  1450			/* a. Wait for previous owner to finish */
  1451			atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
  1452	
  1453			/* b. Stop gathering work by clearing the owned mask */
  1454			prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
  1455							       &cmdq->q.llq.atomic.prod);
  1456			prod = prod_tmp & prod_mask;
  1457			owner_count = prod_tmp & owner_mask;
  1458			owner_count >>= cmdq->q.llq.owner_count_shift;
  1459	
  1460			/*
  1461			 * c. Wait for any gathered work to be written to the queue.
  1462			 * Note that we read our own entries so that we have the control
  1463			 * dependency required by (d).
  1464			 */
  1465			arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
  1466	
  1467			/*
  1468			 * In order to determine completion of the CMD_SYNCs, we must
  1469			 * ensure that the queue can't wrap twice without us noticing.
  1470			 * We achieve that by taking the cmdq lock as shared before
  1471			 * progressing the prod pointer.
  1472			 * The owner does this for all the non-owners it has gathered.
  1473			 * Otherwise, some non-owner(s) may lock the cmdq, blocking
  1474		 * cons from being updated. This could happen when the cmdq has just
  1475			 * become full. In this case, other sibling non-owners could be
  1476			 * blocked from progressing, leading to deadlock.
  1477			 */
  1478			arm_smmu_cmdq_shared_lock(cmdq, owner_count);
  1479	
  1480			/*
  1481			 * d. Advance the hardware prod pointer
  1482			 * Control dependency ordering from the entries becoming valid.
  1483			 */
  1484			writel_relaxed(prod, cmdq->q.prod_reg);
  1485	
  1486			/*
  1487			 * e. Tell the next owner we're done
  1488			 * Make sure we've updated the hardware first, so that we don't
  1489			 * race to update prod and potentially move it backwards.
  1490			 */
  1491			atomic_set_release(&cmdq->owner_prod, prod);
  1492		}
  1493	
  1494		/* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
  1495		llq.prod = queue_inc_prod_n(&llq, n);
  1496		ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
  1497		if (ret)
  1498			dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
  1499					    llq.prod, readl_relaxed(cmdq->q.prod_reg),
  1500					    readl_relaxed(cmdq->q.cons_reg));
  1501	
  1502		/*
  1503		 * Try to unlock the cmdq lock. This will fail if we're the last reader,
  1504		 * in which case we can safely update cmdq->q.llq.cons
  1505		 */
  1506		if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
  1507			WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
  1508			arm_smmu_cmdq_shared_unlock(cmdq);
  1509		}
  1510	
  1511		local_irq_restore(flags);
  1512		return ret;
  1513	}
  1514	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30901 bytes --]

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23  1:07   ` kernel test robot
@ 2020-06-23  9:21     ` John Garry
  2020-06-23  9:35       ` Rikard Falkeborn
  2020-06-23 16:22       ` Robin Murphy
  0 siblings, 2 replies; 25+ messages in thread
From: John Garry @ 2020-06-23  9:21 UTC (permalink / raw)
  To: kernel test robot, will, robin.murphy, rikard.falkeborn
  Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
	linux-arm-kernel

On 23/06/2020 02:07, kernel test robot wrote:

+ Rikard, as the GENMASK compile-time sanity checks were added recently

> Hi John,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on iommu/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use  as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm64-randconfig-c024-20200622 (attached as .config)
> compiler: aarch64-linux-gcc (GCC) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> 
> In file included from include/linux/bits.h:23,
> from include/linux/ioport.h:15,
> from include/linux/acpi.h:12,
> from drivers/iommu/arm-smmu-v3.c:12:
> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
> include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))

I'd say that GENMASK_INPUT_CHECK() should be able to handle l=0 with an 
unsigned h value, so I doubt this warning is valid.

Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks 
like GENMASK_INPUT_CHECK() could be improved.

> |                            ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> |                                                              ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> |   ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> |                  ^~~~~~~
> include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> |                                        ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> |                                                              ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> |   ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> |                  ^~~~~~~
> 
> vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c



* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23  9:21     ` John Garry
@ 2020-06-23  9:35       ` Rikard Falkeborn
  2020-06-23 10:19         ` John Garry
  2020-06-23 16:22       ` Robin Murphy
  1 sibling, 1 reply; 25+ messages in thread
From: Rikard Falkeborn @ 2020-06-23  9:35 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
	maz, kbuild-all, robin.murphy, linux-arm-kernel



Hi

Den tis 23 juni 2020 11:22John Garry <john.garry@huawei.com> skrev:

> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
> > Hi John,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on iommu/next]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use  as documented in
> > https://git-scm.com/docs/git-format-patch]
> >
> > url:
> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git
> next
> > config: arm64-randconfig-c024-20200622 (attached as .config)
> > compiler: aarch64-linux-gcc (GCC) 9.3.0
> >
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >
> > In file included from include/linux/bits.h:23,
> > from include/linux/ioport.h:15,
> > from include/linux/acpi.h:12,
> > from drivers/iommu/arm-smmu-v3.c:12:
> > drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
> > include/linux/bits.h:26:28: warning: comparison of unsigned expression <
> 0 is always false [-Wtype-limits]
> > 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.
>

Indeed it could, it is fixed in -next.

Rikard

> |                            ^
> > include/linux/build_bug.h:16:62: note: in definition of macro
> 'BUILD_BUG_ON_ZERO'
> > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e));
> })))
> > |                                                              ^
> > include/linux/bits.h:39:3: note: in expansion of macro
> 'GENMASK_INPUT_CHECK'
> > 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> > |   ^~~~~~~~~~~~~~~~~~~
> >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro
> 'GENMASK'
> > 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> > |                  ^~~~~~~
> > include/linux/bits.h:26:40: warning: comparison of unsigned expression <
> 0 is always false [-Wtype-limits]
> > 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> > |                                        ^
> > include/linux/build_bug.h:16:62: note: in definition of macro
> 'BUILD_BUG_ON_ZERO'
> > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e));
> })))
> > |                                                              ^
> > include/linux/bits.h:39:3: note: in expansion of macro
> 'GENMASK_INPUT_CHECK'
> > 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> > |   ^~~~~~~~~~~~~~~~~~~
> >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro
> 'GENMASK'
> > 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> > |                  ^~~~~~~
> >
> > vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
>
>
>


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23  9:35       ` Rikard Falkeborn
@ 2020-06-23 10:19         ` John Garry
  2020-06-23 13:55           ` Rikard Falkeborn
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2020-06-23 10:19 UTC (permalink / raw)
  To: Rikard Falkeborn, joro
  Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
	maz, kbuild-all, robin.murphy, linux-arm-kernel

On 23/06/2020 10:35, Rikard Falkeborn wrote:
> 
>     I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
>     h=unsigned value, so I doubt this warn.
> 
>     Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
>     looks
>     like GENMASK_INPUT_CHECK() could be improved.
> 
> 
> Indeed it could, it is fixed in -next.

ok, thanks for the pointer, but I still see this on today's -next with 
this patch:

make W=1 drivers/iommu/arm-smmu-v3.o

In file included from ./include/linux/bits.h:23:0,
                 from ./include/linux/ioport.h:15,
                 from ./include/linux/acpi.h:12,
                 from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’:
./include/linux/bits.h:27:7: warning: comparison of unsigned expression 
< 0 is always false [-Wtype-limits]
   (l) > (h), 0)))
       ^
./include/linux/build_bug.h:16:62: note: in definition of macro 
‘BUILD_BUG_ON_ZERO’
#define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
                                                              ^
./include/linux/bits.h:40:3: note: in expansion of macro 
‘GENMASK_INPUT_CHECK’
  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
   ^~~~~~~~~~~~~~~~~~~
drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’
  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);

That's gcc 7.5.0 .

Cheers,
John


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23 10:19         ` John Garry
@ 2020-06-23 13:55           ` Rikard Falkeborn
  2020-06-26 10:05             ` John Garry
  0 siblings, 1 reply; 25+ messages in thread
From: Rikard Falkeborn @ 2020-06-23 13:55 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
	maz, kbuild-all, robin.murphy, linux-arm-kernel


Den tis 23 juni 2020 12:21John Garry <john.garry@huawei.com> skrev:

> On 23/06/2020 10:35, Rikard Falkeborn wrote:
> >
> >     I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> >     h=unsigned value, so I doubt this warn.
> >
> >     Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
> >     looks
> >     like GENMASK_INPUT_CHECK() could be improved.
> >
> >
> > Indeed it could, it is fixed in -next.
>
> ok, thanks for the pointer, but I still see this on today's -next with
> this patch:
>
> make W=1 drivers/iommu/arm-smmu-v3.o
>
>
Oh, OK, thanks for reporting. I guess different gcc versions have different
behaviour. It looks like we'll have to change the comparison to
(!((h) == (l) || (h) > (l))) instead (not sure I got all the parentheses and
logic correct, but you get the idea).

I'm travelling and wont have time to look at this until next week though.

Rikard



In file included from ./include/linux/bits.h:23:0,
>                  from ./include/linux/ioport.h:15,
>                  from ./include/linux/acpi.h:12,
>                  from drivers/iommu/arm-smmu-v3.c:12:
> drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’:
> ./include/linux/bits.h:27:7: warning: comparison of unsigned expression
> < 0 is always false [-Wtype-limits]
>    (l) > (h), 0)))
>        ^
> ./include/linux/build_bug.h:16:62: note: in definition of macro
> ‘BUILD_BUG_ON_ZERO’
> #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
>                                                               ^
> ./include/linux/bits.h:40:3: note: in expansion of macro
> ‘GENMASK_INPUT_CHECK’
>   (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
>    ^~~~~~~~~~~~~~~~~~~
> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’
>   u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
>
> That's gcc 7.5.0 .
>
> Cheers,
> John
>
>


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23  9:21     ` John Garry
  2020-06-23  9:35       ` Rikard Falkeborn
@ 2020-06-23 16:22       ` Robin Murphy
  2020-06-24  8:15         ` John Garry
  1 sibling, 1 reply; 25+ messages in thread
From: Robin Murphy @ 2020-06-23 16:22 UTC (permalink / raw)
  To: John Garry, kernel test robot, will, rikard.falkeborn
  Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
	linux-arm-kernel

On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
> 
> + Rikard, as the GENMASK compile-time sanity checks were added recently
> 
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use  as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url:    
>> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 
>>
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
>> next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>> In file included from include/linux/bits.h:23,
>> from include/linux/ioport.h:15,
>> from include/linux/acpi.h:12,
>> from drivers/iommu/arm-smmu-v3.c:12:
>> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>> include/linux/bits.h:26:28: warning: comparison of unsigned expression 
>> < 0 is always false [-Wtype-limits]
>> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> 
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and 
> h=unsigned value, so I doubt this warn.
> 
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks 
> like GENMASK_INPUT_CHECK() could be improved.

That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...

Robin.

----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations

Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:

add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function                                     old     new   delta
arm_smmu_init_one_queue                      360     364      +4
arm_smmu_priq_thread                         512     508      -4
arm_smmu_evtq_thread                         300     292      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
queue_remove_raw                             180     164     -16
arm_smmu_gerror_handler                      732     716     -16
arm_smmu_device_probe                       4312    4284     -28
arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
Total: Before=20135, After=20019, chg -0.58%

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
 #define ARM_SMMU_MEMATTR_DEVICE_nGnRE	0x1
 #define ARM_SMMU_MEMATTR_OIWB		0xf
 
-#define Q_IDX(llq, p)			((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p)			((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p)			((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p)			((p) & (llq)->max_n)
 #define Q_OVERFLOW_FLAG			(1U << 31)
 #define Q_OVF(p)			((p) & Q_OVERFLOW_FLAG)
 #define Q_ENT(q, p)			((q)->base +			\
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
 		} atomic;
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
-	u32				max_n_shift;
+	u32				max_n;
 };
 
 struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	cons = Q_IDX(q, q->cons);
 
 	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
-		space = (1 << q->max_n_shift) - (prod - cons);
+		space = q->max_n - (prod - cons);
 	else
 		space = cons - prod;
 
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
 {
 	u32 swidx, sbidx, ewidx, ebidx;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= sprod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= sprod,
 	};
 
 	ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
 {
 	int i;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= prod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= prod,
 	};
 
 	for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	bool owner;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift = cmdq->q.llq.max_n_shift,
+		.max_n = cmdq->q.llq.max_n,
 	}, head = llq;
 	int ret = 0;
 
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	size_t qsz;
 
 	do {
-		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+		qsz = (q->llq.max_n * dwords) << 3;
 		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
 					      GFP_KERNEL);
 		if (q->base || qsz < PAGE_SIZE)
 			break;
 
-		q->llq.max_n_shift--;
+		q->llq.max_n >>= 1;
 	} while (1);
 
 	if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	if (!WARN_ON(q->base_dma & (qsz - 1))) {
 		dev_info(smmu->dev, "allocated %u entries for %s\n",
-			 1 << q->llq.max_n_shift, name);
+			 q->llq.max_n, name);
 	}
 
 	q->prod_reg	= arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	q->q_base  = Q_BASE_RWA;
 	q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
-	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
 
 	q->llq.prod = q->llq.cons = 0;
 	return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 {
 	int ret = 0;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
-	unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
 	atomic_long_t *bitmap;
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
 
-	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+	bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
 	if (!bitmap) {
 		dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
 		ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
-	u32 reg;
+	u32 reg, max_n_shift;
 	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
 
 	/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	}
 
 	/* Queue sizes, capped to ensure natural alignment */
-	smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_CMDQS, reg));
-	if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+	max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+	smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+	if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
 		/*
 		 * We don't support splitting up batches, so one batch of
 		 * commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		return -ENXIO;
 	}
 
-	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_EVTQS, reg));
-	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_PRIQS, reg));
+	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+	smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+	smmu->priq.q.llq.max_n = 1 << max_n_shift;
 
 	/* SID/SSID sizes */
 	smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
-- 
2.23.0.dirty


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23 16:22       ` Robin Murphy
@ 2020-06-24  8:15         ` John Garry
  0 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-06-24  8:15 UTC (permalink / raw)
  To: Robin Murphy, kernel test robot, will, rikard.falkeborn
  Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
	linux-arm-kernel

>>
>> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
>> h=unsigned value, so I doubt this warn.
>>
>> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
>> like GENMASK_INPUT_CHECK() could be improved.
> 
> That said, I think this particular case might be even better off dodging
> GENMASK() entirely, by doing something like this first. Untested...
> 
> Robin.
> 
> ----->8-----
> Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
> 
> Beyond the initial queue setup based on the log2 values from ID
> registers, the log2 queue size is only ever used in the form of
> (1 << max_n_shift) to repeatedly recalculate the number of queue
> elements. Simply storing it in that form leads to slightly more
> efficient code, particularly in the low-level queue accessors
> where it counts most:
> 
> add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
> Function                                     old     new   delta
> arm_smmu_init_one_queue                      360     364      +4
> arm_smmu_priq_thread                         512     508      -4
> arm_smmu_evtq_thread                         300     292      -8
> __arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
> queue_remove_raw                             180     164     -16
> arm_smmu_gerror_handler                      732     716     -16
> arm_smmu_device_probe                       4312    4284     -28
> arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
> Total: Before=20135, After=20019, chg -0.58%
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---

[...]


>   	}
>   
> -	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_EVTQS, reg));
> -	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_PRIQS, reg));
> +	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
> +	smmu->evtq.q.llq.max_n = 1 << max_n_shift;

So I require the bitmask of this for the prod, which would be (max_n << 
1) - 1.

I don't feel too strongly either way, and the other big changes in this 
series need to be considered first...

Thanks,
John

> +
> +	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
> +	smmu->priq.q.llq.max_n = 1 << max_n_shift;
>   
>   	/* SID/SSID sizes */
>   	smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
> 


* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23 13:55           ` Rikard Falkeborn
@ 2020-06-26 10:05             ` John Garry
  0 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-06-26 10:05 UTC (permalink / raw)
  To: Rikard Falkeborn
  Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
	maz, kbuild-all, robin.murphy, linux-arm-kernel

On 23/06/2020 14:55, Rikard Falkeborn wrote:
> Den tis 23 juni 2020 12:21John Garry <john.garry@huawei.com 
> <mailto:john.garry@huawei.com>> skrev:
> 
>     On 23/06/2020 10:35, Rikard Falkeborn wrote:
>      >
>      >     I'd say that GENMASK_INPUT_CHECK() should be able to handle a
>     l=0 and
>      >     h=unsigned value, so I doubt this warn.
>      >
>      >     Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it,
>     but it
>      >     looks
>      >     like GENMASK_INPUT_CHECK() could be improved.
>      >
>      >
>      > Indeed it could, it is fixed in -next.
> 
>     ok, thanks for the pointer, but I still see this on today's -next with
>     this patch:
> 
>     make W=1 drivers/iommu/arm-smmu-v3.o
> 
> 
> Oh, ok thanks for reporting. I guess different gcc versions have 
> different behaviour. I guess we'll have to change the comparison to 
> (!((h) == (l) || (h) > (l))) instead (not sure I got all parenthesis and 
> logic correct but you get the idea).
> 

Yeah, so this looks to fix it:

--- a/include/linux/bits.h
+++ b/include/linux/bits.h
@@ -23,7 +23,8 @@
#include <linux/build_bug.h>
#define GENMASK_INPUT_CHECK(h, l) \
        (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \
-               __builtin_constant_p((l) > (h)), (l) > (h), 0)))
+               __builtin_constant_p(!((h) == (l) ||(h) > (l))), !((h) == (l) ||(h) > (l)), 0)))
+

We may be able to just use (h) == (l) as the const expr to make it more 
concise, but that may be confusing.

I only tested with my toolchain based on 7.5.0

Thanks,
John


* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
                   ` (3 preceding siblings ...)
  2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
@ 2020-07-08 13:00 ` John Garry
  2020-07-16 10:19 ` Will Deacon
  5 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-07-08 13:00 UTC (permalink / raw)
  To: will, robin.murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

On 22/06/2020 18:28, John Garry wrote:

Hi, can you guys let me know if this is on the radar at all?

I have been talking about this performance issue since January, and have
not really been getting any feedback.

thanks

> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes approx 25% of the cycles for this function.
> 
> This series removes that cmpxchg().
> 
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1310 IOPs
> After: 1630 IOPs
> 
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
> 
> CPU count	32	64	128
> Before:		63187	19418	10169
> After:		93287	44789	15862
> 
> (unit is map+unmaps per CPU per second)
> 
> [0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
> 
> John Garry (4):
>    iommu/arm-smmu-v3: Fix trivial typo
>    iommu/arm-smmu-v3: Calculate bits for prod and owner
>    iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
>    iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
> 
>   drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
>   1 file changed, 151 insertions(+), 82 deletions(-)
> 
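The loop in question follows the familiar CAS-retry pattern. A simplified, single-threaded userspace model of it looks like this (the 256-entry queue size and the prod/cons field layout are assumptions for illustration, not the driver's actual layout):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_ENTRIES	256u

static _Atomic uint64_t llq;	/* low 32 bits: prod, high 32 bits: cons */

static bool queue_has_space(uint64_t val, uint32_t n)
{
	uint32_t prod = (uint32_t)val;
	uint32_t cons = (uint32_t)(val >> 32);

	return (prod - cons) + n <= Q_ENTRIES;
}

/* Claim n slots and return our starting prod. Under contention, every
 * failed CAS forces a re-read and retry - the reported hot spot. */
static uint32_t claim(uint32_t n)
{
	uint64_t old = atomic_load_explicit(&llq, memory_order_relaxed);

	for (;;) {
		uint64_t new;

		while (!queue_has_space(old, n))
			old = atomic_load_explicit(&llq, memory_order_relaxed);

		new = old + n;	/* bump prod in the low word */
		if (atomic_compare_exchange_strong(&llq, &old, new))
			return (uint32_t)old;
		/* on failure, 'old' is refreshed; loop and retry */
	}
}
```

Because the cons pointer shares the same 64-bit word, an unrelated cons update is enough to fail the CAS, which is why contention grows so quickly with CPU count.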

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
                   ` (4 preceding siblings ...)
  2020-07-08 13:00 ` [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
@ 2020-07-16 10:19 ` Will Deacon
  2020-07-16 10:22   ` Will Deacon
  5 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2020-07-16 10:19 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes approx 25% of the cycles for this function.
> 
> This series removes that cmpxchg().

How about something much simpler like the diff below?

Will

--->8

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..705d99ec8219 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t			*valid_map;
 	atomic_t			owner_prod;
 	atomic_t			lock;
+	spinlock_t			slock;
 };
 
 struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
 	u32 prod;
 	unsigned long flags;
-	bool owner;
+	bool owner, cas_failed = false;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,27 +1388,35 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
 	do {
 		u64 old;
+		llq.val = READ_ONCE(cmdq->q.llq.val);
 
-		while (!queue_has_space(&llq, n + sync)) {
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (cas_failed)
+			spin_unlock(&cmdq->slock);
+
+		do {
 			local_irq_restore(flags);
 			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
 				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));
 
+try_cas:
 		head.cons = llq.cons;
 		head.prod = queue_inc_prod_n(&llq, n + sync) |
 					     CMDQ_PROD_OWNED_FLAG;
 
 		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
-			break;
+		cas_failed = old != llq.val;
+
+		if (cas_failed)
+			spin_lock(&cmdq->slock);
+	} while (cas_failed);
 
-		llq.val = old;
-	} while (1);
 	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
 	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
 	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3201,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);
 
 	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
 	if (!bitmap) {

* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
  2020-06-23  1:07   ` kernel test robot
@ 2020-07-16 10:20   ` Will Deacon
  2020-07-16 10:26     ` John Garry
  1 sibling, 1 reply; 25+ messages in thread
From: Will Deacon @ 2020-07-16 10:20 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
> It has been shown that the cmpxchg() for finding space in the cmdq can
> be a bottleneck:
> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
> - since the software-maintained cons pointer is updated on the same 64b
>   memory region, the chance of cmpxchg() failure increases again
> 
> The cmpxchg() is removed as part of 2 related changes:
> 
> - Update prod and cmdq owner in a single atomic add operation. For this, we
>   count the prod and owner in separate regions in prod memory.
> 
>   As with simple binary counting, once the prod+wrap fields overflow, they
>   will zero. They should never overflow into "owner" region, and we zero
>   the non-owner, prod region for each owner. This maintains the prod
>   pointer.
> 
>   As for the "owner", we now count this value, instead of setting a flag.
>   Similar to before, once the owner has finished gathering, it will clear
>   a mask. As such, a CPU declares itself as the "owner" when it reads zero
>   for this region. This zeroing will also clear possible overflow in
>   wrap+prod region, above.
> 
>   The owner is now responsible for all cmdq locking to avoid possible
>   deadlock. The owner will lock the cmdq for all non-owners it has gathered
>   when they have space in the queue and have written their entries.
> 
> - Check for space in the cmdq after the prod pointer has been assigned.
> 
>   We don't bother checking for space in the cmdq before assigning the prod
>   pointer, as this would be racy.
> 
>   So since the prod pointer is updated unconditionally, it would be common
>   for no space to be available in the cmdq when prod is assigned - that
>   is, according to the software-maintained prod and cons pointers. So it
>   must now be ensured that the entries are not written until there is
>   space.
> 
>   How the prod pointer is maintained also leads to a strange condition
>   where the prod pointer can wrap past the cons pointer. We can detect this
>   condition, and report no space here. However, a prod pointer progressed
>   twice past the cons pointer cannot be detected. But it can be ensured
>   that this scenario does not occur, as we limit the number of commands
>   any CPU can issue at any given time, such that the prod pointer cannot
>   progress that far.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>  1 file changed, 61 insertions(+), 40 deletions(-)
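The single-atomic-add scheme described in the commit message above can be modeled roughly as follows (the 16-bit prod field and helper names are assumptions for illustration, not the driver's real layout):

```c
#include <stdatomic.h>
#include <stdint.h>

#define PROD_BITS	16
#define PROD_MASK	((1u << PROD_BITS) - 1)
#define OWNER_UNIT	(1u << PROD_BITS)	/* owner count sits above prod */

static _Atomic uint32_t prod_owner;

/* Claim n slots with one atomic add that bumps both the prod field and
 * the owner count; a CPU that reads an owner count of zero becomes the
 * owner of this gathering. */
static uint32_t claim_slots(uint32_t n, int *owner)
{
	uint32_t old = atomic_fetch_add_explicit(&prod_owner, n + OWNER_UNIT,
						 memory_order_relaxed);

	*owner = (old >> PROD_BITS) == 0;
	return old & PROD_MASK;		/* our starting prod index */
}

/* When the owner finishes gathering, it zeroes everything above the prod
 * field, clearing the owner count and any prod overflow into it. */
```

A single `fetch_add` never fails, so the retry loop disappears entirely; the cost moves to the owner, which must later clear the upper region.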

I must admit, you made me smile putting trivial@kernel.org on cc for this ;)

Will

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:19 ` Will Deacon
@ 2020-07-16 10:22   ` Will Deacon
  2020-07-16 10:28     ` Will Deacon
  0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2020-07-16 10:22 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > As mentioned in [0], the CPU may consume many cycles processing
> > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> > get space on the queue takes approx 25% of the cycles for this function.
> > 
> > This series removes that cmpxchg().
> 
> How about something much simpler like the diff below?

Ah, scratch that, I don't drop the lock if we fail the cas with it held.
Let me hack it some more (I have no hardware so I can only build-test this).

Will

* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-07-16 10:20   ` Will Deacon
@ 2020-07-16 10:26     ` John Garry
  0 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-07-16 10:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On 16/07/2020 11:20, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
>> It has been shown that the cmpxchg() for finding space in the cmdq can
>> be a bottleneck:
>> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
>> - since the software-maintained cons pointer is updated on the same 64b
>>    memory region, the chance of cmpxchg() failure increases again
>>
>> The cmpxchg() is removed as part of 2 related changes:
>>
>> - Update prod and cmdq owner in a single atomic add operation. For this, we
>>    count the prod and owner in separate regions in prod memory.
>>
>>    As with simple binary counting, once the prod+wrap fields overflow, they
>>    will zero. They should never overflow into "owner" region, and we zero
>>    the non-owner, prod region for each owner. This maintains the prod
>>    pointer.
>>
>>    As for the "owner", we now count this value, instead of setting a flag.
>>    Similar to before, once the owner has finished gathering, it will clear
>>    a mask. As such, a CPU declares itself as the "owner" when it reads zero
>>    for this region. This zeroing will also clear possible overflow in
>>    wrap+prod region, above.
>>
>>    The owner is now responsible for all cmdq locking to avoid possible
>>    deadlock. The owner will lock the cmdq for all non-owners it has gathered
>>    when they have space in the queue and have written their entries.
>>
>> - Check for space in the cmdq after the prod pointer has been assigned.
>>
>>    We don't bother checking for space in the cmdq before assigning the prod
>>    pointer, as this would be racy.
>>
>>    So since the prod pointer is updated unconditionally, it would be common
>>    for no space to be available in the cmdq when prod is assigned - that
>>    is, according to the software-maintained prod and cons pointers. So it
>>    must now be ensured that the entries are not written until there is
>>    space.
>>
>>    How the prod pointer is maintained also leads to a strange condition
>>    where the prod pointer can wrap past the cons pointer. We can detect this
>>    condition, and report no space here. However, a prod pointer progressed
>>    twice past the cons pointer cannot be detected. But it can be ensured
>>    that this scenario does not occur, as we limit the number of commands
>>    any CPU can issue at any given time, such that the prod pointer cannot
>>    progress that far.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>>   1 file changed, 61 insertions(+), 40 deletions(-)
> 
> I must admit, you made me smile putting trivial@kernel.org on cc for this ;)
> 

Yes, quite ironic :)

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:22   ` Will Deacon
@ 2020-07-16 10:28     ` Will Deacon
  2020-07-16 10:56       ` John Garry
  2020-07-16 13:31       ` John Garry
  0 siblings, 2 replies; 25+ messages in thread
From: Will Deacon @ 2020-07-16 10:28 UTC (permalink / raw)
  To: John Garry
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > > As mentioned in [0], the CPU may consume many cycles processing
> > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> > > get space on the queue takes approx 25% of the cycles for this function.
> > > 
> > > This series removes that cmpxchg().
> > 
> > How about something much simpler like the diff below?
> 
> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
> Let me hack it some more (I have no hardware so I can only build-test this).

Right, second attempt...

Will

--->8

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..e6bcddd6ef69 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t			*valid_map;
 	atomic_t			owner_prod;
 	atomic_t			lock;
+	spinlock_t			slock;
 };
 
 struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
 	u32 prod;
 	unsigned long flags;
-	bool owner;
+	bool owner, locked = false;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
 	do {
 		u64 old;
+		llq.val = READ_ONCE(cmdq->q.llq.val);
 
-		while (!queue_has_space(&llq, n + sync)) {
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (locked)
+			spin_unlock(&cmdq->slock);
+
+		do {
 			local_irq_restore(flags);
 			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
 				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));
 
+try_cas:
 		head.cons = llq.cons;
 		head.prod = queue_inc_prod_n(&llq, n + sync) |
 					     CMDQ_PROD_OWNED_FLAG;
 
 		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
+		if (old != llq.val)
 			break;
 
-		llq.val = old;
+		if (!locked) {
+			spin_lock(&cmdq->slock);
+			locked = true;
+		}
 	} while (1);
+
 	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
 	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
 	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);
 
 	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
 	if (!bitmap) {

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:28     ` Will Deacon
@ 2020-07-16 10:56       ` John Garry
  2020-07-16 11:22         ` Robin Murphy
  2020-07-16 13:31       ` John Garry
  1 sibling, 1 reply; 25+ messages in thread
From: John Garry @ 2020-07-16 10:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On 16/07/2020 11:28, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>>> get space on the queue takes approx 25% of the cycles for this function.
>>>>
>>>> This series removes that cmpxchg().
>>>
>>> How about something much simpler like the diff below?
>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>> Let me hack it some more (I have no hardware so I can only build-test this).
> 
> Right, second attempt...

I can try it, but if performance is not as good, then please check mine 
further (patch 4/4 specifically) - its performance is really good, IMHO.

Thanks,

> 
> Will
> 
> --->8
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677a5c41..e6bcddd6ef69 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>   	atomic_long_t			*valid_map;
>   	atomic_t			owner_prod;
>   	atomic_t			lock;
> +	spinlock_t			slock;
>   };
>   
>   struct arm_smmu_cmdq_batch {
> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   	u64 cmd_sync[CMDQ_ENT_DWORDS];
>   	u32 prod;
>   	unsigned long flags;
> -	bool owner;
> +	bool owner, locked = false;
>   	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>   	struct arm_smmu_ll_queue llq = {
>   		.max_n_shift = cmdq->q.llq.max_n_shift,
> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   
>   	/* 1. Allocate some space in the queue */
>   	local_irq_save(flags);
> -	llq.val = READ_ONCE(cmdq->q.llq.val);
>   	do {
>   		u64 old;
> +		llq.val = READ_ONCE(cmdq->q.llq.val);
>   
> -		while (!queue_has_space(&llq, n + sync)) {
> +		if (queue_has_space(&llq, n + sync))
> +			goto try_cas;
> +
> +		if (locked)
> +			spin_unlock(&cmdq->slock);
> +
> +		do {
>   			local_irq_restore(flags);
>   			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>   				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>   			local_irq_save(flags);
> -		}
> +		} while (!queue_has_space(&llq, n + sync));
>   
> +try_cas:
>   		head.cons = llq.cons;
>   		head.prod = queue_inc_prod_n(&llq, n + sync) |
>   					     CMDQ_PROD_OWNED_FLAG;
>   
>   		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> -		if (old == llq.val)
> +		if (old != llq.val)
>   			break;
>   
> -		llq.val = old;
> +		if (!locked) {
> +			spin_lock(&cmdq->slock);
> +			locked = true;
> +		}
>   	} while (1);
> +
>   	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>   	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>   	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>   
>   	atomic_set(&cmdq->owner_prod, 0);
>   	atomic_set(&cmdq->lock, 0);
> +	spin_lock_init(&cmdq->slock);
>   
>   	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>   	if (!bitmap) {
> .
> 


* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:56       ` John Garry
@ 2020-07-16 11:22         ` Robin Murphy
  2020-07-16 11:30           ` John Garry
  2020-07-16 11:32           ` Will Deacon
  0 siblings, 2 replies; 25+ messages in thread
From: Robin Murphy @ 2020-07-16 11:22 UTC (permalink / raw)
  To: John Garry, Will Deacon
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

On 2020-07-16 11:56, John Garry wrote:
> On 16/07/2020 11:28, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() 
>>>>> loop to
>>>>> get space on the queue takes approx 25% of the cycles for this 
>>>>> function.
>>>>>
>>>>> This series removes that cmpxchg().
>>>>
>>>> How about something much simpler like the diff below?
>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>> Let me hack it some more (I have no hardware so I can only build-test 
>>> this).
>>
>> Right, second attempt...
> 
> I can try it, but if performance is not as good, then please check mine
> further (patch 4/4 specifically) - its performance is really good, IMHO.

Perhaps a silly question (I'm too engrossed in PMU world ATM to get 
properly back up to speed on this), but couldn't this be done without 
cmpxchg anyway? Instinctively it feels like instead of maintaining a 
literal software copy of the prod value, we could resolve the "claim my 
slot in the queue" part with atomic_fetch_add on a free-running 32-bit 
"pseudo-prod" index, then whoever updates the hardware deals with the 
truncation and wrap bit to convert it to an actual register value.

Robin.
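In outline, the free-running pseudo-prod idea could look like this userspace sketch (the 256-entry queue size and the helper names are assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

#define Q_SHIFT		8
#define Q_MASK		((1u << Q_SHIFT) - 1)
#define WRAP_BIT	(1u << Q_SHIFT)

static _Atomic uint32_t pseudo_prod;	/* free-running, never masked */

/* Claim n slots: each caller owns [old, old + n) in pseudo-index space,
 * and the add can never fail, unlike a cmpxchg() loop. */
static uint32_t claim(uint32_t n)
{
	return atomic_fetch_add_explicit(&pseudo_prod, n,
					 memory_order_relaxed);
}

/* Bit Q_SHIFT of the free-running count toggles once per queue pass, so
 * it maps directly onto the wrap bit of the register value. */
static uint32_t to_hw_prod(uint32_t pseudo)
{
	return pseudo & (Q_MASK | WRAP_BIT);
}
```

The truncation and wrap handling is confined to whoever actually writes the hardware register, as suggested above.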

> 
> Thanks,
> 
>>
>> Will
>>
>> --->8
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index f578677a5c41..e6bcddd6ef69 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>>       atomic_long_t            *valid_map;
>>       atomic_t            owner_prod;
>>       atomic_t            lock;
>> +    spinlock_t            slock;
>>   };
>>   struct arm_smmu_cmdq_batch {
>> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
>> arm_smmu_device *smmu,
>>       u64 cmd_sync[CMDQ_ENT_DWORDS];
>>       u32 prod;
>>       unsigned long flags;
>> -    bool owner;
>> +    bool owner, locked = false;
>>       struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>>       struct arm_smmu_ll_queue llq = {
>>           .max_n_shift = cmdq->q.llq.max_n_shift,
>> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
>> arm_smmu_device *smmu,
>>       /* 1. Allocate some space in the queue */
>>       local_irq_save(flags);
>> -    llq.val = READ_ONCE(cmdq->q.llq.val);
>>       do {
>>           u64 old;
>> +        llq.val = READ_ONCE(cmdq->q.llq.val);
>> -        while (!queue_has_space(&llq, n + sync)) {
>> +        if (queue_has_space(&llq, n + sync))
>> +            goto try_cas;
>> +
>> +        if (locked)
>> +            spin_unlock(&cmdq->slock);
>> +
>> +        do {
>>               local_irq_restore(flags);
>>               if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>>                   dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>>               local_irq_save(flags);
>> -        }
>> +        } while (!queue_has_space(&llq, n + sync));
>> +try_cas:
>>           head.cons = llq.cons;
>>           head.prod = queue_inc_prod_n(&llq, n + sync) |
>>                            CMDQ_PROD_OWNED_FLAG;
>>           old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
>> -        if (old == llq.val)
>> +        if (old != llq.val)
>>               break;
>> -        llq.val = old;
>> +        if (!locked) {
>> +            spin_lock(&cmdq->slock);
>> +            locked = true;
>> +        }
>>       } while (1);
>> +
>>       owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>>       head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>>       llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
>> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct 
>> arm_smmu_device *smmu)
>>       atomic_set(&cmdq->owner_prod, 0);
>>       atomic_set(&cmdq->lock, 0);
>> +    spin_lock_init(&cmdq->slock);
>>       bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>>       if (!bitmap) {
>> .
>>
> 

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 11:22         ` Robin Murphy
@ 2020-07-16 11:30           ` John Garry
  2020-07-16 11:32           ` Will Deacon
  1 sibling, 0 replies; 25+ messages in thread
From: John Garry @ 2020-07-16 11:30 UTC (permalink / raw)
  To: Robin Murphy, Will Deacon
  Cc: trivial, maz, Linuxarm, linux-kernel, iommu, linux-arm-kernel

On 16/07/2020 12:22, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
>> On 16/07/2020 11:28, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>>>>> loop to
>>>>>> get space on the queue takes approx 25% of the cycles for this
>>>>>> function.
>>>>>>
>>>>>> This series removes that cmpxchg().
>>>>>
>>>>> How about something much simpler like the diff below?
>>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>>> Let me hack it some more (I have no hardware so I can only build-test
>>>> this).
>>>
>>> Right, second attempt...
>>
>> I can try it, but if performance is not as good, then please check mine
>> further (patch 4/4 specifically) - its performance is really good, IMHO.
> 
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get
> properly back up to speed on this), but couldn't this be done without
> cmpxchg anyway? Instinctively it feels like instead of maintaining a
> literal software copy of the prod value, we could resolve the "claim my
> slot in the queue" part with atomic_fetch_add on a free-running 32-bit
> "pseudo-prod" index, then whoever updates the hardware deals with the
> truncation and wrap bit to convert it to an actual register value.
> 

That's what mine does. But I also need to take care of cmdq locking and 
how we unconditionally provide space.

Cheers,
John

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 11:22         ` Robin Murphy
  2020-07-16 11:30           ` John Garry
@ 2020-07-16 11:32           ` Will Deacon
  2020-07-16 16:50             ` John Garry
  1 sibling, 1 reply; 25+ messages in thread
From: Will Deacon @ 2020-07-16 11:32 UTC (permalink / raw)
  To: Robin Murphy
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, linux-arm-kernel

On Thu, Jul 16, 2020 at 12:22:05PM +0100, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
> > On 16/07/2020 11:28, Will Deacon wrote:
> > > On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
> > > > On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> > > > > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > > > > > As mentioned in [0], the CPU may consume many cycles processing
> > > > > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the
> > > > > > cmpxchg() loop to
> > > > > > get space on the queue takes approx 25% of the cycles
> > > > > > for this function.
> > > > > > 
> > > > > > This series removes that cmpxchg().
> > > > > 
> > > > > How about something much simpler like the diff below?
> > > > Ah, scratch that, I don't drop the lock if we fail the cas with it held.
> > > > Let me hack it some more (I have no hardware so I can only
> > > > build-test this).
> > > 
> > > Right, second attempt...
> > 
> > I can try it, but if performance is not as good, then please check mine
> > further (patch 4/4 specifically) - its performance is really good, IMHO.
> 
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
> back up to speed on this), but couldn't this be done without cmpxchg anyway?
> Instinctively it feels like instead of maintaining a literal software copy
> of the prod value, we could resolve the "claim my slot in the queue" part
> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
> whoever updates the hardware deals with the truncation and wrap bit to
> convert it to an actual register value.

Maybe, but it's easier said than done. The hard part is figuring out
that you have space, and there's no way I'm touching that logic without
a way to test this.

I'm also not keen to complicate the code because of systems that don't scale
well with contended CAS [1]. If you put down loads of cores, you need an
interconnect/coherence protocol that can handle it.

Will

[1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:28     ` Will Deacon
  2020-07-16 10:56       ` John Garry
@ 2020-07-16 13:31       ` John Garry
  1 sibling, 0 replies; 25+ messages in thread
From: John Garry @ 2020-07-16 13:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: trivial, maz, linux-kernel, linuxarm, iommu, robin.murphy,
	linux-arm-kernel

On 16/07/2020 11:28, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>>> get space on the queue takes approx 25% of the cycles for this function.
>>>>
>>>> This series removes that cmpxchg().
>>>
>>> How about something much simpler like the diff below?
>>
>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>> Let me hack it some more (I have no hardware so I can only build-test this).
> 
> Right, second attempt...
> 
> Will

Unfortunately that hangs my machine during boot:

[10.902893] 00:01: ttyS0 at MMIO 0x3f00003f8 (irq = 6, base_baud = 115200) is a 16550A
[10.912048] SuperH (H)SCI(F) driver initialized
[10.916811] msm_serial: driver initialized
[10.921371] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[10.926946] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00000fef)
[10.935374] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 65536 entries for cmdq
[10.942522] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 32768 entries for evtq


> 
> --->8
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677a5c41..e6bcddd6ef69 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>   	atomic_long_t			*valid_map;
>   	atomic_t			owner_prod;
>   	atomic_t			lock;
> +	spinlock_t			slock;
>   };
>   
>   struct arm_smmu_cmdq_batch {
> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   	u64 cmd_sync[CMDQ_ENT_DWORDS];
>   	u32 prod;
>   	unsigned long flags;
> -	bool owner;
> +	bool owner, locked = false;
>   	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>   	struct arm_smmu_ll_queue llq = {
>   		.max_n_shift = cmdq->q.llq.max_n_shift,
> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   
>   	/* 1. Allocate some space in the queue */
>   	local_irq_save(flags);
> -	llq.val = READ_ONCE(cmdq->q.llq.val);
>   	do {
>   		u64 old;
> +		llq.val = READ_ONCE(cmdq->q.llq.val);
>   
> -		while (!queue_has_space(&llq, n + sync)) {
> +		if (queue_has_space(&llq, n + sync))
> +			goto try_cas;
> +
> +		if (locked)
> +			spin_unlock(&cmdq->slock);
> +
> +		do {
>   			local_irq_restore(flags);
>   			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>   				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>   			local_irq_save(flags);
> -		}
> +		} while (!queue_has_space(&llq, n + sync));
>   
> +try_cas:
>   		head.cons = llq.cons;
>   		head.prod = queue_inc_prod_n(&llq, n + sync) |
>   					     CMDQ_PROD_OWNED_FLAG;
>   
>   		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> -		if (old == llq.val)
> +		if (old != llq.val)

Not sure why you changed this. And if I change it back, it seems that we 
could drop out of the loop with cmdq->slock held, so we also need to drop 
the lock.

I tried that and it stops my machine from hanging. Let me know if that was 
the intention, so I can test.

Thanks,
John

>   			break;
>   
> -		llq.val = old;
> +		if (!locked) {
> +			spin_lock(&cmdq->slock);
> +			locked = true;
> +		}
>   	} while (1);
> +
>   	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>   	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>   	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>   
>   	atomic_set(&cmdq->owner_prod, 0);
>   	atomic_set(&cmdq->lock, 0);
> +	spin_lock_init(&cmdq->slock);
>   
>   	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>   	if (!bitmap) {
> .
> 
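The slot-claiming pattern in the diff above (try the cmpxchg first, and once it fails, serialise contenders behind the new slock so at most one waiter hammers the CAS) can be modelled in plain userspace C. This is a rough, hypothetical sketch, not the driver code: the names are illustrative, an atomic_flag stands in for cmdq->slock, and the real driver's queue-wrap, space-check and IRQ handling are omitted.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the quoted diff: claim n slots on a
 * shared producer index with cmpxchg; on contention, take a spinlock
 * (modelling cmdq->slock) so only one waiter retries the CAS at a time.
 */
struct mini_cmdq {
	_Atomic uint32_t prod;	/* software copy of the prod index */
	atomic_flag slock;	/* stands in for cmdq->slock */
};

static uint32_t claim_slots(struct mini_cmdq *q, uint32_t n)
{
	uint32_t old = atomic_load(&q->prod);
	int locked = 0;

	for (;;) {
		/* on failure, compare_exchange reloads `old` for us */
		if (atomic_compare_exchange_weak(&q->prod, &old, old + n))
			break;
		if (!locked) {
			while (atomic_flag_test_and_set(&q->slock))
				;	/* spin until we own the lock */
			locked = 1;
		}
	}
	if (locked)
		atomic_flag_clear(&q->slock);

	return old;	/* start of the region we claimed */
}
```

Each caller gets back the old prod value as the start of its private region, exactly as in the driver's "allocate some space in the queue" step.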


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 11:32           ` Will Deacon
@ 2020-07-16 16:50             ` John Garry
  0 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-07-16 16:50 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy
  Cc: trivial, maz, linuxarm, linux-kernel, iommu, linux-arm-kernel

>>
>> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
>> back up to speed on this), but couldn't this be done without cmpxchg anyway?
>> Instinctively it feels like instead of maintaining a literal software copy
>> of the prod value, we could resolve the "claim my slot in the queue" part
>> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
>> whoever updates the hardware deals with the truncation and wrap bit to
>> convert it to an actual register value.
> 
> Maybe, but it's easier said than done. The hard part is figuring how that
> you have space and there's no way I'm touching that logic without a way to
> test this.
> 
> I'm also not keen to complicate the code because of systems that don't scale
> well with contended CAS [1]. If you put down loads of cores, you need an
> interconnect/coherence protocol that can handle it.

JFYI, I added some debug to the driver to get the cmpxchg() attempt rate 
while running a test harness (below):

cpus	rate
2	1.1
4	1.1
8	1.3
16	3.6
32	8.1
64	12.6
96	14.7

The ideal rate is 1. So it's not crazy high even for many CPUs, but it 
does drift away from 1 as the CPU count grows.

John

> 
> Will
> 
> [1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c
> .
> 
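Robin's free-running-index suggestion, quoted above, could be sketched roughly as follows. This is a hypothetical userspace model with illustrative names: space is claimed wait-free with atomic_fetch_add on a 32-bit "pseudo-prod" counter, and whoever touches the hardware truncates it to an index plus wrap bit. The queue-full check, which Will notes is the hard part, is deliberately left out.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative queue geometry: 2^Q_SHIFT entries, wrap flag just
 * above the index bits, as in the SMMUv3 PROD register layout. */
#define Q_SHIFT		4
#define Q_IDX_MASK	((1u << Q_SHIFT) - 1)
#define Q_WRAP_BIT	(1u << Q_SHIFT)

static _Atomic uint32_t pseudo_prod;	/* free-running, never truncated */

/* Claim n slots; the return value is our free-running start index.
 * No cmpxchg loop: fetch_add cannot fail, so there is no retry rate. */
static uint32_t claim(uint32_t n)
{
	return atomic_fetch_add(&pseudo_prod, n);
}

/* Convert a free-running value into a hardware prod value:
 * low bits are the index, the next bit up is the wrap flag. */
static uint32_t to_hw_prod(uint32_t v)
{
	return v & (Q_IDX_MASK | Q_WRAP_BIT);
}
```

The masking works because a free-running counter taken modulo 2 * queue-depth naturally alternates the wrap bit each time the index wraps.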

//copied from Barry, thanks :)

/* includes added here for completeness; the original snippet omitted them */
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/semaphore.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/string.h>

static int ways = 64;
module_param(ways, int, S_IRUGO);

static int seconds = 20;
module_param(seconds, int, S_IRUGO);

int mappings[NR_CPUS];
struct semaphore sem[NR_CPUS];


#define COMPLETIONS_SIZE 50

static noinline dma_addr_t test_mapsingle(struct device *dev, void *buf,
                                          int size)
{
     dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
     return dma_addr;
}

static noinline void test_unmapsingle(struct device *dev, void *buf,
                                      int size, dma_addr_t dma_addr)
{
     dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
     memcpy(out, in, size);
}

/* just a hack to get a device handle behind an SMMU */
extern struct device *hisi_dev;

static int testthread(void *data)
{
     unsigned long stop = jiffies +seconds*HZ;
     struct device *dev = hisi_dev;
     char *inputs[COMPLETIONS_SIZE];
     char *outputs[COMPLETIONS_SIZE];
     dma_addr_t dma_addr[COMPLETIONS_SIZE];
     int i, cpu = smp_processor_id();
     struct semaphore *sem = data;

     for (i = 0; i < COMPLETIONS_SIZE; i++) {
         inputs[i] = kzalloc(4096, GFP_KERNEL);
         if (!inputs[i])
             return -ENOMEM;
     }

     for (i = 0; i < COMPLETIONS_SIZE; i++) {
         outputs[i] = kzalloc(4096, GFP_KERNEL);
         if (!outputs[i])
             return -ENOMEM;
     }

     while (time_before(jiffies, stop)) {
         for (i = 0; i < COMPLETIONS_SIZE; i++) {
             dma_addr[i] = test_mapsingle(dev, inputs[i], 4096);
             test_memcpy(outputs[i], inputs[i], 4096);
         }
         for (i = 0; i < COMPLETIONS_SIZE; i++) {
             test_unmapsingle(dev, inputs[i], 4096, dma_addr[i]);
         }
         mappings[cpu] += COMPLETIONS_SIZE;
     }

     for (i = 0; i < COMPLETIONS_SIZE; i++) {
         kfree(outputs[i]);
         kfree(inputs[i]);
     }

     up(sem);

     return 0;
}

void smmu_test_core(int cpus)
{
     struct task_struct *tsk;
     int i;
     int total_mappings = 0;

     ways = cpus;

     for (i = 0; i < ways; i++) {
         mappings[i] = 0;
         tsk = kthread_create_on_cpu(testthread, &sem[i], i, "map_test");

         if (IS_ERR(tsk))
             printk(KERN_ERR "create test thread failed\n");
         wake_up_process(tsk);
     }

     for (i = 0; i < ways; i++) {
         down(&sem[i]);
         total_mappings += mappings[i];
     }

     printk(KERN_ERR "finished total_mappings=%d (per way=%d) (rate=%d per second per cpu) ways=%d\n",
            total_mappings, total_mappings / ways,
            total_mappings / (seconds * ways), ways);

}
EXPORT_SYMBOL(smmu_test_core);


static int __init test_init(void)
{
     int i;

     for (i = 0; i < NR_CPUS; i++)
         sema_init(&sem[i], 0);

     return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");


end of thread, other threads:[~2020-07-16 16:52 UTC | newest]

Thread overview: 25+ messages
2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
2020-06-22 17:28 ` [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner John Garry
2020-06-22 17:28 ` [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch John Garry
2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
2020-06-23  1:07   ` kernel test robot
2020-06-23  9:21     ` John Garry
2020-06-23  9:35       ` Rikard Falkeborn
2020-06-23 10:19         ` John Garry
2020-06-23 13:55           ` Rikard Falkeborn
2020-06-26 10:05             ` John Garry
2020-06-23 16:22       ` Robin Murphy
2020-06-24  8:15         ` John Garry
2020-07-16 10:20   ` Will Deacon
2020-07-16 10:26     ` John Garry
2020-07-08 13:00 ` [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-07-16 10:19 ` Will Deacon
2020-07-16 10:22   ` Will Deacon
2020-07-16 10:28     ` Will Deacon
2020-07-16 10:56       ` John Garry
2020-07-16 11:22         ` Robin Murphy
2020-07-16 11:30           ` John Garry
2020-07-16 11:32           ` Will Deacon
2020-07-16 16:50             ` John Garry
2020-07-16 13:31       ` John Garry
