[PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

iommu.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
@ 2020-08-21 13:54 John Garry
  2020-08-21 13:54 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch John Garry
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: John Garry @ 2020-08-21 13:54 UTC (permalink / raw)
  To: will, robin.murphy; +Cc: maz, linux-kernel, linuxarm, iommu, linux-arm-kernel

As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
get space on the queue takes a lot of time once we start getting many
CPUs contending - from experiment, for 64 CPUs contending the cmdq,
success rate is ~ 1 in 12, which is poor, but not totally awful.

This series removes that cmpxchg() and replaces with an atomic_add,
same as how the actual cmdq deals with maintaining the prod pointer.

For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1250K IOPs
After: 1550K IOPs

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count	8	16	32	64
Before:		282K	115K	36K	11K
After:		302K	193K	80K	30K

(unit is map+unmaps per CPU per second)

[0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

Differences to v1:
- Simplify by dropping patch to always issue a CMD_SYNC
- Use 64b atomic add, keeping prod in a separate 32b field

John Garry (2):
  iommu/arm-smmu-v3: Calculate max commands per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++++++++++++++------
 1 file changed, 114 insertions(+), 52 deletions(-)

-- 
2.26.2

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch
  2020-08-21 13:54 [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
@ 2020-08-21 13:54 ` John Garry
  2020-08-21 13:54 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: John Garry @ 2020-08-21 13:54 UTC (permalink / raw)
  To: will, robin.murphy; +Cc: maz, linux-kernel, linuxarm, iommu, linux-arm-kernel

Calculate the batch size limit such that all CPUs in the system cannot
issue so many commands as to fill the command queue.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 46 +++++++++++++++++++--
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..a705fa3e18ea 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -532,6 +532,7 @@ struct arm_smmu_ll_queue {
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
 	u32				max_n_shift;
+	u32				max_cmd_per_batch;
 };
 
 struct arm_smmu_queue {
@@ -1515,7 +1516,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 				    struct arm_smmu_cmdq_batch *cmds,
 				    struct arm_smmu_cmdq_ent *cmd)
 {
-	if (cmds->num == CMDQ_BATCH_ENTRIES) {
+	struct arm_smmu_cmdq *q = &smmu->cmdq;
+	struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+	if (cmds->num == llq->max_cmd_per_batch) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
 		cmds->num = 0;
 	}
@@ -3177,6 +3181,41 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	return 0;
 }
 
+static int arm_smmu_init_cmd_queue(struct arm_smmu_device *smmu,
+				   struct arm_smmu_queue *q,
+				   unsigned long prod_off,
+				   unsigned long cons_off,
+				   size_t dwords)
+{
+	u32 cpus = num_possible_cpus(), entries_for_prod;
+	int ret;
+
+	ret = arm_smmu_init_one_queue(smmu, q, prod_off, cons_off, dwords,
+				      "cmdq");
+	if (ret)
+		return ret;
+
+	entries_for_prod = 1 << q->llq.max_n_shift;
+
+	/*
+	 * We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+	 * whatever else).
+	 */
+	if (entries_for_prod < 2 * cpus) {
+		dev_err(smmu->dev, "command queue size too small, suggest reduce #CPUs\n");
+		return -ENXIO;
+	}
+
+	/*
+	 * When finding max_cmd_per_batch, deduct 1 entry per batch to take
+	 * account of a CMD_SYNC being issued also.
+	 */
+	q->llq.max_cmd_per_batch = min((entries_for_prod / cpus) - 1,
+				       (u32)CMDQ_BATCH_ENTRIES);
+
+	return 0;
+}
+
 static void arm_smmu_cmdq_free_bitmap(void *data)
 {
 	unsigned long *bitmap = data;
@@ -3210,9 +3249,8 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	int ret;
 
 	/* cmdq */
-	ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
-				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS,
-				      "cmdq");
+	ret = arm_smmu_init_cmd_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
+				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS);
 	if (ret)
 		return ret;
 
-- 
2.26.2

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-08-21 13:54 [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
  2020-08-21 13:54 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch John Garry
@ 2020-08-21 13:54 ` John Garry
  2020-09-01 11:17 ` [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency Song Bao Hua (Barry Song)
  2020-09-21 13:43 ` Will Deacon
  3 siblings, 0 replies; 8+ messages in thread
From: John Garry @ 2020-08-21 13:54 UTC (permalink / raw)
  To: will, robin.murphy; +Cc: maz, linux-kernel, linuxarm, iommu, linux-arm-kernel

It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- for more CPUs contending the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of 2 related changes:

- Update prod and cmdq owner in a single atomic64 add operation.

  The prod value is updated in one 32b region, while the "owner" is updated
  in another 32b region.

  For the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself as the "owner" when it reads zero
  for this region.

  The owner is now responsible for all cmdq locking to avoid possible
  deadlock. The owner will lock the cmdq for all non-owers it has gathered
  when they have space in the queue and have written their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be super racy.

  So since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that
  is, according the software-maintained prod and cons pointer. So now
  it must be ensured that the entries are not yet written and not until
  there is space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect this
  condition, and report no space here. However, a prod pointer progressed
  twice past the cons pointer cannot be detected. But it can be ensured
  that this that this scenario does not occur, as we limit the amount of
  commands any CPU can issue at any given time, such that we cannot
  progress prod pointer further.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 120 ++++++++++++--------
 1 file changed, 72 insertions(+), 48 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a705fa3e18ea..d3c1400c644c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -521,16 +521,17 @@ struct arm_smmu_cmdq_ent {
 struct arm_smmu_ll_queue {
 	union {
 		u64			val;
+		atomic64_t		atomic;
 		struct {
+			struct {
+				u16	sync;
+				u16	count;
+			};
 			u32		prod;
-			u32		cons;
 		};
-		struct {
-			atomic_t	prod;
-			atomic_t	cons;
-		} atomic;
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
+	u32				cons;
 	u32				max_n_shift;
 	u32				max_cmd_per_batch;
 };
@@ -771,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	prod = Q_IDX(q, q->prod);
 	cons = Q_IDX(q, q->cons);
 
-	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+		/* check if we have wrapped, meaning definitely no space */
+		if (cons > prod)
+			return false;
+
 		space = (1 << q->max_n_shift) - (prod - cons);
-	else
+	} else {
+		/* similar check to above */
+		if (prod > cons)
+			return false;
+
 		space = cons - prod;
+	}
 
 	return space >= n;
 }
@@ -1072,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  *   fails if the caller appears to be the last lock holder (yes, this is
  *   racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
 	int val;
 
@@ -1082,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 	 * to INT_MIN so these increments won't hurt as the value will remain
 	 * negative.
 	 */
-	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+	if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
 		return;
 
 	do {
 		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1236,13 +1246,14 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
 	if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
 		WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
 		arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
-		llq->val = READ_ONCE(cmdq->q.llq.val);
+		llq->cons = READ_ONCE(cmdq->q.llq.cons);
 		return 0;
 	}
 
 	queue_poll_init(smmu, &qp);
 	do {
-		llq->val = READ_ONCE(smmu->cmdq.q.llq.val);
+		llq->prod = READ_ONCE(smmu->cmdq.q.llq.prod);
+		llq->cons = READ_ONCE(smmu->cmdq.q.llq.cons);
 		if (!queue_full(llq))
 			break;
 
@@ -1289,7 +1300,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 	int ret = 0;
 
 	queue_poll_init(smmu, &qp);
-	llq->val = READ_ONCE(smmu->cmdq.q.llq.val);
+	llq->cons = READ_ONCE(smmu->cmdq.q.llq.cons);
 	do {
 		if (queue_consumed(llq, prod))
 			break;
@@ -1383,53 +1394,46 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
-	}, head = llq;
+	}, head = llq, space = {
+		.max_n_shift = cmdq->q.llq.max_n_shift,
+		/* We would need 2^16 CPUs gathered for overflow ... */
+		.count = 1,
+		.sync = sync,
+		.prod = n + sync,
+	};
+	u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
 	int ret = 0;
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
-	do {
-		u64 old;
 
-		while (!queue_has_space(&llq, n + sync)) {
-			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
-			local_irq_save(flags);
-		}
+	llq.val = atomic64_fetch_add(space.val, &cmdq->q.llq.atomic);
+	llq.prod &= prod_mask;
+	owner = !llq.count;
+	head.prod = queue_inc_prod_n(&llq, n + sync);
 
-		head.cons = llq.cons;
-		head.prod = queue_inc_prod_n(&llq, n + sync) |
-					     CMDQ_PROD_OWNED_FLAG;
-
-		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
-			break;
+	/*
+	 * Ensure it's safe to write the entries. For this, we need to ensure
+	 * that there is space in the queue from our prod pointer.
+	 */
+	space.cons = READ_ONCE(cmdq->q.llq.cons);
+	space.prod = llq.prod;
+	while (!queue_has_space(&space, n + sync)) {
+		if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
+			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 
-		llq.val = old;
-	} while (1);
-	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
-	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
-	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+		space.prod = llq.prod;
+	}
 
 	/*
 	 * 2. Write our commands into the queue
-	 * Dependency ordering from the cmpxchg() loop above.
+	 * Dependency ordering from the space-checking while loop, above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
 	if (sync) {
 		prod = queue_inc_prod_n(&llq, n);
 		arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
 		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
-
-		/*
-		 * In order to determine completion of our CMD_SYNC, we must
-		 * ensure that the queue can't wrap twice without us noticing.
-		 * We achieve that by taking the cmdq lock as shared before
-		 * marking our slot as valid.
-		 */
-		arm_smmu_cmdq_shared_lock(cmdq);
 	}
 
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
@@ -1438,13 +1442,20 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 4. If we are the owner, take control of the SMMU hardware */
 	if (owner) {
+		struct arm_smmu_ll_queue tmp = {
+			.sync = ~0,
+			.count = ~0,
+			.prod = ~prod_mask,
+		};
+		int sync_count;
 		/* a. Wait for previous owner to finish */
 		atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
 
-		/* b. Stop gathering work by clearing the owned flag */
-		prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG,
-						   &cmdq->q.llq.atomic.prod);
-		prod &= ~CMDQ_PROD_OWNED_FLAG;
+		/* b. Stop gathering work by clearing the owned mask */
+		tmp.val = atomic64_fetch_andnot_relaxed(tmp.val,
+						       &cmdq->q.llq.atomic);
+		prod = Q_WRP(&llq, tmp.prod) | Q_IDX(&llq, tmp.prod);
+		sync_count = tmp.sync;
 
 		/*
 		 * c. Wait for any gathered work to be written to the queue.
@@ -1453,6 +1464,19 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		 */
 		arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
 
+		/*
+		 * In order to determine completion of the CMD_SYNCs, we must
+		 * ensure that the queue can't wrap twice without us noticing.
+		 * We achieve that by taking the cmdq lock as shared before
+		 * progressing the prod pointer.
+		 * The owner does this for all the non-owners it has gathered.
+		 * Otherwise, some non-owner(s) may lock the cmdq, blocking the
+		 * cons being updating. This could be when the cmdq has just
+		 * become full. In this case, other sibling non-owners could be
+		 * blocked from progressing, leading to deadlock.
+		 */
+		arm_smmu_cmdq_shared_lock(cmdq, sync_count);
+
 		/*
 		 * d. Advance the hardware prod pointer
 		 * Control dependency ordering from the entries becoming valid.
-- 
2.26.2

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* RE: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-08-21 13:54 [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
  2020-08-21 13:54 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch John Garry
  2020-08-21 13:54 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
@ 2020-09-01 11:17 ` Song Bao Hua (Barry Song)
  2020-09-21 13:43 ` Will Deacon
  3 siblings, 0 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-09-01 11:17 UTC (permalink / raw)
  To: John Garry, will, robin.murphy
  Cc: maz, linux-kernel, Linuxarm, iommu, linux-arm-kernel



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of John Garry
> Sent: Saturday, August 22, 2020 1:54 AM
> To: will@kernel.org; robin.murphy@arm.com
> Cc: joro@8bytes.org; linux-arm-kernel@lists.infradead.org;
> iommu@lists.linux-foundation.org; maz@kernel.org; Linuxarm
> <linuxarm@huawei.com>; linux-kernel@vger.kernel.org; John Garry
> <john.garry@huawei.com>
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
> 
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
> 
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
> 
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
> 
> CPU count	8	16	32	64
> Before:		282K	115K	36K	11K
> After:		302K	193K	80K	30K
> 
> (unit is map+unmaps per CPU per second)

I have seen performance improvement on hns3 network by sending UDP with 1-32 threads:

Threads number        1        4           8         16       32
Before patch(TX Mbps)  7636.05  16444.36  21694.48  25746.40   25295.93
After  patch(TX Mbps)  7711.60  16478.98  26561.06  32628.75   33764.56

As you can see, for 8,16,32 threads, network TX throughput improve much. For 1 and 4 threads,
Tx throughput is almost seem before and after patch. This should be sensible as this patch
is mainly for decreasing the lock contention.

> 
> [0]
> https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD2
> 4B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e
> 685757c27e39c7cbde3
> 
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
> 
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in
> arm_smmu_cmdq_issue_cmdlist()
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166
> ++++++++++++++------
>  1 file changed, 114 insertions(+), 52 deletions(-)
> 
> --
> 2.26.2

Thanks
Barry

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-08-21 13:54 [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
                   ` (2 preceding siblings ...)
  2020-09-01 11:17 ` [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency Song Bao Hua (Barry Song)
@ 2020-09-21 13:43 ` Will Deacon
  2020-09-21 13:58   ` John Garry
  3 siblings, 1 reply; 8+ messages in thread
From: Will Deacon @ 2020-09-21 13:43 UTC (permalink / raw)
  To: John Garry
  Cc: maz, linuxarm, linux-kernel, iommu, robin.murphy, linux-arm-kernel

On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many
> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
> success rate is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add,
> same as how the actual cmdq deals with maintaining the prod pointer.

I'm still not a fan of this. Could you try to adapt the hacks I sent before,
please? I know they weren't quite right (I have no hardware to test on), but
the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
queueing in the spinlock implementation should avoid the contention.

Thanks,

Will
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-09-21 13:43 ` Will Deacon
@ 2020-09-21 13:58   ` John Garry
  2020-09-23 14:47     ` John Garry
  2020-11-13 10:43     ` John Garry
  0 siblings, 2 replies; 8+ messages in thread
From: John Garry @ 2020-09-21 13:58 UTC (permalink / raw)
  To: Will Deacon
  Cc: maz, linuxarm, linux-kernel, iommu, robin.murphy, linux-arm-kernel

On 21/09/2020 14:43, Will Deacon wrote:
> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>> As mentioned in [0], the CPU may consume many cycles processing
>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>> get space on the queue takes a lot of time once we start getting many
>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>
>> This series removes that cmpxchg() and replaces with an atomic_add,
>> same as how the actual cmdq deals with maintaining the prod pointer.
>  > I'm still not a fan of this.

:(

> Could you try to adapt the hacks I sent before,
> please? I know they weren't quite right (I have no hardware to test on), but
> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
> queueing in the spinlock implementation should avoid the contention.

OK, so if you're asking me to try this again, then I can do that, and 
see what it gives us.

John

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-09-21 13:58   ` John Garry
@ 2020-09-23 14:47     ` John Garry
  2020-11-13 10:43     ` John Garry
  1 sibling, 0 replies; 8+ messages in thread
From: John Garry @ 2020-09-23 14:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: maz, linuxarm, linux-kernel, iommu, robin.murphy, linux-arm-kernel

On 21/09/2020 14:58, John Garry wrote:
> 
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test 
>> on

Could the ARM Rev C FVP be used to at least functionally test? Can't 
seem to access myself, even though it's gratis...

), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
> 

So I modified that suggested change to get it functioning, and it looks 
like this:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..f907b7c233a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
  	atomic_long_t			*valid_map;
  	atomic_t			owner_prod;
  	atomic_t			lock;
+	spinlock_t			slock;
  };

  struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,
  	u64 cmd_sync[CMDQ_ENT_DWORDS];
  	u32 prod;
  	unsigned long flags;
-	bool owner;
+	bool owner, locked = false;
  	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
  	struct arm_smmu_ll_queue llq = {
  		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,26 +1388,42 @@ static int arm_smmu_cmdq_issue_cmdlist(struct 
arm_smmu_device *smmu,

  	/* 1. Allocate some space in the queue */
  	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
  	do {
  		u64 old;

-		while (!queue_has_space(&llq, n + sync)) {
+		llq.val = READ_ONCE(cmdq->q.llq.val);
+
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (locked) {
+			spin_unlock(&cmdq->slock);
+			locked = 0; // added
+		}
+
+		do {
  			local_irq_restore(flags);
  			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
  				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
  			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));

+try_cas:
  		head.cons = llq.cons;
  		head.prod = queue_inc_prod_n(&llq, n + sync) |
  					     CMDQ_PROD_OWNED_FLAG;

  		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
+		if (old == llq.val) { // was if (old != llq.val)
+			if (locked)   //           break;
+				spin_unlock(&cmdq->slock);//
  			break;//
+		}//

-		llq.val = old;
+		if (!locked) {
+			spin_lock(&cmdq->slock);
+			locked = true;
+		}
  	} while (1);
  	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
  	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3209,7 @@ static int arm_smmu_cmdq_init(struct 
arm_smmu_device *smmu)

  	atomic_set(&cmdq->owner_prod, 0);
  	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);

  	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
  	if (!bitmap) {
-- 
2.26.2

I annotated my mods with comments. Maybe those mods would not be as you 
intend.

So I'm not sure that we solve the problem of a new CPU coming along and 
trying the cmpxchg immediately, while another CPU has the slock and will 
try the cmpxchg also.

Anyway, the results are a bit mixed depending on the CPU count, but 
generally positive compared to mainline:

CPUs		2	4	8	16	32	64	96
v5.9-rc1	453K	409K	295K	157K	33.6K	9.5K	5.2K
Will's change	459K	414K	281K	131K	44K	15.5K	8.6K
$subject change	481K	406K	305K	190K	81K	30K	18.7K

(Unit is DMA map+unmap per CPU per second, using test harness. Higher is 
better.)

Please let me know of any way to progress.

Thanks,
John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-09-21 13:58   ` John Garry
  2020-09-23 14:47     ` John Garry
@ 2020-11-13 10:43     ` John Garry
  1 sibling, 0 replies; 8+ messages in thread
From: John Garry @ 2020-11-13 10:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: maz, linuxarm, linux-kernel, iommu, robin.murphy, linux-arm-kernel

On 21/09/2020 14:58, John Garry wrote:
> On 21/09/2020 14:43, Will Deacon wrote:
>> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>>> As mentioned in [0], the CPU may consume many cycles processing
>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() 
>>> loop to
>>> get space on the queue takes a lot of time once we start getting many
>>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>>
>>> This series removes that cmpxchg() and replaces with an atomic_add,
>>> same as how the actual cmdq deals with maintaining the prod pointer.
>>  > I'm still not a fan of this.
> 
> :(
> 
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test 
>> on), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
> 
> OK, so if you're asking me to try this again, then I can do that, and 
> see what it gives us.
> 

JFYI, to prove that this is not a problem which affects only our HW, I 
managed to test an arm64 platform from another vendor. Generally I see 
the same issue, and this patchset actually helps that platform even more.

		CPUs	Before	After	% Increase
Huawei D06	8	282K	302K	7%
Other			379K	420K	11%

Huawei D06	16	115K	193K	68K
Other			102K	291K	185K

Huawei D06	32	36K	80K	122%
Other			41K	156K	280%

Huawei D06	64	11K	30K	172%
Other			6K	47K	683%

I tested with something like [1], so unit is map+unmaps per cpu per 
second - higher is better.

My D06 is memory poor, so would expect higher results otherwise (with 
more memory). Indeed, my D05 has memory on all nodes and performs better.

Anyway, I see that the implementation here is not perfect, and I could 
not get suggested approach to improve performance significantly. So back 
to the drawing board...

Thanks,
John

[1] 
https://lore.kernel.org/linux-iommu/20201102080646.2180-1-song.bao.hua@hisilicon.com/

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-11-13 10:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21 13:54 [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-08-21 13:54 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch John Garry
2020-08-21 13:54 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
2020-09-01 11:17 ` [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency Song Bao Hua (Barry Song)
2020-09-21 13:43 ` Will Deacon
2020-09-21 13:58   ` John Garry
2020-09-23 14:47     ` John Garry
2020-11-13 10:43     ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).