* [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm,
maz, John Garry
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg()
loop used to get space on the queue takes approx 25% of the cycles of this
function. This series removes that cmpxchg().
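For reference, below is a condensed sketch of the allocation loop being
removed (based on the current arm_smmu_cmdq_issue_cmdlist(); the
space-polling and IRQ handling are omitted). Every CPU re-reads the shared
64b prod+cons word and retries until its cmpxchg() wins, so under
contention most iterations are wasted work:

	llq.val = READ_ONCE(cmdq->q.llq.val);
	do {
		u64 old;

		/* ... poll until the queue has space for n + sync ... */

		head.cons = llq.cons;
		head.prod = queue_inc_prod_n(&llq, n + sync) |
			    CMDQ_PROD_OWNED_FLAG;

		/* Fails whenever another CPU updates prod _or_ cons */
		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
		if (old == llq.val)
			break;

		llq.val = old;
	} while (1);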
For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1310 IOPS
After: 1630 IOPS
I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count:   32      64      128
Before:      63187   19418   10169
After:       93287   44789   15862

(unit: map+unmaps per CPU per second)
[0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
John Garry (4):
iommu/arm-smmu-v3: Fix trivial typo
iommu/arm-smmu-v3: Calculate bits for prod and owner
iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
1 file changed, 151 insertions(+), 82 deletions(-)
--
2.26.2
* [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm,
maz, John Garry
Set "cmq" -> "cmdq".
Signed-off-by: John Garry <john.garry@huawei.com>
---
drivers/iommu/arm-smmu-v3.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..a8e814c652fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1479,7 +1479,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
}
/*
- * Try to unlock the cmq lock. This will fail if we're the last
+ * Try to unlock the cmdq lock. This will fail if we're the last
* reader, in which case we can safely update cmdq->q.llq.cons
*/
if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
--
2.26.2
* [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm,
maz, John Garry
Since arm_smmu_ll_queue.prod will hold both the "owner" count and the HW
prod pointer, calculate how many bits are available for, and used by, each.
This is based on the number of possible CPUs in the system, and we require
that each CPU can issue a minimum of 2 commands per batch: 1 x CMD_SYNC
and at least 1 x other command.
Ignoring the limits of the HW max_n_shift and the HW cmdq memory
allocation, approx 32K CPUs is the maximum supported. For this,
max_n_shift would be 15.
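As a worked example (hypothetical numbers: 128 possible CPUs and a
HW-supported max_n_shift of 15), the calculation introduced below gives:

	bits_for_cmdq_owner     = ilog2(128) + 1 = 8
	bits_available_for_prod = 32 - 1 - 8     = 23
	max_n_shift             = min(15, 23)    = 15
	entries_for_prod        = 1 << 15        = 32768 (>= 2 * 128, ok)
	max_cmd_per_batch       = min(32768 / 128 - 1, CMDQ_BATCH_ENTRIES)
	owner_count_shift       = 15 + 2         = 17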
Signed-off-by: John Garry <john.garry@huawei.com>
---
drivers/iommu/arm-smmu-v3.c | 64 ++++++++++++++++++++++++++++++++++---
1 file changed, 60 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index a8e814c652fe..4e9677b066f1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -532,6 +532,8 @@ struct arm_smmu_ll_queue {
u8 __pad[SMP_CACHE_BYTES];
} ____cacheline_aligned_in_smp;
u32 max_n_shift;
+ u32 max_cmd_per_batch;
+ u32 owner_count_shift;
};
struct arm_smmu_queue {
@@ -1515,7 +1517,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmdq_ent *cmd)
{
- if (cmds->num == CMDQ_BATCH_ENTRIES) {
+ struct arm_smmu_cmdq *q = &smmu->cmdq;
+ struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+ if (cmds->num == llq->max_cmd_per_batch) {
arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
cmds->num = 0;
}
@@ -3177,6 +3182,58 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
return 0;
}
+static int arm_smmu_init_cmd_queue(struct arm_smmu_device *smmu,
+ struct arm_smmu_queue *q,
+ unsigned long prod_off,
+ unsigned long cons_off,
+ size_t dwords)
+{
+ u32 cpus = num_possible_cpus(), bits_for_cmdq_owner,
+ bits_available_for_prod, entries_for_prod;
+ int ret;
+
+ /*
+ * We can get the number of bits required for owner counting by
+ * log2(nr possible cpus) + 1
+ */
+ bits_for_cmdq_owner = ilog2(cpus) + 1;
+
+ /*
+ * Add an extra bit to ensure prod(+wrap) do not overflow into
+ * owner count.
+ */
+ bits_available_for_prod = 32 - 1 - bits_for_cmdq_owner;
+
+ if (bits_available_for_prod < 1) /* How many CPUs??? */
+ return -ENOMEM;
+
+ q->llq.max_n_shift = min(q->llq.max_n_shift, bits_available_for_prod);
+
+ ret = arm_smmu_init_one_queue(smmu, q, prod_off, cons_off, dwords,
+ "cmdq");
+ if (ret)
+ return ret;
+
+ entries_for_prod = 1 << q->llq.max_n_shift;
+
+ /*
+ * We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+ * whatever else).
+ */
+ if (entries_for_prod < 2 * cpus)
+ return -ENOMEM;
+
+ /*
+ * When finding max_cmd_per_batch, deduct 1 entry per batch to take
+ * account of CMD_SYNC
+ */
+ q->llq.max_cmd_per_batch = min((entries_for_prod / cpus) - 1,
+ (u32)CMDQ_BATCH_ENTRIES);
+ q->llq.owner_count_shift = q->llq.max_n_shift + 2;
+
+ return 0;
+}
+
static void arm_smmu_cmdq_free_bitmap(void *data)
{
unsigned long *bitmap = data;
@@ -3210,9 +3267,8 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
int ret;
/* cmdq */
- ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
- ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS,
- "cmdq");
+ ret = arm_smmu_init_cmd_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
+ ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS);
if (ret)
return ret;
--
2.26.2
* [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm,
maz, John Garry
To ensure that a CPU does not send more than its permitted number of
commands to the cmdq, ensure that each batch includes a CMD_SYNC. When
issuing a CMD_SYNC, we always wait for the consumption of its batch of
commands - as such, we guarantee that no CPU will ever have more than its
permitted number of commands in flight.
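The visible effect on callers is that arm_smmu_cmdq_issue_cmd() now
implies a sync, so the separate arm_smmu_cmdq_issue_sync() calls can be
dropped, e.g.:

	/* Before */
	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
	arm_smmu_cmdq_issue_sync(smmu);

	/* After: a CMD_SYNC is appended and waited for internally */
	arm_smmu_cmdq_issue_cmd(smmu, &cmd);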
Signed-off-by: John Garry <john.garry@huawei.com>
---
drivers/iommu/arm-smmu-v3.c | 86 +++++++++++++++++--------------------
1 file changed, 39 insertions(+), 47 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4e9677b066f1..45a39ccaf455 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1373,11 +1373,15 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
* - Command insertion is totally ordered, so if two CPUs each race to
* insert their own list of commands then all of the commands from one
* CPU will appear before any of the commands from the other CPU.
+ *
+ * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
+ * more than the permitted number of commands at once.
*/
static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
- u64 *cmds, int n, bool sync)
+ u64 *cmds, int n)
{
u64 cmd_sync[CMDQ_ENT_DWORDS];
+ const int sync = 1;
u32 prod;
unsigned long flags;
bool owner;
@@ -1419,19 +1423,17 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
* Dependency ordering from the cmpxchg() loop above.
*/
arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
- if (sync) {
- prod = queue_inc_prod_n(&llq, n);
- arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
- queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
- /*
- * In order to determine completion of our CMD_SYNC, we must
- * ensure that the queue can't wrap twice without us noticing.
- * We achieve that by taking the cmdq lock as shared before
- * marking our slot as valid.
- */
- arm_smmu_cmdq_shared_lock(cmdq);
- }
+ prod = queue_inc_prod_n(&llq, n);
+ arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
+ queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
+
+ /*
+ * In order to determine completion of our CMD_SYNC, we must ensure
+ * that the queue can't wrap twice without us noticing. We achieve that
+ * by taking the cmdq lock as shared before marking our slot as valid.
+ */
+ arm_smmu_cmdq_shared_lock(cmdq);
/* 3. Mark our slots as valid, ensuring commands are visible first */
dma_wmb();
@@ -1468,26 +1470,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
atomic_set_release(&cmdq->owner_prod, prod);
}
- /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
- if (sync) {
- llq.prod = queue_inc_prod_n(&llq, n);
- ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
- if (ret) {
- dev_err_ratelimited(smmu->dev,
- "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
- llq.prod,
- readl_relaxed(cmdq->q.prod_reg),
- readl_relaxed(cmdq->q.cons_reg));
- }
+ /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
+ llq.prod = queue_inc_prod_n(&llq, n);
+ ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
+ if (ret)
+ dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
+ llq.prod, readl_relaxed(cmdq->q.prod_reg),
+ readl_relaxed(cmdq->q.cons_reg));
- /*
- * Try to unlock the cmdq lock. This will fail if we're the last
- * reader, in which case we can safely update cmdq->q.llq.cons
- */
- if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
- WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
- arm_smmu_cmdq_shared_unlock(cmdq);
- }
+ /*
+ * Try to unlock the cmdq lock. This will fail if we're the last reader,
+ * in which case we can safely update cmdq->q.llq.cons
+ */
+ if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
+ WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
+ arm_smmu_cmdq_shared_unlock(cmdq);
}
local_irq_restore(flags);
@@ -1505,12 +1502,7 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
return -EINVAL;
}
- return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
-}
-
-static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
- return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
+ return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1);
}
static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
@@ -1521,7 +1513,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
struct arm_smmu_ll_queue *llq = &q->q.llq;
if (cmds->num == llq->max_cmd_per_batch) {
- arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
+ arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
cmds->num = 0;
}
arm_smmu_cmdq_build_cmd(&cmds->cmds[cmds->num * CMDQ_ENT_DWORDS], cmd);
@@ -1531,7 +1523,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds)
{
- return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, true);
+ return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
}
/* Context descriptor manipulation functions */
@@ -1803,7 +1795,6 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
};
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
}
static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -2197,17 +2188,21 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
{
- int i;
+ int i, ret = 0;
struct arm_smmu_cmdq_ent cmd;
arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
for (i = 0; i < master->num_sids; i++) {
+ int rc;
+
cmd.atc.sid = master->sids[i];
- arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+ rc = arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+ if (rc)
+ ret = rc;
}
- return arm_smmu_cmdq_issue_sync(master->smmu);
+ return ret;
}
static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
@@ -2280,7 +2275,6 @@ static void arm_smmu_tlb_inv_context(void *cookie)
* careful, 007.
*/
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
}
@@ -3667,7 +3661,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
/* Invalidate any cached configuration */
cmd.opcode = CMDQ_OP_CFGI_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
/* Invalidate any stale TLB entries */
if (smmu->features & ARM_SMMU_FEAT_HYP) {
@@ -3677,7 +3670,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
/* Event queue */
writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
--
2.26.2
* [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
From: John Garry @ 2020-06-22 17:28 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm,
maz, John Garry
It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- as more CPUs contend for the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
memory region, the chance of cmpxchg() failure increases further
The cmpxchg() is removed as part of 2 related changes:
- Update prod and cmdq owner in a single atomic add operation. For this, we
count the prod and owner in separate regions of the prod memory (see the
layout sketch after this list).
As with simple binary counting, once the prod+wrap fields overflow, they
will zero. They should never overflow into the "owner" region, and we zero
the non-owner, prod region for each new owner. This maintains the prod
pointer.
As for the "owner", we now count this value, instead of setting a flag.
Similar to before, once the owner has finished gathering, it will clear
a mask. As such, a CPU declares itself as the "owner" when it reads zero
for this region. This zeroing will also clear possible overflow in the
wrap+prod region, above.
The owner is now responsible for all cmdq locking, to avoid possible
deadlock. The owner will lock the cmdq for all the non-owners it has
gathered once they have space in the queue and have written their entries.
- Check for space in the cmdq after the prod pointer has been assigned.
We don't bother checking for space in the cmdq before assigning the prod
pointer, as this would be racy.
Since the prod pointer is updated unconditionally, it would be common
for no space to be available in the cmdq when prod is assigned - that
is, according to the software-maintained prod and cons pointers. So we
must now ensure that the entries are not written until there is space.
How the prod pointer is maintained also leads to a strange condition
where the prod pointer can wrap past the cons pointer. We can detect this
condition, and report no space here. However, a prod pointer which has
progressed twice past the cons pointer cannot be detected. But we can
ensure that this scenario does not occur, as we limit the number of
commands any CPU can issue at any given time, such that the prod pointer
cannot progress that far.
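To illustrate, the 32b prod word is now laid out roughly as below (a
sketch derived from the masks in the patch; bit 31 is unused, and the
spare bit above prod guards against overflow into the owner count):

	 31  30 ... max_n_shift+2  max_n_shift+1  max_n_shift ... 0
	+---+---------------------+--------------+------------------+
	| - |     owner count     |    spare     |   wrap + prod    |
	+---+---------------------+--------------+------------------+

Space is then claimed and the owner elected with a single atomic add,
each CPU adding 1 to the owner count alongside its commands:

	prod = atomic_fetch_add(n + sync + owner_val,
				&cmdq->q.llq.atomic.prod);
	owner = !(prod & owner_mask);	/* owner count was previously 0 */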
Signed-off-by: John Garry <john.garry@huawei.com>
---
drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
1 file changed, 61 insertions(+), 40 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 45a39ccaf455..54d056f10bea 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
prod = Q_IDX(q, q->prod);
cons = Q_IDX(q, q->cons);
- if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+ if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+ /* check if we have wrapped, meaning definitely no space */
+ if (cons > prod)
+ return false;
+
space = (1 << q->max_n_shift) - (prod - cons);
- else
+ } else {
+ /* similar check to above */
+ if (prod > cons)
+ return false;
+
space = cons - prod;
+ }
return space >= n;
}
@@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
* fails if the caller appears to be the last lock holder (yes, this is
* racy). All successful UNLOCK routines have RELEASE semantics.
*/
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
{
int val;
@@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
* to INT_MIN so these increments won't hurt as the value will remain
* negative.
*/
- if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+ if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
return;
do {
val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
- } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+ } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
}
static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
* insert their own list of commands then all of the commands from one
* CPU will appear before any of the commands from the other CPU.
*
- * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
- * more than the permitted number of commands at once.
+ * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
+ * for when the cmdq is full, such that we don't wrap more than twice.
+ * It also makes it easy for the owner to know by how many to increment the
+ * cmdq lock.
*/
static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u64 *cmds, int n)
@@ -1388,39 +1399,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
.max_n_shift = cmdq->q.llq.max_n_shift,
- }, head = llq;
+ }, head = llq, space = llq;
+ u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
+ u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
+ u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
int ret = 0;
/* 1. Allocate some space in the queue */
local_irq_save(flags);
- llq.val = READ_ONCE(cmdq->q.llq.val);
- do {
- u64 old;
- while (!queue_has_space(&llq, n + sync)) {
- local_irq_restore(flags);
- if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
- local_irq_save(flags);
- }
+ prod = atomic_fetch_add(n + sync + owner_val,
+ &cmdq->q.llq.atomic.prod);
- head.cons = llq.cons;
- head.prod = queue_inc_prod_n(&llq, n + sync) |
- CMDQ_PROD_OWNED_FLAG;
+ owner = !(prod & owner_mask);
+ llq.prod = prod_mask & prod;
+ head.prod = queue_inc_prod_n(&llq, n + sync);
- old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
- if (old == llq.val)
- break;
+ /*
+ * Ensure it's safe to write the entries. For this, we need to ensure
+ * that there is space in the queue from our prod pointer.
+ */
+ space.cons = READ_ONCE(cmdq->q.llq.cons);
+ space.prod = llq.prod;
+ while (!queue_has_space(&space, n + sync)) {
+ if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
+ dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
- llq.val = old;
- } while (1);
- owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
- head.prod &= ~CMDQ_PROD_OWNED_FLAG;
- llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+ space.prod = llq.prod;
+ }
/*
* 2. Write our commands into the queue
- * Dependency ordering from the cmpxchg() loop above.
+ * Dependency ordering from the space-checking while loop, above.
*/
arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
@@ -1428,26 +1438,24 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
- /*
- * In order to determine completion of our CMD_SYNC, we must ensure
- * that the queue can't wrap twice without us noticing. We achieve that
- * by taking the cmdq lock as shared before marking our slot as valid.
- */
- arm_smmu_cmdq_shared_lock(cmdq);
-
/* 3. Mark our slots as valid, ensuring commands are visible first */
dma_wmb();
arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
/* 4. If we are the owner, take control of the SMMU hardware */
if (owner) {
+ int owner_count;
+ u32 prod_tmp;
+
/* a. Wait for previous owner to finish */
atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
- /* b. Stop gathering work by clearing the owned flag */
- prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG,
- &cmdq->q.llq.atomic.prod);
- prod &= ~CMDQ_PROD_OWNED_FLAG;
+ /* b. Stop gathering work by clearing the owned mask */
+ prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
+ &cmdq->q.llq.atomic.prod);
+ prod = prod_tmp & prod_mask;
+ owner_count = prod_tmp & owner_mask;
+ owner_count >>= cmdq->q.llq.owner_count_shift;
/*
* c. Wait for any gathered work to be written to the queue.
@@ -1456,6 +1464,19 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
*/
arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
+ /*
+ * In order to determine completion of the CMD_SYNCs, we must
+ * ensure that the queue can't wrap twice without us noticing.
+ * We achieve that by taking the cmdq lock as shared before
+ * progressing the prod pointer.
+ * The owner does this for all the non-owners it has gathered.
+ * Otherwise, some non-owner(s) may lock the cmdq, blocking
+ * cons from being updated. This could be when the cmdq has just
+ * become full. In this case, other sibling non-owners could be
+ * blocked from progressing, leading to deadlock.
+ */
+ arm_smmu_cmdq_shared_lock(cmdq, owner_count);
+
/*
* d. Advance the hardware prod pointer
* Control dependency ordering from the entries becoming valid.
--
2.26.2
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
From: kernel test robot @ 2020-06-23 1:07 UTC (permalink / raw)
To: John Garry, will, robin.murphy
Cc: kbuild-all, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
linuxarm, maz, John Garry
Hi John,
I love your patch! Perhaps something to improve:
[auto build test WARNING on iommu/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm64-randconfig-c024-20200622 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>, old ones prefixed by <<):
In file included from include/linux/bits.h:23,
from include/linux/ioport.h:15,
from include/linux/acpi.h:12,
from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
| ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
| ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
| ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
| ^~~~~~~
include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
| ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
| ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
| ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
| ^~~~~~~
vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
1369
1370 /*
1371 * This is the actual insertion function, and provides the following
1372 * ordering guarantees to callers:
1373 *
1374 * - There is a dma_wmb() before publishing any commands to the queue.
1375 * This can be relied upon to order prior writes to data structures
1376 * in memory (such as a CD or an STE) before the command.
1377 *
1378 * - On completion of a CMD_SYNC, there is a control dependency.
1379 * This can be relied upon to order subsequent writes to memory (e.g.
1380 * freeing an IOVA) after completion of the CMD_SYNC.
1381 *
1382 * - Command insertion is totally ordered, so if two CPUs each race to
1383 * insert their own list of commands then all of the commands from one
1384 * CPU will appear before any of the commands from the other CPU.
1385 *
1386 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
1387 * for when the cmdq is full, such that we don't wrap more than twice.
1388 * It also makes it easy for the owner to know by how many to increment the
1389 * cmdq lock.
1390 */
1391 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
1392 u64 *cmds, int n)
1393 {
1394 u64 cmd_sync[CMDQ_ENT_DWORDS];
1395 const int sync = 1;
1396 u32 prod;
1397 unsigned long flags;
1398 bool owner;
1399 struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
1400 struct arm_smmu_ll_queue llq = {
1401 .max_n_shift = cmdq->q.llq.max_n_shift,
1402 }, head = llq, space = llq;
1403 u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
> 1404 u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
1405 u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
1406 int ret = 0;
1407
1408 /* 1. Allocate some space in the queue */
1409 local_irq_save(flags);
1410
1411 prod = atomic_fetch_add(n + sync + owner_val,
1412 &cmdq->q.llq.atomic.prod);
1413
1414 owner = !(prod & owner_mask);
1415 llq.prod = prod_mask & prod;
1416 head.prod = queue_inc_prod_n(&llq, n + sync);
1417
1418 /*
1419 * Ensure it's safe to write the entries. For this, we need to ensure
1420 * that there is space in the queue from our prod pointer.
1421 */
1422 space.cons = READ_ONCE(cmdq->q.llq.cons);
1423 space.prod = llq.prod;
1424 while (!queue_has_space(&space, n + sync)) {
1425 if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
1426 dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
1427
1428 space.prod = llq.prod;
1429 }
1430
1431 /*
1432 * 2. Write our commands into the queue
1433 * Dependency ordering from the space-checking while loop, above.
1434 */
1435 arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
1436
1437 prod = queue_inc_prod_n(&llq, n);
1438 arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
1439 queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
1440
1441 /* 3. Mark our slots as valid, ensuring commands are visible first */
1442 dma_wmb();
1443 arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
1444
1445 /* 4. If we are the owner, take control of the SMMU hardware */
1446 if (owner) {
1447 int owner_count;
1448 u32 prod_tmp;
1449
1450 /* a. Wait for previous owner to finish */
1451 atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
1452
1453 /* b. Stop gathering work by clearing the owned mask */
1454 prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
1455 &cmdq->q.llq.atomic.prod);
1456 prod = prod_tmp & prod_mask;
1457 owner_count = prod_tmp & owner_mask;
1458 owner_count >>= cmdq->q.llq.owner_count_shift;
1459
1460 /*
1461 * c. Wait for any gathered work to be written to the queue.
1462 * Note that we read our own entries so that we have the control
1463 * dependency required by (d).
1464 */
1465 arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
1466
1467 /*
1468 * In order to determine completion of the CMD_SYNCs, we must
1469 * ensure that the queue can't wrap twice without us noticing.
1470 * We achieve that by taking the cmdq lock as shared before
1471 * progressing the prod pointer.
1472 * The owner does this for all the non-owners it has gathered.
1473 * Otherwise, some non-owner(s) may lock the cmdq, blocking
1474 * cons from being updated. This could be when the cmdq has just
1475 * become full. In this case, other sibling non-owners could be
1476 * blocked from progressing, leading to deadlock.
1477 */
1478 arm_smmu_cmdq_shared_lock(cmdq, owner_count);
1479
1480 /*
1481 * d. Advance the hardware prod pointer
1482 * Control dependency ordering from the entries becoming valid.
1483 */
1484 writel_relaxed(prod, cmdq->q.prod_reg);
1485
1486 /*
1487 * e. Tell the next owner we're done
1488 * Make sure we've updated the hardware first, so that we don't
1489 * race to update prod and potentially move it backwards.
1490 */
1491 atomic_set_release(&cmdq->owner_prod, prod);
1492 }
1493
1494 /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
1495 llq.prod = queue_inc_prod_n(&llq, n);
1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
1497 if (ret)
1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
1499 llq.prod, readl_relaxed(cmdq->q.prod_reg),
1500 readl_relaxed(cmdq->q.cons_reg));
1501
1502 /*
1503 * Try to unlock the cmdq lock. This will fail if we're the last reader,
1504 * in which case we can safely update cmdq->q.llq.cons
1505 */
1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
1508 arm_smmu_cmdq_shared_unlock(cmdq);
1509 }
1510
1511 local_irq_restore(flags);
1512 return ret;
1513 }
1514
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
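The warnings above fire because GENMASK()'s GENMASK_INPUT_CHECK() is
intended for compile-time constant arguments, while max_n_shift here is a
runtime value. A possible (untested, illustrative) way to avoid them would
be to open-code the mask, which is equivalent to GENMASK(max_n_shift, 0)
for any max_n_shift < 31:

	u32 prod_mask = (1U << (cmdq->q.llq.max_n_shift + 1)) - 1;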
1468 * In order to determine completion of the CMD_SYNCs, we must
1469 * ensure that the queue can't wrap twice without us noticing.
1470 * We achieve that by taking the cmdq lock as shared before
1471 * progressing the prod pointer.
1472 * The owner does this for all the non-owners it has gathered.
1473 * Otherwise, some non-owner(s) may lock the cmdq, blocking
1474 * cons being updating. This could be when the cmdq has just
1475 * become full. In this case, other sibling non-owners could be
1476 * blocked from progressing, leading to deadlock.
1477 */
1478 arm_smmu_cmdq_shared_lock(cmdq, owner_count);
1479
1480 /*
1481 * d. Advance the hardware prod pointer
1482 * Control dependency ordering from the entries becoming valid.
1483 */
1484 writel_relaxed(prod, cmdq->q.prod_reg);
1485
1486 /*
1487 * e. Tell the next owner we're done
1488 * Make sure we've updated the hardware first, so that we don't
1489 * race to update prod and potentially move it backwards.
1490 */
1491 atomic_set_release(&cmdq->owner_prod, prod);
1492 }
1493
1494 /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
1495 llq.prod = queue_inc_prod_n(&llq, n);
1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
1497 if (ret)
1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
1499 llq.prod, readl_relaxed(cmdq->q.prod_reg),
1500 readl_relaxed(cmdq->q.cons_reg));
1501
1502 /*
1503 * Try to unlock the cmdq lock. This will fail if we're the last reader,
1504 * in which case we can safely update cmdq->q.llq.cons
1505 */
1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
1508 arm_smmu_cmdq_shared_unlock(cmdq);
1509 }
1510
1511 local_irq_restore(flags);
1512 return ret;
1513 }
1514
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30901 bytes --]
[-- Attachment #3: Type: text/plain, Size: 156 bytes --]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
@ 2020-06-23 1:07 ` kernel test robot
0 siblings, 0 replies; 69+ messages in thread
From: kernel test robot @ 2020-06-23 1:07 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 8850 bytes --]
Hi John,
I love your patch! Perhaps something to improve:
[auto build test WARNING on iommu/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm64-randconfig-c024-20200622 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>, old ones prefixed by <<):
In file included from include/linux/bits.h:23,
from include/linux/ioport.h:15,
from include/linux/acpi.h:12,
from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
| ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
| ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
| ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
| ^~~~~~~
include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
| ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
| ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
| ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
| ^~~~~~~
vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
1369
1370 /*
1371 * This is the actual insertion function, and provides the following
1372 * ordering guarantees to callers:
1373 *
1374 * - There is a dma_wmb() before publishing any commands to the queue.
1375 * This can be relied upon to order prior writes to data structures
1376 * in memory (such as a CD or an STE) before the command.
1377 *
1378 * - On completion of a CMD_SYNC, there is a control dependency.
1379 * This can be relied upon to order subsequent writes to memory (e.g.
1380 * freeing an IOVA) after completion of the CMD_SYNC.
1381 *
1382 * - Command insertion is totally ordered, so if two CPUs each race to
1383 * insert their own list of commands then all of the commands from one
1384 * CPU will appear before any of the commands from the other CPU.
1385 *
1386 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
1387 * for when the cmdq is full, such that we don't wrap more than twice.
1388 * It also makes it easy for the owner to know by how many to increment the
1389 * cmdq lock.
1390 */
1391 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
1392 u64 *cmds, int n)
1393 {
1394 u64 cmd_sync[CMDQ_ENT_DWORDS];
1395 const int sync = 1;
1396 u32 prod;
1397 unsigned long flags;
1398 bool owner;
1399 struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
1400 struct arm_smmu_ll_queue llq = {
1401 .max_n_shift = cmdq->q.llq.max_n_shift,
1402 }, head = llq, space = llq;
1403 u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
> 1404 u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
1405 u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
1406 int ret = 0;
1407
1408 /* 1. Allocate some space in the queue */
1409 local_irq_save(flags);
1410
1411 prod = atomic_fetch_add(n + sync + owner_val,
1412 &cmdq->q.llq.atomic.prod);
1413
1414 owner = !(prod & owner_mask);
1415 llq.prod = prod_mask & prod;
1416 head.prod = queue_inc_prod_n(&llq, n + sync);
1417
1418 /*
1419 * Ensure it's safe to write the entries. For this, we need to ensure
1420 * that there is space in the queue from our prod pointer.
1421 */
1422 space.cons = READ_ONCE(cmdq->q.llq.cons);
1423 space.prod = llq.prod;
1424 while (!queue_has_space(&space, n + sync)) {
1425 if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
1426 dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
1427
1428 space.prod = llq.prod;
1429 }
1430
1431 /*
1432 * 2. Write our commands into the queue
1433 * Dependency ordering from the space-checking while loop, above.
1434 */
1435 arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
1436
1437 prod = queue_inc_prod_n(&llq, n);
1438 arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
1439 queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
1440
1441 /* 3. Mark our slots as valid, ensuring commands are visible first */
1442 dma_wmb();
1443 arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
1444
1445 /* 4. If we are the owner, take control of the SMMU hardware */
1446 if (owner) {
1447 int owner_count;
1448 u32 prod_tmp;
1449
1450 /* a. Wait for previous owner to finish */
1451 atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
1452
1453 /* b. Stop gathering work by clearing the owned mask */
1454 prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
1455 &cmdq->q.llq.atomic.prod);
1456 prod = prod_tmp & prod_mask;
1457 owner_count = prod_tmp & owner_mask;
1458 owner_count >>= cmdq->q.llq.owner_count_shift;
1459
1460 /*
1461 * c. Wait for any gathered work to be written to the queue.
1462 * Note that we read our own entries so that we have the control
1463 * dependency required by (d).
1464 */
1465 arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
1466
1467 /*
1468 * In order to determine completion of the CMD_SYNCs, we must
1469 * ensure that the queue can't wrap twice without us noticing.
1470 * We achieve that by taking the cmdq lock as shared before
1471 * progressing the prod pointer.
1472 * The owner does this for all the non-owners it has gathered.
1473 * Otherwise, some non-owner(s) may lock the cmdq, blocking
1474 * cons being updating. This could be when the cmdq has just
1475 * become full. In this case, other sibling non-owners could be
1476 * blocked from progressing, leading to deadlock.
1477 */
1478 arm_smmu_cmdq_shared_lock(cmdq, owner_count);
1479
1480 /*
1481 * d. Advance the hardware prod pointer
1482 * Control dependency ordering from the entries becoming valid.
1483 */
1484 writel_relaxed(prod, cmdq->q.prod_reg);
1485
1486 /*
1487 * e. Tell the next owner we're done
1488 * Make sure we've updated the hardware first, so that we don't
1489 * race to update prod and potentially move it backwards.
1490 */
1491 atomic_set_release(&cmdq->owner_prod, prod);
1492 }
1493
1494 /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
1495 llq.prod = queue_inc_prod_n(&llq, n);
1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
1497 if (ret)
1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
1499 llq.prod, readl_relaxed(cmdq->q.prod_reg),
1500 readl_relaxed(cmdq->q.cons_reg));
1501
1502 /*
1503 * Try to unlock the cmdq lock. This will fail if we're the last reader,
1504 * in which case we can safely update cmdq->q.llq.cons
1505 */
1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
1508 arm_smmu_cmdq_shared_unlock(cmdq);
1509 }
1510
1511 local_irq_restore(flags);
1512 return ret;
1513 }
1514
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 30901 bytes --]
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 1:07 ` kernel test robot
@ 2020-06-23 9:21 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-06-23 9:21 UTC (permalink / raw)
To: kernel test robot, will, robin.murphy, rikard.falkeborn
Cc: kbuild-all, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
Linuxarm, maz
On 23/06/2020 02:07, kernel test robot wrote:
+ Rikard, as the GENMASK compile-time sanity checks were added recently
> Hi John,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on iommu/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm64-randconfig-c024-20200622 (attached as .config)
> compiler: aarch64-linux-gcc (GCC) 9.3.0
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> In file included from include/linux/bits.h:23,
> from include/linux/ioport.h:15,
> from include/linux/acpi.h:12,
> from drivers/iommu/arm-smmu-v3.c:12:
> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
> include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
I'd say that GENMASK_INPUT_CHECK() should be able to handle l=0 with an
unsigned h value, so I doubt this warning is valid.
Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
like GENMASK_INPUT_CHECK() could be improved.
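To make that concrete, a reduced illustration (prod_mask_fixed is just an
illustrative name, not from the patch): max_n_shift is a u32, so the
"(l) > (h)" test in GENMASK_INPUT_CHECK() expands to "0 > (unsigned)",
which gcc normalises to an always-false "unsigned < 0" comparison under
-Wtype-limits. Casting the high bound keeps the comparison signed:

	u32 max_n_shift = cmdq->q.llq.max_n_shift;

	/* warns under W=1: 0 can never be greater than an unsigned value */
	u32 prod_mask = GENMASK(max_n_shift, 0);

	/* the cast mentioned above forces a signed comparison: no warning */
	u32 prod_mask_fixed = GENMASK((int)max_n_shift, 0);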
> | ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> | ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> | ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> | ^~~~~~~
> include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> | ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> | ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> | ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> | ^~~~~~~
>
> vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 9:21 ` John Garry
@ 2020-06-23 9:35 ` Rikard Falkeborn
2020-06-23 10:19 ` John Garry
-1 siblings, 1 reply; 69+ messages in thread
From: Rikard Falkeborn @ 2020-06-23 9:35 UTC (permalink / raw)
To: John Garry
Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
maz, kbuild-all, robin.murphy, linux-arm-kernel
Hi
On Tue, 23 Jun 2020 at 11:22, John Garry <john.garry@huawei.com> wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
> > Hi John,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on iommu/next]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use as documented in
> > https://git-scm.com/docs/git-format-patch]
> >
> > url:
> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
> > base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git
> next
> > config: arm64-randconfig-c024-20200622 (attached as .config)
> > compiler: aarch64-linux-gcc (GCC) 9.3.0
> >
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>, old ones prefixed by <<):
> >
> > In file included from include/linux/bits.h:23,
> > from include/linux/ioport.h:15,
> > from include/linux/acpi.h:12,
> > from drivers/iommu/arm-smmu-v3.c:12:
> > drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
> > include/linux/bits.h:26:28: warning: comparison of unsigned expression <
> 0 is always false [-Wtype-limits]
> > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.
>
Indeed it could, it is fixed in -next.
Rikard
> | ^
> > include/linux/build_bug.h:16:62: note: in definition of macro
> 'BUILD_BUG_ON_ZERO'
> > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e));
> })))
> > | ^
> > include/linux/bits.h:39:3: note: in expansion of macro
> 'GENMASK_INPUT_CHECK'
> > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> > | ^~~~~~~~~~~~~~~~~~~
> >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro
> 'GENMASK'
> > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> > | ^~~~~~~
> > include/linux/bits.h:26:40: warning: comparison of unsigned expression <
> 0 is always false [-Wtype-limits]
> > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> > | ^
> > include/linux/build_bug.h:16:62: note: in definition of macro
> 'BUILD_BUG_ON_ZERO'
> > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e));
> })))
> > | ^
> > include/linux/bits.h:39:3: note: in expansion of macro
> 'GENMASK_INPUT_CHECK'
> > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> > | ^~~~~~~~~~~~~~~~~~~
> >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro
> 'GENMASK'
> > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> > | ^~~~~~~
> >
> > vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
>
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 9:35 ` Rikard Falkeborn
2020-06-23 10:19 ` John Garry
@ 2020-06-23 10:19 ` John Garry
0 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-06-23 10:19 UTC (permalink / raw)
To: Rikard Falkeborn, joro
Cc: kernel test robot, will, robin.murphy, kbuild-all, trivial,
linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz
On 23/06/2020 10:35, Rikard Falkeborn wrote:
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
> looks
> like GENMASK_INPUT_CHECK() could be improved.
>
>
> Indeed it could, it is fixed in -next.
ok, thanks for the pointer, but I still see this on today's -next with
this patch:
make W=1 drivers/iommu/arm-smmu-v3.o
In file included from ./include/linux/bits.h:23:0,
from ./include/linux/ioport.h:15,
from ./include/linux/acpi.h:12,
from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’:
./include/linux/bits.h:27:7: warning: comparison of unsigned expression
< 0 is always false [-Wtype-limits]
(l) > (h), 0)))
^
./include/linux/build_bug.h:16:62: note: in definition of macro
‘BUILD_BUG_ON_ZERO’
#define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
^
./include/linux/bits.h:40:3: note: in expansion of macro
‘GENMASK_INPUT_CHECK’
(GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
^~~~~~~~~~~~~~~~~~~
drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’
u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
That's gcc 7.5.0.
Cheers,
John
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 10:19 ` John Garry
@ 2020-06-23 13:55 ` Rikard Falkeborn
2020-06-26 10:05 ` John Garry
-1 siblings, 1 reply; 69+ messages in thread
From: Rikard Falkeborn @ 2020-06-23 13:55 UTC (permalink / raw)
To: John Garry
Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu,
maz, kbuild-all, robin.murphy, linux-arm-kernel
On Tue, 23 Jun 2020 at 12:21, John Garry <john.garry@huawei.com> wrote:
> On 23/06/2020 10:35, Rikard Falkeborn wrote:
> >
> > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> > h=unsigned value, so I doubt this warn.
> >
> > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
> > looks
> > like GENMASK_INPUT_CHECK() could be improved.
> >
> >
> > Indeed it could, it is fixed in -next.
>
> ok, thanks for the pointer, but I still see this on today's -next with
> this patch:
>
> make W=1 drivers/iommu/arm-smmu-v3.o
>
>
Oh, OK, thanks for reporting. I guess different gcc versions behave
differently. We'll probably have to change the comparison to
(!((h) == (l) || (h) > (l))) instead (not sure I got all the parentheses
and logic right, but you get the idea).
I'm travelling and won't have time to look at this until next week, though.
Rikard
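For reference, that suggestion would amount to something like the
following (untested, and not necessarily what was eventually merged); the
point is that "(h) == (l) || (h) > (l)" avoids the always-false
"unsigned < 0" pattern that trips -Wtype-limits, while keeping the same
truth value as "(l) > (h)":

#define GENMASK_INPUT_CHECK(h, l) \
	(BUILD_BUG_ON_ZERO(__builtin_choose_expr( \
		__builtin_constant_p(!((h) == (l) || (h) > (l))), \
		!((h) == (l) || (h) > (l)), 0)))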
In file included from ./include/linux/bits.h:23:0,
> from ./include/linux/ioport.h:15,
> from ./include/linux/acpi.h:12,
> from drivers/iommu/arm-smmu-v3.c:12:
> drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’:
> ./include/linux/bits.h:27:7: warning: comparison of unsigned expression
> < 0 is always false [-Wtype-limits]
> (l) > (h), 0)))
> ^
> ./include/linux/build_bug.h:16:62: note: in definition of macro
> ‘BUILD_BUG_ON_ZERO’
> #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> ^
> ./include/linux/bits.h:40:3: note: in expansion of macro
> ‘GENMASK_INPUT_CHECK’
> (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> ^~~~~~~~~~~~~~~~~~~
> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’
> u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
>
> That's gcc 7.5.0 .
>
> Cheers,
> John
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 9:21 ` John Garry
(?)
(?)
@ 2020-06-23 16:22 ` Robin Murphy
-1 siblings, 0 replies; 69+ messages in thread
From: Robin Murphy @ 2020-06-23 16:22 UTC (permalink / raw)
To: John Garry, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
linux-arm-kernel
On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
>>
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git
>> next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>> In file included from include/linux/bits.h:23,
>> from include/linux/ioport.h:15,
>> from include/linux/acpi.h:12,
>> from drivers/iommu/arm-smmu-v3.c:12:
>> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>> include/linux/bits.h:26:28: warning: comparison of unsigned expression
>> < 0 is always false [-Wtype-limits]
>> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.
That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...
Robin.
----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:
add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function                                     old     new   delta
arm_smmu_init_one_queue                      360     364      +4
arm_smmu_priq_thread                         512     508      -4
arm_smmu_evtq_thread                         300     292      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
queue_remove_raw                             180     164     -16
arm_smmu_gerror_handler                      732     716     -16
arm_smmu_device_probe                       4312    4284     -28
arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
Total: Before=20135, After=20019, chg -0.58%
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
#define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1
#define ARM_SMMU_MEMATTR_OIWB 0xf
-#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p) ((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p) ((p) & (llq)->max_n)
#define Q_OVERFLOW_FLAG (1U << 31)
#define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG)
#define Q_ENT(q, p) ((q)->base + \
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
} atomic;
u8 __pad[SMP_CACHE_BYTES];
} ____cacheline_aligned_in_smp;
- u32 max_n_shift;
+ u32 max_n;
};
struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
cons = Q_IDX(q, q->cons);
if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
- space = (1 << q->max_n_shift) - (prod - cons);
+ space = q->max_n - (prod - cons);
else
space = cons - prod;
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
{
u32 swidx, sbidx, ewidx, ebidx;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = sprod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = sprod,
};
ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
{
int i;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = prod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = prod,
};
for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
bool owner;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
+ .max_n = cmdq->q.llq.max_n,
}, head = llq;
int ret = 0;
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
size_t qsz;
do {
- qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+ qsz = (q->llq.max_n * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
GFP_KERNEL);
if (q->base || qsz < PAGE_SIZE)
break;
- q->llq.max_n_shift--;
+ q->llq.max_n >>= 1;
} while (1);
if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
if (!WARN_ON(q->base_dma & (qsz - 1))) {
dev_info(smmu->dev, "allocated %u entries for %s\n",
- 1 << q->llq.max_n_shift, name);
+ q->llq.max_n, name);
}
q->prod_reg = arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
q->q_base = Q_BASE_RWA;
q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
- q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+ q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
q->llq.prod = q->llq.cons = 0;
return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
{
int ret = 0;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
- unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_long_t *bitmap;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
- bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+ bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
if (!bitmap) {
dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
{
- u32 reg;
+ u32 reg, max_n_shift;
bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Queue sizes, capped to ensure natural alignment */
- smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_CMDQS, reg));
- if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+ max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+ smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+ if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
/*
* We don't support splitting up batches, so one batch of
* commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
return -ENXIO;
}
- smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_EVTQS, reg));
- smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_PRIQS, reg));
+ max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+ smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+ max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+ smmu->priq.q.llq.max_n = 1 << max_n_shift;
/* SID/SSID sizes */
smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
--
2.23.0.dirty
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
@ 2020-06-23 16:22 ` Robin Murphy
0 siblings, 0 replies; 69+ messages in thread
From: Robin Murphy @ 2020-06-23 16:22 UTC (permalink / raw)
To: John Garry, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
linux-arm-kernel
On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
>>
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git
>> next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>> In file included from include/linux/bits.h:23,
>> from include/linux/ioport.h:15,
>> from include/linux/acpi.h:12,
>> from drivers/iommu/arm-smmu-v3.c:12:
>> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>> include/linux/bits.h:26:28: warning: comparison of unsigned expression
>> < 0 is always false [-Wtype-limits]
>> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.
That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...
Robin.
----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:
add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function old new delta
arm_smmu_init_one_queue 360 364 +4
arm_smmu_priq_thread 512 508 -4
arm_smmu_evtq_thread 300 292 -8
__arm_smmu_cmdq_poll_set_valid_map.isra 296 288 -8
queue_remove_raw 180 164 -16
arm_smmu_gerror_handler 732 716 -16
arm_smmu_device_probe 4312 4284 -28
arm_smmu_cmdq_issue_cmdlist 1892 1852 -40
Total: Before=20135, After=20019, chg -0.58%
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
#define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1
#define ARM_SMMU_MEMATTR_OIWB 0xf
-#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p) ((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p) ((p) & (llq)->max_n)
#define Q_OVERFLOW_FLAG (1U << 31)
#define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG)
#define Q_ENT(q, p) ((q)->base + \
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
} atomic;
u8 __pad[SMP_CACHE_BYTES];
} ____cacheline_aligned_in_smp;
- u32 max_n_shift;
+ u32 max_n;
};
struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
cons = Q_IDX(q, q->cons);
if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
- space = (1 << q->max_n_shift) - (prod - cons);
+ space = q->max_n - (prod - cons);
else
space = cons - prod;
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
{
u32 swidx, sbidx, ewidx, ebidx;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = sprod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = sprod,
};
ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
{
int i;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = prod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = prod,
};
for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
bool owner;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
+ .max_n = cmdq->q.llq.max_n,
}, head = llq;
int ret = 0;
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
size_t qsz;
do {
- qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+ qsz = (q->llq.max_n * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
GFP_KERNEL);
if (q->base || qsz < PAGE_SIZE)
break;
- q->llq.max_n_shift--;
+ q->llq.max_n >>= 1;
} while (1);
if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
if (!WARN_ON(q->base_dma & (qsz - 1))) {
dev_info(smmu->dev, "allocated %u entries for %s\n",
- 1 << q->llq.max_n_shift, name);
+ q->llq.max_n, name);
}
q->prod_reg = arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
q->q_base = Q_BASE_RWA;
q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
- q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+ q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
q->llq.prod = q->llq.cons = 0;
return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
{
int ret = 0;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
- unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_long_t *bitmap;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
- bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+ bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
if (!bitmap) {
dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
{
- u32 reg;
+ u32 reg, max_n_shift;
bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Queue sizes, capped to ensure natural alignment */
- smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_CMDQS, reg));
- if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+ max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+ smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+ if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
/*
* We don't support splitting up batches, so one batch of
* commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
return -ENXIO;
}
- smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_EVTQS, reg));
- smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_PRIQS, reg));
+ max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+ smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+ max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+ smmu->priq.q.llq.max_n = 1 << max_n_shift;
/* SID/SSID sizes */
smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
--
2.23.0.dirty
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
@ 2020-06-23 16:22 ` Robin Murphy
0 siblings, 0 replies; 69+ messages in thread
From: Robin Murphy @ 2020-06-23 16:22 UTC (permalink / raw)
To: John Garry, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
linux-arm-kernel
On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
>>
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git
>> next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>> In file included from include/linux/bits.h:23,
>> from include/linux/ioport.h:15,
>> from include/linux/acpi.h:12,
>> from drivers/iommu/arm-smmu-v3.c:12:
>> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>> include/linux/bits.h:26:28: warning: comparison of unsigned expression
>> < 0 is always false [-Wtype-limits]
>> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
> h=unsigned value, so I doubt this warn.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.
That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...
Robin.
----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:
add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function old new delta
arm_smmu_init_one_queue 360 364 +4
arm_smmu_priq_thread 512 508 -4
arm_smmu_evtq_thread 300 292 -8
__arm_smmu_cmdq_poll_set_valid_map.isra 296 288 -8
queue_remove_raw 180 164 -16
arm_smmu_gerror_handler 732 716 -16
arm_smmu_device_probe 4312 4284 -28
arm_smmu_cmdq_issue_cmdlist 1892 1852 -40
Total: Before=20135, After=20019, chg -0.58%
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
#define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1
#define ARM_SMMU_MEMATTR_OIWB 0xf
-#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p) ((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p) ((p) & (llq)->max_n)
#define Q_OVERFLOW_FLAG (1U << 31)
#define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG)
#define Q_ENT(q, p) ((q)->base + \
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
} atomic;
u8 __pad[SMP_CACHE_BYTES];
} ____cacheline_aligned_in_smp;
- u32 max_n_shift;
+ u32 max_n;
};
struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
cons = Q_IDX(q, q->cons);
if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
- space = (1 << q->max_n_shift) - (prod - cons);
+ space = q->max_n - (prod - cons);
else
space = cons - prod;
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
{
u32 swidx, sbidx, ewidx, ebidx;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = sprod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = sprod,
};
ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
{
int i;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
- .prod = prod,
+ .max_n = cmdq->q.llq.max_n,
+ .prod = prod,
};
for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
bool owner;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
- .max_n_shift = cmdq->q.llq.max_n_shift,
+ .max_n = cmdq->q.llq.max_n,
}, head = llq;
int ret = 0;
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
size_t qsz;
do {
- qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+ qsz = (q->llq.max_n * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
GFP_KERNEL);
if (q->base || qsz < PAGE_SIZE)
break;
- q->llq.max_n_shift--;
+ q->llq.max_n >>= 1;
} while (1);
if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
if (!WARN_ON(q->base_dma & (qsz - 1))) {
dev_info(smmu->dev, "allocated %u entries for %s\n",
- 1 << q->llq.max_n_shift, name);
+ q->llq.max_n, name);
}
q->prod_reg = arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
q->q_base = Q_BASE_RWA;
q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
- q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+ q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
q->llq.prod = q->llq.cons = 0;
return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
{
int ret = 0;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
- unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_long_t *bitmap;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
- bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+ bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
if (!bitmap) {
dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
{
- u32 reg;
+ u32 reg, max_n_shift;
bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Queue sizes, capped to ensure natural alignment */
- smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_CMDQS, reg));
- if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+ max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+ smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+ if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
/*
* We don't support splitting up batches, so one batch of
* commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
return -ENXIO;
}
- smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_EVTQS, reg));
- smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_PRIQS, reg));
+ max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+ smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+ max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+ smmu->priq.q.llq.max_n = 1 << max_n_shift;
/* SID/SSID sizes */
smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
--
2.23.0.dirty
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 69+ messages in thread
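[Editor's note: a standalone userspace model of the Q_IDX()/Q_WRP() arithmetic from the patch above, using an illustrative max_n of 8. The prod counter runs through 2*max_n states: the low bits give the slot index and the max_n bit itself acts as the wrap flag.]

#include <stdio.h>

/* Illustrative stand-in for llq->max_n; any power of two works. */
#define MAX_N		8u
#define Q_IDX(p)	((p) & (MAX_N - 1))	/* slot index    */
#define Q_WRP(p)	((p) & MAX_N)		/* wrap flag bit */

int main(void)
{
	/* prod 6..10: indices 6, 7, then 0, 1, 2 with the wrap flag set */
	for (unsigned int p = 6; p <= 10; p++)
		printf("prod=%2u idx=%u wrap=%u\n", p, Q_IDX(p), !!Q_WRP(p));
	return 0;
}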
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 16:22 ` Robin Murphy
(?)
(?)
@ 2020-06-24 8:15 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-06-24 8:15 UTC (permalink / raw)
To: Robin Murphy, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
linux-arm-kernel
>>
>> I'd say that GENMASK_INPUT_CHECK() should be able to handle l=0 and an
>> unsigned h value, so I doubt this warning is valid.
>>
>> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
>> like GENMASK_INPUT_CHECK() could be improved.
>
> That said, I think this particular case might be even better off dodging
> GENMASK() entirely, by doing something like this first. Untested...
>
> Robin.
>
> ----->8-----
> Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
>
> Beyond the initial queue setup based on the log2 values from ID
> registers, the log2 queue size is only ever used in the form of
> (1 << max_n_shift) to repeatedly recalculate the number of queue
> elements. Simply storing it in that form leads to slightly more
> efficient code, particularly in the low-level queue accessors
> where it counts most:
>
> add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
> Function                                  old   new  delta
> arm_smmu_init_one_queue                   360   364     +4
> arm_smmu_priq_thread                      512   508     -4
> arm_smmu_evtq_thread                      300   292     -8
> __arm_smmu_cmdq_poll_set_valid_map.isra   296   288     -8
> queue_remove_raw                          180   164    -16
> arm_smmu_gerror_handler                   732   716    -16
> arm_smmu_device_probe                    4312  4284    -28
> arm_smmu_cmdq_issue_cmdlist              1892  1852    -40
> Total: Before=20135, After=20019, chg -0.58%
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
[...]
> }
>
> - smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
> - FIELD_GET(IDR1_EVTQS, reg));
> - smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
> - FIELD_GET(IDR1_PRIQS, reg));
> + max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
> + smmu->evtq.q.llq.max_n = 1 << max_n_shift;
So I require the bitmask of this for the prod, which would be
(max_n << 1) - 1.
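[Editor's note: a quick sketch of the mask John means here; the macro names are illustrative, not from the patch.]

/* prod spans the index bits plus the wrap bit one position above them */
#define Q_IDX_MASK(llq)		((llq)->max_n - 1)		/* index bits */
#define Q_WRP_BIT(llq)		((llq)->max_n)			/* wrap bit   */
#define Q_PROD_MASK(llq)	(((llq)->max_n << 1) - 1)	/* idx + wrap */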
I don't feel too strongly either way, and the other big changes in this
series need to be considered first...
Thanks,
John
> +
> + max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
> + smmu->priq.q.llq.max_n = 1 << max_n_shift;
>
> /* SID/SSID sizes */
> smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-23 13:55 ` Rikard Falkeborn
2020-06-26 10:05 ` John Garry
(?)
@ 2020-06-26 10:05 ` John Garry
0 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-06-26 10:05 UTC (permalink / raw)
To: Rikard Falkeborn
Cc: Joerg Roedel, kernel test robot, will, robin.murphy, kbuild-all,
trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz
On 23/06/2020 14:55, Rikard Falkeborn wrote:
> On Tue, 23 Jun 2020 at 12:21, John Garry <john.garry@huawei.com> wrote:
>
> On 23/06/2020 10:35, Rikard Falkeborn wrote:
> >
> > I'd say that GENMASK_INPUT_CHECK() should be able to handle l=0 and
> > an unsigned h value, so I doubt this warning is valid.
> >
> > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
> > looks like GENMASK_INPUT_CHECK() could be improved.
> >
> >
> > Indeed it could, it is fixed in -next.
>
> ok, thanks for the pointer, but I still see this on today's -next with
> this patch:
>
> make W=1 drivers/iommu/arm-smmu-v3.o
>
>
> Oh, OK, thanks for reporting. I guess different gcc versions have
> different behaviour. I guess we'll have to change the comparison to
> (!((h) == (l) || (h) > (l))) instead (not sure I got all parentheses and
> logic correct, but you get the idea).
>
Yeah, so this looks to fix it:

--- a/include/linux/bits.h
+++ b/include/linux/bits.h
@@ -23,7 +23,8 @@
 #include <linux/build_bug.h>
 #define GENMASK_INPUT_CHECK(h, l) \
 	(BUILD_BUG_ON_ZERO(__builtin_choose_expr( \
-		__builtin_constant_p((l) > (h)), (l) > (h), 0)))
+		__builtin_constant_p(!((h) == (l) || (h) > (l))), !((h) == (l) || (h) > (l)), 0)))
+
We may be able to just use (h) == (l) as the const expr to make it more
concise, but that may be confusing.
I only tested with my toolchain, which is based on GCC 7.5.0.
Thanks,
John
^ permalink raw reply [flat|nested] 69+ messages in thread
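[Editor's note: a minimal userspace reproducer of the W=1 warning discussed above; not from the thread. With l a constant 0 and h a variable unsigned value, the old GENMASK_INPUT_CHECK() comparison folds to "0 > h". Build with "gcc -c -Wtype-limits repro.c" to see it.]

/* repro.c */
unsigned int old_check(unsigned int h)
{
	/* warning: comparison of unsigned expression < 0 is always false */
	return 0 > h;
}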
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-06-22 17:28 ` John Garry
(?)
@ 2020-07-08 13:00 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-08 13:00 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz
On 22/06/2020 18:28, John Garry wrote:
Hi, can you guys let me know if this is on the radar at all?
I have been talking about this performance issue since January, and have
not really had any response.
thanks
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes approx 25% of the cycles for this function.
>
> This series removes that cmpxchg().
>
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1310 IOPs
> After: 1630 IOPs
>
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
>
> CPU count 32 64 128
> Before: 63187 19418 10169
> After: 93287 44789 15862
>
> (unit is map+unmaps per CPU per second)
>
> [0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
>
> John Garry (4):
> iommu/arm-smmu-v3: Fix trivial typo
> iommu/arm-smmu-v3: Calculate bits for prod and owner
> iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
> iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
>
> drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
> 1 file changed, 151 insertions(+), 82 deletions(-)
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-06-22 17:28 ` John Garry
(?)
@ 2020-07-16 10:19 ` Will Deacon
-1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 10:19 UTC (permalink / raw)
To: John Garry
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes approx 25% of the cycles for this function.
>
> This series removes that cmpxchg().
How about something much simpler like the diff below?
Will
--->8
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..705d99ec8219 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ spinlock_t slock;
};
struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u64 cmd_sync[CMDQ_ENT_DWORDS];
u32 prod;
unsigned long flags;
- bool owner;
+ bool owner, cas_failed = false;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,27 +1388,35 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
/* 1. Allocate some space in the queue */
local_irq_save(flags);
- llq.val = READ_ONCE(cmdq->q.llq.val);
do {
u64 old;
+ llq.val = READ_ONCE(cmdq->q.llq.val);
- while (!queue_has_space(&llq, n + sync)) {
+ if (queue_has_space(&llq, n + sync))
+ goto try_cas;
+
+ if (cas_failed)
+ spin_unlock(&cmdq->slock);
+
+ do {
local_irq_restore(flags);
if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
local_irq_save(flags);
- }
+ } while (!queue_has_space(&llq, n + sync));
+try_cas:
head.cons = llq.cons;
head.prod = queue_inc_prod_n(&llq, n + sync) |
CMDQ_PROD_OWNED_FLAG;
old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
- if (old == llq.val)
- break;
+ cas_failed = old != llq.val;
+
+ if (cas_failed)
+ spin_lock(&cmdq->slock);
+ } while (cas_failed);
- llq.val = old;
- } while (1);
owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
head.prod &= ~CMDQ_PROD_OWNED_FLAG;
llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3201,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
+ spin_lock_init(&cmdq->slock);
bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
if (!bitmap) {
^ permalink raw reply related [flat|nested] 69+ messages in thread
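[Editor's note: a heavily simplified sketch of the idea behind the diff above: CPUs that lose the cmpxchg() race serialise on the new slock and retry one at a time, rather than all of them hammering the cacheline. Space checking and IRQ flags are elided here, and -- as Will notes two messages below -- the posted diff does not drop the lock correctly, so treat this as intent, not working code.]

for (;;) {
	llq.val = READ_ONCE(cmdq->q.llq.val);
	head.cons = llq.cons;
	head.prod = queue_inc_prod_n(&llq, n + sync) | CMDQ_PROD_OWNED_FLAG;
	if (cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val) == llq.val)
		break;			/* won the race: space is allocated */
	spin_lock(&cmdq->slock);	/* lost: queue up behind other losers */
	spin_unlock(&cmdq->slock);
}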
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-06-22 17:28 ` John Garry
(?)
@ 2020-07-16 10:20 ` Will Deacon
-1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 10:20 UTC (permalink / raw)
To: John Garry
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
> It has been shown that the cmpxchg() for finding space in the cmdq can
> be a bottleneck:
> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
> - since the software-maintained cons pointer is updated on the same 64b
> memory region, the chance of cmpxchg() failure increases again
>
> The cmpxchg() is removed as part of 2 related changes:
>
> - Update prod and cmdq owner in a single atomic add operation. For this, we
> count the prod and owner in separate regions in prod memory.
>
> As with simple binary counting, once the prod+wrap fields overflow, they
> will zero. They should never overflow into "owner" region, and we zero
> the non-owner, prod region for each owner. This maintains the prod
> pointer.
>
> As for the "owner", we now count this value, instead of setting a flag.
> Similar to before, once the owner has finished gathering, it will clear
> a mask. As such, a CPU declares itself as the "owner" when it reads zero
> for this region. This zeroing will also clear possible overflow in
> wrap+prod region, above.
>
> The owner is now responsible for all cmdq locking to avoid possible
> deadlock. The owner will lock the cmdq for all non-owners it has gathered
> when they have space in the queue and have written their entries.
>
> - Check for space in the cmdq after the prod pointer has been assigned.
>
> We don't bother checking for space in the cmdq before assigning the prod
> pointer, as this would be racy.
>
> So since the prod pointer is updated unconditionally, it would be common
> for no space to be available in the cmdq when prod is assigned - that
> is, according to the software-maintained prod and cons pointers. So now
> it must be ensured that the entries are not written until there is
> space.
>
> How the prod pointer is maintained also leads to a strange condition
> where the prod pointer can wrap past the cons pointer. We can detect this
> condition, and report no space here. However, a prod pointer progressed
> twice past the cons pointer cannot be detected. But it can be ensured
> that this scenario does not occur, as we limit the number of commands
> any CPU can issue at any given time, such that we cannot progress the
> prod pointer that far.
>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
> drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
> 1 file changed, 61 insertions(+), 40 deletions(-)
I must admit, you made me smile putting trivial@kernel.org on cc for this ;)
Will
^ permalink raw reply [flat|nested] 69+ messages in thread
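[Editor's note: a sketch of the counting scheme the quoted commit message describes. The field widths and names here are illustrative assumptions, not the patch's actual layout, which derives the prod+wrap width from the queue size.]

/* prod index + wrap in the low bits, owner count stacked above them:
 * one atomic add both allocates queue space and takes an owner ticket.
 * The CPU that saw the owner region at zero becomes the owner. */
#define PROD_WRAP_MASK	GENMASK(16, 0)	/* prod index + wrap bit  */
#define OWNER_UNIT	(1U << 17)	/* +1 in the owner region */

u32 old = atomic_fetch_add(n + sync + OWNER_UNIT,
			   &cmdq->q.llq.atomic.prod);
bool owner = !(old & ~PROD_WRAP_MASK);	/* owner count was zero    */
u32 prod = old & PROD_WRAP_MASK;	/* this CPU's prod + wrap  */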
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 10:19 ` Will Deacon
(?)
@ 2020-07-16 10:22 ` Will Deacon
-1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 10:22 UTC (permalink / raw)
To: John Garry
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > As mentioned in [0], the CPU may consume many cycles processing
> > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> > get space on the queue takes approx 25% of the cycles for this function.
> >
> > This series removes that cmpxchg().
>
> How about something much simpler like the diff below?
Ah, scratch that, I don't drop the lock if we fail the cas with it held.
Let me hack it some more (I have no hardware so I can only build-test this).
Will
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
2020-07-16 10:20 ` Will Deacon
(?)
@ 2020-07-16 10:26 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 10:26 UTC (permalink / raw)
To: Will Deacon
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On 16/07/2020 11:20, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
>> It has been shown that the cmpxchg() for finding space in the cmdq can
>> be a bottleneck:
>> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
>> - since the software-maintained cons pointer is updated on the same 64b
>> memory region, the chance of cmpxchg() failure increases again
>>
>> The cmpxchg() is removed as part of 2 related changes:
>>
>> - Update prod and cmdq owner in a single atomic add operation. For this, we
>> count the prod and owner in separate regions in prod memory.
>>
>> As with simple binary counting, once the prod+wrap fields overflow, they
>> will zero. They should never overflow into "owner" region, and we zero
>> the non-owner, prod region for each owner. This maintains the prod
>> pointer.
>>
>> As for the "owner", we now count this value, instead of setting a flag.
>> Similar to before, once the owner has finished gathering, it will clear
>> a mask. As such, a CPU declares itself as the "owner" when it reads zero
>> for this region. This zeroing will also clear possible overflow in the
>> wrap+prod region, above.
>>
>> The owner is now responsible for all cmdq locking to avoid possible
>> deadlock. The owner will lock the cmdq for all non-owners it has gathered
>> when they have space in the queue and have written their entries.
>>
>> - Check for space in the cmdq after the prod pointer has been assigned.
>>
>> We don't bother checking for space in the cmdq before assigning the prod
>> pointer, as this would be racy.
>>
>> So since the prod pointer is updated unconditionally, it would be common
>> for no space to be available in the cmdq when prod is assigned - that
>> is, according to the software-maintained prod and cons pointers. So now
>> it must be ensured that the entries are not written until there is
>> space.
>>
>> How the prod pointer is maintained also leads to a strange condition
>> where the prod pointer can wrap past the cons pointer. We can detect this
>> condition, and report no space here. However, a prod pointer progressed
>> twice past the cons pointer cannot be detected. But it can be ensured
>> that this scenario does not occur, as we limit the number of commands
>> any CPU can issue at any given time, such that we cannot progress the
>> prod pointer that far.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>> drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>> 1 file changed, 61 insertions(+), 40 deletions(-)
>
> I must admit, you made me smile putting trivial@kernel.org on cc for this ;)
>
Yes, quite ironic :)
^ permalink raw reply [flat|nested] 69+ messages in thread
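The counting scheme in the commit message above boils down to something
like the toy model below. This is an illustration only - the field layout
is invented here, not taken from the patch - but it shows the core trick:
prod+wrap live in the low bits of one 32-bit word and the owner count in
the high bits, so claiming queue space and electing an owner is a single
atomic add rather than a cmpxchg() loop.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PROD_MASK	0x0000ffffu	/* prod index + wrap bit */
#define OWNER_UNIT	0x00010000u	/* one gathering CPU */

static _Atomic uint32_t llq_prod;

static uint32_t claim_space(uint32_t n, bool *owner)
{
	/* Bump prod by n and the owner count by 1, atomically */
	uint32_t old = atomic_fetch_add(&llq_prod, n + OWNER_UNIT);

	/* Whoever read the owner count as zero is the owner */
	*owner = (old & ~PROD_MASK) == 0;

	return old & PROD_MASK;		/* this CPU's slice starts here */
}

When the owner finishes gathering it clears the owner count - e.g. with
atomic_fetch_and(&llq_prod, PROD_MASK) - which, per the commit message,
also wipes any overflow out of the prod region.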
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 10:22 ` Will Deacon
@ 2020-07-16 10:28 ` Will Deacon
-1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 10:28 UTC (permalink / raw)
To: John Garry
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > > As mentioned in [0], the CPU may consume many cycles processing
> > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> > > get space on the queue takes approx 25% of the cycles for this function.
> > >
> > > This series removes that cmpxchg().
> >
> > How about something much simpler like the diff below?
>
> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
> Let me hack it some more (I have no hardware so I can only build-test this).
Right, second attempt...
Will
--->8
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..e6bcddd6ef69 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ spinlock_t slock;
};
struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u64 cmd_sync[CMDQ_ENT_DWORDS];
u32 prod;
unsigned long flags;
- bool owner;
+ bool owner, locked = false;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
struct arm_smmu_ll_queue llq = {
.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
/* 1. Allocate some space in the queue */
local_irq_save(flags);
- llq.val = READ_ONCE(cmdq->q.llq.val);
do {
u64 old;
+ llq.val = READ_ONCE(cmdq->q.llq.val);
- while (!queue_has_space(&llq, n + sync)) {
+ if (queue_has_space(&llq, n + sync))
+ goto try_cas;
+
+ if (locked)
+ spin_unlock(&cmdq->slock);
+
+ do {
local_irq_restore(flags);
if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
local_irq_save(flags);
- }
+ } while (!queue_has_space(&llq, n + sync));
+try_cas:
head.cons = llq.cons;
head.prod = queue_inc_prod_n(&llq, n + sync) |
CMDQ_PROD_OWNED_FLAG;
old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
- if (old == llq.val)
+ if (old != llq.val)
break;
- llq.val = old;
+ if (!locked) {
+ spin_lock(&cmdq->slock);
+ locked = true;
+ }
} while (1);
+
owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
head.prod &= ~CMDQ_PROD_OWNED_FLAG;
llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
+ spin_lock_init(&cmdq->slock);
bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
if (!bitmap) {
^ permalink raw reply related [flat|nested] 69+ messages in thread
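Stripped of the IRQ-flag juggling and the queue-full polling, the locking
idea in the diff reduces to roughly the sketch below. This is a reading of
the apparent intent rather than the literal patch (note the diff as posted
breaks out of the loop when the cmpxchg() *fails*, which comes up later in
the thread): a CPU that loses the CAS once takes slock, so at most one
loser keeps retrying the CAS while later arrivals queue up behind the lock.

	bool locked = false;

	do {
		llq.val = READ_ONCE(cmdq->q.llq.val);
		head.cons = llq.cons;
		head.prod = queue_inc_prod_n(&llq, n + sync) |
			    CMDQ_PROD_OWNED_FLAG;

		/* Try to claim our slots (and possibly ownership) */
		if (cmpxchg_relaxed(&cmdq->q.llq.val, llq.val,
				    head.val) == llq.val)
			break;

		/* Lost the race: serialise with the other losers */
		if (!locked) {
			spin_lock(&cmdq->slock);
			locked = true;
		}
	} while (1);

	if (locked)
		spin_unlock(&cmdq->slock);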
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 10:28 ` Will Deacon
@ 2020-07-16 10:56 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 10:56 UTC (permalink / raw)
To: Will Deacon
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On 16/07/2020 11:28, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>>> get space on the queue takes approx 25% of the cycles for this function.
>>>>
>>>> This series removes that cmpxchg().
>>>
>>> How about something much simpler like the diff below?
>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>> Let me hack it some more (I have no hardware so I can only build-test this).
>
> Right, second attempt...
I can try it, but if performance is not as good, then please check mine
further (patch 4/4 specifically) - performance is really good, IMHO.
Thanks,
>
> Will
>
> --->8
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677a5c41..e6bcddd6ef69 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
> atomic_long_t *valid_map;
> atomic_t owner_prod;
> atomic_t lock;
> + spinlock_t slock;
> };
>
> struct arm_smmu_cmdq_batch {
> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> u64 cmd_sync[CMDQ_ENT_DWORDS];
> u32 prod;
> unsigned long flags;
> - bool owner;
> + bool owner, locked = false;
> struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> struct arm_smmu_ll_queue llq = {
> .max_n_shift = cmdq->q.llq.max_n_shift,
> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>
> /* 1. Allocate some space in the queue */
> local_irq_save(flags);
> - llq.val = READ_ONCE(cmdq->q.llq.val);
> do {
> u64 old;
> + llq.val = READ_ONCE(cmdq->q.llq.val);
>
> - while (!queue_has_space(&llq, n + sync)) {
> + if (queue_has_space(&llq, n + sync))
> + goto try_cas;
> +
> + if (locked)
> + spin_unlock(&cmdq->slock);
> +
> + do {
> local_irq_restore(flags);
> if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
> dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> local_irq_save(flags);
> - }
> + } while (!queue_has_space(&llq, n + sync));
>
> +try_cas:
> head.cons = llq.cons;
> head.prod = queue_inc_prod_n(&llq, n + sync) |
> CMDQ_PROD_OWNED_FLAG;
>
> old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> - if (old == llq.val)
> + if (old != llq.val)
> break;
>
> - llq.val = old;
> + if (!locked) {
> + spin_lock(&cmdq->slock);
> + locked = true;
> + }
> } while (1);
> +
> owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
> head.prod &= ~CMDQ_PROD_OWNED_FLAG;
> llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>
> atomic_set(&cmdq->owner_prod, 0);
> atomic_set(&cmdq->lock, 0);
> + spin_lock_init(&cmdq->slock);
>
> bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
> if (!bitmap) {
> .
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 10:56 ` John Garry
@ 2020-07-16 11:22 ` Robin Murphy
-1 siblings, 0 replies; 69+ messages in thread
From: Robin Murphy @ 2020-07-16 11:22 UTC (permalink / raw)
To: John Garry, Will Deacon
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz
On 2020-07-16 11:56, John Garry wrote:
> On 16/07/2020 11:28, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>>>> loop to
>>>>> get space on the queue takes approx 25% of the cycles for this
>>>>> function.
>>>>>
>>>>> This series removes that cmpxchg().
>>>>
>>>> How about something much simpler like the diff below?
>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>> Let me hack it some more (I have no hardware so I can only build-test
>>> this).
>>
>> Right, second attempt...
>
> I can try it, but if performance if not as good, then please check mine
> further (patch 4/4 specifically) - performance is really good, IMHO.
Perhaps a silly question (I'm too engrossed in PMU world ATM to get
properly back up to speed on this), but couldn't this be done without
cmpxchg anyway? Instinctively it feels like instead of maintaining a
literal software copy of the prod value, we could resolve the "claim my
slot in the queue" part with atomic_fetch_add on a free-running 32-bit
"pseudo-prod" index, then whoever updates the hardware deals with the
truncation and wrap bit to convert it to an actual register value.
Robin.
>
> Thanks,
>
>>
>> Will
>>
>> --->8
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index f578677a5c41..e6bcddd6ef69 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>> atomic_long_t *valid_map;
>> atomic_t owner_prod;
>> atomic_t lock;
>> + spinlock_t slock;
>> };
>> struct arm_smmu_cmdq_batch {
>> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct
>> arm_smmu_device *smmu,
>> u64 cmd_sync[CMDQ_ENT_DWORDS];
>> u32 prod;
>> unsigned long flags;
>> - bool owner;
>> + bool owner, locked = false;
>> struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>> struct arm_smmu_ll_queue llq = {
>> .max_n_shift = cmdq->q.llq.max_n_shift,
>> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct
>> arm_smmu_device *smmu,
>> /* 1. Allocate some space in the queue */
>> local_irq_save(flags);
>> - llq.val = READ_ONCE(cmdq->q.llq.val);
>> do {
>> u64 old;
>> + llq.val = READ_ONCE(cmdq->q.llq.val);
>> - while (!queue_has_space(&llq, n + sync)) {
>> + if (queue_has_space(&llq, n + sync))
>> + goto try_cas;
>> +
>> + if (locked)
>> + spin_unlock(&cmdq->slock);
>> +
>> + do {
>> local_irq_restore(flags);
>> if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>> dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>> local_irq_save(flags);
>> - }
>> + } while (!queue_has_space(&llq, n + sync));
>> +try_cas:
>> head.cons = llq.cons;
>> head.prod = queue_inc_prod_n(&llq, n + sync) |
>> CMDQ_PROD_OWNED_FLAG;
>> old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
>> - if (old == llq.val)
>> + if (old != llq.val)
>> break;
>> - llq.val = old;
>> + if (!locked) {
>> + spin_lock(&cmdq->slock);
>> + locked = true;
>> + }
>> } while (1);
>> +
>> owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>> head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>> llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
>> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct
>> arm_smmu_device *smmu)
>> atomic_set(&cmdq->owner_prod, 0);
>> atomic_set(&cmdq->lock, 0);
>> + spin_lock_init(&cmdq->slock);
>> bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>> if (!bitmap) {
>> .
>>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
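Robin's idea can be modelled in isolation as below. This is a toy sketch -
the queue size, names and helpers are invented for illustration - but it
shows why no CAS is needed: slots are handed out with a plain atomic
increment on a free-running counter, and only the CPU that writes the
hardware folds that counter back into a { wrap | index } register value.
Bit Q_SHIFT of a free-running count toggles on every pass through the
queue, which is exactly the wrap-bit semantic.

#include <stdatomic.h>
#include <stdint.h>

#define Q_SHIFT		8			/* log2(queue entries) */
#define Q_IDX_MASK	((1u << Q_SHIFT) - 1)
#define Q_WRAP		(1u << Q_SHIFT)		/* wrap bit */

static _Atomic uint32_t pseudo_prod;	/* free-running, never truncated */

static uint32_t claim(uint32_t n)
{
	/* No retry loop: every caller succeeds first time */
	return atomic_fetch_add(&pseudo_prod, n);
}

static uint32_t to_prod_reg(uint32_t pseudo)
{
	/* Keep the index bits plus the toggling wrap bit */
	return pseudo & (Q_WRAP | Q_IDX_MASK);
}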
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 11:22 ` Robin Murphy
@ 2020-07-16 11:30 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 11:30 UTC (permalink / raw)
To: Robin Murphy, Will Deacon
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz
On 16/07/2020 12:22, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
>> On 16/07/2020 11:28, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>>>>> loop to
>>>>>> get space on the queue takes approx 25% of the cycles for this
>>>>>> function.
>>>>>>
>>>>>> This series removes that cmpxchg().
>>>>>
>>>>> How about something much simpler like the diff below?
>>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>>> Let me hack it some more (I have no hardware so I can only build-test
>>>> this).
>>>
>>> Right, second attempt...
>>
>> I can try it, but if performance if not as good, then please check mine
>> further (patch 4/4 specifically) - performance is really good, IMHO.
>
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get
> properly back up to speed on this), but couldn't this be done without
> cmpxchg anyway? Instinctively it feels like instead of maintaining a
> literal software copy of the prod value, we could resolve the "claim my
> slot in the queue" part with atomic_fetch_add on a free-running 32-bit
> "pseudo-prod" index, then whoever updates the hardware deals with the
> truncation and wrap bit to convert it to an actual register value.
>
That's what mine does. But I also need to take care of cmdq locking and
how we unconditionally provide space.
Cheers,
John
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 11:22 ` Robin Murphy
@ 2020-07-16 11:32 ` Will Deacon
-1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 11:32 UTC (permalink / raw)
To: Robin Murphy
Cc: John Garry, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
linuxarm, maz
On Thu, Jul 16, 2020 at 12:22:05PM +0100, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
> > On 16/07/2020 11:28, Will Deacon wrote:
> > > On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
> > > > On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> > > > > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > > > > > As mentioned in [0], the CPU may consume many cycles processing
> > > > > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the
> > > > > > cmpxchg() loop to
> > > > > > get space on the queue takes approx 25% of the cycles
> > > > > > for this function.
> > > > > >
> > > > > > This series removes that cmpxchg().
> > > > >
> > > > > How about something much simpler like the diff below?
> > > > Ah, scratch that, I don't drop the lock if we fail the cas with it held.
> > > > Let me hack it some more (I have no hardware so I can only
> > > > build-test this).
> > >
> > > Right, second attempt...
> >
> > I can try it, but if performance if not as good, then please check mine
> > further (patch 4/4 specifically) - performance is really good, IMHO.
>
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
> back up to speed on this), but couldn't this be done without cmpxchg anyway?
> Instinctively it feels like instead of maintaining a literal software copy
> of the prod value, we could resolve the "claim my slot in the queue" part
> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
> whoever updates the hardware deals with the truncation and wrap bit to
> convert it to an actual register value.
Maybe, but it's easier said than done. The hard part is figuring out that
you have space, and there's no way I'm touching that logic without a way
to test this.
I'm also not keen to complicate the code because of systems that don't scale
well with contended CAS [1]. If you put down loads of cores, you need an
interconnect/coherence protocol that can handle it.
Will
[1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 10:28 ` Will Deacon
@ 2020-07-16 13:31 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 13:31 UTC (permalink / raw)
To: Will Deacon
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu,
linux-kernel, linuxarm, maz
On 16/07/2020 11:28, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>>> get space on the queue takes approx 25% of the cycles for this function.
>>>>
>>>> This series removes that cmpxchg().
>>>
>>> How about something much simpler like the diff below?
>>
>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>> Let me hack it some more (I have no hardware so I can only build-test this).
>
> Right, second attempt...
>
> Will
Unfortunately that hangs my machine during boot:
[10.902893] 00:01: ttyS0 at MMIO 0x3f00003f8 (irq = 6, base_baud = 115200) is a 16550A
[10.912048] SuperH (H)SCI(F) driver initialized
[10.916811] msm_serial: driver initialized
[10.921371] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[10.926946] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00000fef)
[10.935374] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 65536 entries for cmdq
[10.942522] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 32768 entries for evtq
>
> --->8
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677a5c41..e6bcddd6ef69 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
> atomic_long_t *valid_map;
> atomic_t owner_prod;
> atomic_t lock;
> + spinlock_t slock;
> };
>
> struct arm_smmu_cmdq_batch {
> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> u64 cmd_sync[CMDQ_ENT_DWORDS];
> u32 prod;
> unsigned long flags;
> - bool owner;
> + bool owner, locked = false;
> struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> struct arm_smmu_ll_queue llq = {
> .max_n_shift = cmdq->q.llq.max_n_shift,
> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>
> /* 1. Allocate some space in the queue */
> local_irq_save(flags);
> - llq.val = READ_ONCE(cmdq->q.llq.val);
> do {
> u64 old;
> + llq.val = READ_ONCE(cmdq->q.llq.val);
>
> - while (!queue_has_space(&llq, n + sync)) {
> + if (queue_has_space(&llq, n + sync))
> + goto try_cas;
> +
> + if (locked)
> + spin_unlock(&cmdq->slock);
> +
> + do {
> local_irq_restore(flags);
> if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
> dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> local_irq_save(flags);
> - }
> + } while (!queue_has_space(&llq, n + sync));
>
> +try_cas:
> head.cons = llq.cons;
> head.prod = queue_inc_prod_n(&llq, n + sync) |
> CMDQ_PROD_OWNED_FLAG;
>
> old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> - if (old == llq.val)
> + if (old != llq.val)
Not sure why you changed this. And if I change it back, it seems that we
could drop out of the loop with cmdq->slock held, so we need to drop the
lock there as well.
I tried that and it stops my machine hanging. Let me know if that was the
intention, so I can test.
Thanks,
John
> break;
>
> - llq.val = old;
> + if (!locked) {
> + spin_lock(&cmdq->slock);
> + locked = true;
> + }
> } while (1);
> +
> owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
> head.prod &= ~CMDQ_PROD_OWNED_FLAG;
> llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>
> atomic_set(&cmdq->owner_prod, 0);
> atomic_set(&cmdq->lock, 0);
> + spin_lock_init(&cmdq->slock);
>
> bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
> if (!bitmap) {
> .
>
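For clarity, the amendment described above (breaking out only when the CAS
succeeds, and releasing slock on the way out) would make the middle of the
loop look roughly like this. An untested sketch of the change being
discussed, not a posted patch:

        old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
        if (old == llq.val) {
                /* Success: don't carry cmdq->slock out of the loop */
                if (locked)
                        spin_unlock(&cmdq->slock);
                break;
        }

        /* Lost the race: serialise the retries on the spinlock */
        if (!locked) {
                spin_lock(&cmdq->slock);
                locked = true;
        }
} while (1);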
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
2020-07-16 11:32 ` Will Deacon
(?)
@ 2020-07-16 16:50 ` John Garry
-1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 16:50 UTC (permalink / raw)
To: Will Deacon, Robin Murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz
>>
>> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
>> back up to speed on this), but couldn't this be done without cmpxchg anyway?
>> Instinctively it feels like instead of maintaining a literal software copy
>> of the prod value, we could resolve the "claim my slot in the queue" part
>> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
>> whoever updates the hardware deals with the truncation and wrap bit to
>> convert it to an actual register value.
>
> Maybe, but it's easier said than done. The hard part is figuring out that
> you have space, and there's no way I'm touching that logic without a way to
> test this.
>
> I'm also not keen to complicate the code because of systems that don't scale
> well with contended CAS [1]. If you put down loads of cores, you need an
> interconnect/coherence protocol that can handle it.
JFYI, I added some debug to the driver to get the cmpxchg() attempt rate
while running a test harness (below):
cpus   rate
2      1.1
4      1.1
8      1.3
16     3.6
32     8.1
64     12.6
96     14.7
The ideal rate is 1, i.e. every cmpxchg() succeeds on the first attempt. So
it's not crazy high for many CPUs, but it still drifts further from 1 as
the CPU count grows.
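The rate here is cmpxchg() attempts divided by successful claims, so the
instrumentation was presumably a pair of counters along these lines (an
assumption; the actual debug patch isn't posted in this thread):

/* Illustrative debug counters; not part of the posted series */
static atomic64_t cas_attempts;
static atomic64_t cas_wins;

/* ... in the space-allocation loop of arm_smmu_cmdq_issue_cmdlist(): */
        atomic64_inc(&cas_attempts);
        old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
        if (old == llq.val) {
                atomic64_inc(&cas_wins);
                break;
        }

/* rate = cas_attempts / cas_wins; 1.0 means every attempt wins first time */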
John
>
> Will
>
> [1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c
> .
>
/* copied from Barry, thanks :) */
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/semaphore.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/string.h>

static int ways = 64;
module_param(ways, int, S_IRUGO);
static int seconds = 20;
module_param(seconds, int, S_IRUGO);

static int mappings[NR_CPUS];
static struct semaphore sem[NR_CPUS];

#define COMPLETIONS_SIZE 50

static noinline dma_addr_t test_mapsingle(struct device *dev, void *buf,
                                          int size)
{
        return dma_map_single(dev, buf, size, DMA_TO_DEVICE);
}

static noinline void test_unmapsingle(struct device *dev, void *buf,
                                      int size, dma_addr_t dma_addr)
{
        dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
        memcpy(out, in, size);
}

/* just a hack to get a device handle behind an SMMU */
extern struct device *hisi_dev;

static int testthread(void *data)
{
        unsigned long stop = jiffies + seconds * HZ;
        struct device *dev = hisi_dev;
        char *inputs[COMPLETIONS_SIZE] = {};
        char *outputs[COMPLETIONS_SIZE] = {};
        dma_addr_t dma_addr[COMPLETIONS_SIZE];
        int i, ret = 0, cpu = smp_processor_id();
        struct semaphore *sem = data;

        for (i = 0; i < COMPLETIONS_SIZE; i++) {
                inputs[i] = kzalloc(4096, GFP_KERNEL);
                outputs[i] = kzalloc(4096, GFP_KERNEL);
                if (!inputs[i] || !outputs[i]) {
                        ret = -ENOMEM;
                        goto out;
                }
        }

        /* Hammer map+memcpy+unmap until the timer expires */
        while (time_before(jiffies, stop)) {
                for (i = 0; i < COMPLETIONS_SIZE; i++) {
                        dma_addr[i] = test_mapsingle(dev, inputs[i], 4096);
                        test_memcpy(outputs[i], inputs[i], 4096);
                }
                for (i = 0; i < COMPLETIONS_SIZE; i++)
                        test_unmapsingle(dev, inputs[i], 4096, dma_addr[i]);
                mappings[cpu] += COMPLETIONS_SIZE;
        }

out:
        for (i = 0; i < COMPLETIONS_SIZE; i++) {
                kfree(outputs[i]);
                kfree(inputs[i]);
        }
        up(sem);        /* always signal, or smmu_test_core() waits forever */
        return ret;
}

void smmu_test_core(int cpus)
{
        struct task_struct *tsk;
        int i;
        int total_mappings = 0;

        ways = cpus;
        for (i = 0; i < ways; i++) {
                mappings[i] = 0;
                tsk = kthread_create_on_cpu(testthread, &sem[i], i, "map_test");
                if (IS_ERR(tsk)) {
                        printk(KERN_ERR "create test thread failed\n");
                        up(&sem[i]);    /* keep the join loop below balanced */
                        continue;
                }
                wake_up_process(tsk);
        }

        for (i = 0; i < ways; i++) {
                down(&sem[i]);
                total_mappings += mappings[i];
        }

        printk(KERN_ERR "finished total_mappings=%d (per way=%d) (rate=%d per second per cpu) ways=%d\n",
               total_mappings, total_mappings / ways,
               total_mappings / (seconds * ways), ways);
}
EXPORT_SYMBOL(smmu_test_core);

static int __init test_init(void)
{
        int i;

        for (i = 0; i < NR_CPUS; i++)
                sema_init(&sem[i], 0);
        return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
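Since smmu_test_core() is exported, the harness is driven from elsewhere; a
caller might look like this (illustrative, as the actual trigger isn't shown
in the thread):

/* e.g. from another module or a debugfs write handler (illustrative) */
extern void smmu_test_core(int cpus);

        smmu_test_core(32);     /* 32 pinned threads, each mapping for "seconds" */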
^ permalink raw reply [flat|nested] 69+ messages in thread
end of thread, other threads:[~2020-07-16 16:57 UTC | newest]
Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
2020-06-22 17:28 ` [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner John Garry
2020-06-22 17:28 ` [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch John Garry
2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
2020-06-23  1:07 ` kernel test robot
2020-06-23  9:21 ` John Garry
2020-06-23  9:35 ` Rikard Falkeborn
2020-06-23 10:19 ` John Garry
2020-06-23 13:55 ` Rikard Falkeborn
2020-06-26 10:05 ` John Garry
2020-06-23 16:22 ` Robin Murphy
2020-06-24  8:15 ` John Garry
2020-07-16 10:20 ` Will Deacon
2020-07-16 10:26 ` John Garry
2020-07-08 13:00 ` [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-07-16 10:19 ` Will Deacon
2020-07-16 10:22 ` Will Deacon
2020-07-16 10:28 ` Will Deacon
2020-07-16 10:56 ` John Garry
2020-07-16 11:22 ` Robin Murphy
2020-07-16 11:30 ` John Garry
2020-07-16 11:32 ` Will Deacon
2020-07-16 16:50 ` John Garry
2020-07-16 13:31 ` John Garry