[PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
From: John Garry <john.garry@huawei.com>
Date: 2020-06-22 17:28 UTC
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry

As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg()
loop used to get space on the queue takes approx. 25% of the cycles of
this function.

This series removes that cmpxchg().

For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1310 IOPs
After:  1630 IOPs

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count    32      64      128
Before:      63187   19418   10169
After:       93287   44789   15862

(unit is map+unmaps per CPU per second)

[0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (4):
  iommu/arm-smmu-v3: Fix trivial typo
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
 1 file changed, 151 insertions(+), 82 deletions(-)

-- 
2.26.2
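The cover letter's core claim is that a cmpxchg() retry loop for queue-space allocation degrades under contention, while an unconditional atomic add always succeeds in one step. The following is a toy C11 sketch of the two allocation schemes being contrasted; it is not the driver code, and the names (`alloc_space_cmpxchg`, `alloc_space_fetch_add`) are illustrative only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Old scheme: retry with compare-and-swap until we win the race.
 * Under contention every loser must re-read prod and retry, burning
 * cycles; attempts counts how many CAS tries we needed.
 */
static uint32_t alloc_space_cmpxchg(_Atomic uint32_t *prod, uint32_t n,
                                    unsigned *attempts)
{
        uint32_t old = atomic_load_explicit(prod, memory_order_relaxed);

        *attempts = 0;
        do {
                (*attempts)++;
        } while (!atomic_compare_exchange_weak_explicit(prod, &old, old + n,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed));
        return old;     /* start of our slot range */
}

/*
 * New scheme: a single unconditional atomic add always succeeds,
 * regardless of how many CPUs race.
 */
static uint32_t alloc_space_fetch_add(_Atomic uint32_t *prod, uint32_t n)
{
        return atomic_fetch_add_explicit(prod, n, memory_order_relaxed);
}
```

The trade-off (explored in patch 4 of the series) is that the fetch-add variant commits to a prod value before knowing whether the queue has space, so the space check must move to after allocation.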
[PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo
From: John Garry <john.garry@huawei.com>
Date: 2020-06-22 17:28 UTC
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry

Set "cmq" -> "cmdq".

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..a8e814c652fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1479,7 +1479,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	}
 
 	/*
-	 * Try to unlock the cmq lock. This will fail if we're the last
+	 * Try to unlock the cmdq lock. This will fail if we're the last
 	 * reader, in which case we can safely update cmdq->q.llq.cons
 	 */
 	if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-- 
2.26.2
[PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner
From: John Garry <john.garry@huawei.com>
Date: 2020-06-22 17:28 UTC
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry

Since arm_smmu_ll_queue.prod will be used to count both the "owner"
value and the HW prod pointer, calculate how many bits are available
for, and used by, each. This is based on the number of possible CPUs in
the system, and we require that each CPU can issue a minimum of 2
commands per batch: 1 x CMD_SYNC and at least 1 x other.

Ignoring the limits of the HW max_n_shift and the HW cmdq memory
allocation, approx. 32K is the max supported number of CPUs; for this,
max_n_shift would be 15.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 64 ++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index a8e814c652fe..4e9677b066f1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -532,6 +532,8 @@ struct arm_smmu_ll_queue {
 		u8 __pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
 	u32 max_n_shift;
+	u32 max_cmd_per_batch;
+	u32 owner_count_shift;
 };
 
 struct arm_smmu_queue {
@@ -1515,7 +1517,10 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 				    struct arm_smmu_cmdq_batch *cmds,
 				    struct arm_smmu_cmdq_ent *cmd)
 {
-	if (cmds->num == CMDQ_BATCH_ENTRIES) {
+	struct arm_smmu_cmdq *q = &smmu->cmdq;
+	struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+	if (cmds->num == llq->max_cmd_per_batch) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
 		cmds->num = 0;
 	}
@@ -3177,6 +3182,58 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	return 0;
 }
 
+static int arm_smmu_init_cmd_queue(struct arm_smmu_device *smmu,
+				   struct arm_smmu_queue *q,
+				   unsigned long prod_off,
+				   unsigned long cons_off,
+				   size_t dwords)
+{
+	u32 cpus = num_possible_cpus(), bits_for_cmdq_owner,
+	    bits_available_for_prod, entries_for_prod;
+	int ret;
+
+	/*
+	 * We can get the number of bits required for owner counting by
+	 * log2(nr possible cpus) + 1
+	 */
+	bits_for_cmdq_owner = ilog2(cpus) + 1;
+
+	/*
+	 * Add an extra bit to ensure prod(+wrap) do not overflow into
+	 * owner count.
+	 */
+	bits_available_for_prod = 32 - 1 - bits_for_cmdq_owner;
+
+	if (bits_available_for_prod < 1) /* How many CPUs??? */
+		return -ENOMEM;
+
+	q->llq.max_n_shift = min(q->llq.max_n_shift, bits_available_for_prod);
+
+	ret = arm_smmu_init_one_queue(smmu, q, prod_off, cons_off, dwords,
+				      "cmdq");
+	if (ret)
+		return ret;
+
+	entries_for_prod = 1 << q->llq.max_n_shift;
+
+	/*
+	 * We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+	 * whatever else).
+	 */
+	if (entries_for_prod < 2 * cpus)
+		return -ENOMEM;
+
+	/*
+	 * When finding max_cmd_per_batch, deduct 1 entry per batch to take
+	 * account of CMD_SYNC
+	 */
+	q->llq.max_cmd_per_batch = min((entries_for_prod / cpus) - 1,
+				       (u32)CMDQ_BATCH_ENTRIES);
+	q->llq.owner_count_shift = q->llq.max_n_shift + 2;
+
+	return 0;
+}
+
 static void arm_smmu_cmdq_free_bitmap(void *data)
 {
 	unsigned long *bitmap = data;
@@ -3210,9 +3267,8 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	int ret;
 
 	/* cmdq */
-	ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
-				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS,
-				      "cmdq");
+	ret = arm_smmu_init_cmd_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
+				      ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS);
 	if (ret)
 		return ret;
 
-- 
2.26.2
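Patch 2's bit-budget arithmetic can be sketched in userspace C. The following is a simplified mirror of the calculation in arm_smmu_init_cmd_queue() for illustration only: `ilog2_u32`, `min_u32`, and the `CMDQ_BATCH_ENTRIES` value of 64 are assumptions standing in for the kernel's `ilog2()`, `min()`, and driver constant:

```c
#include <assert.h>
#include <stdint.h>

#define CMDQ_BATCH_ENTRIES 64U  /* assumed driver constant */

static uint32_t ilog2_u32(uint32_t v)
{
        uint32_t r = 0;

        while (v >>= 1)
                r++;
        return r;
}

static uint32_t min_u32(uint32_t a, uint32_t b)
{
        return a < b ? a : b;
}

/*
 * Owner counting needs ilog2(cpus) + 1 bits; one further bit is kept
 * spare so prod+wrap cannot overflow into the owner field; whatever is
 * left of the 32-bit word may hold the prod pointer. Returns -1 when
 * the budget cannot be met.
 */
static int cmdq_bit_budget(uint32_t cpus, uint32_t hw_max_n_shift,
                           uint32_t *max_n_shift, uint32_t *max_cmd_per_batch)
{
        uint32_t bits_for_owner = ilog2_u32(cpus) + 1;
        uint32_t bits_for_prod = 32 - 1 - bits_for_owner;
        uint32_t entries;

        if ((int32_t)bits_for_prod < 1)
                return -1;

        *max_n_shift = min_u32(hw_max_n_shift, bits_for_prod);
        entries = 1U << *max_n_shift;

        /* each CPU needs room for a CMD_SYNC plus at least one command */
        if (entries < 2 * cpus)
                return -1;

        /* deduct 1 per batch to account for the CMD_SYNC */
        *max_cmd_per_batch = min_u32(entries / cpus - 1, CMDQ_BATCH_ENTRIES);
        return 0;
}
```

For example, with 96 possible CPUs and a HW max_n_shift of 20, the owner field needs 7 bits, leaving 24 bits available for prod, so the HW shift of 20 is kept and batches stay at the full 64 entries; at 16384 CPUs the prod field shrinks to 16 bits and each CPU is limited to 3 commands per batch.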
[PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
From: John Garry <john.garry@huawei.com>
Date: 2020-06-22 17:28 UTC
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry

To ensure that a CPU does not send more than a permitted number of
commands to the cmdq, ensure that each batch includes a CMD_SYNC. When
issuing a CMD_SYNC, we always wait for the consumption of its batch of
commands - as such, we guarantee that any CPU will not issue more than
its permitted amount.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 86 +++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 47 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4e9677b066f1..45a39ccaf455 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1373,11 +1373,15 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  *   - Command insertion is totally ordered, so if two CPUs each race to
  *     insert their own list of commands then all of the commands from one
  *     CPU will appear before any of the commands from the other CPU.
+ *
+ *   - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
+ *     more than the permitted amount commands at once.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
-				       u64 *cmds, int n, bool sync)
+				       u64 *cmds, int n)
 {
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
+	const int sync = 1;
 	u32 prod;
 	unsigned long flags;
 	bool owner;
@@ -1419,19 +1423,17 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	 * Dependency ordering from the cmpxchg() loop above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
 
-	if (sync) {
-		prod = queue_inc_prod_n(&llq, n);
-		arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
-		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
-
-		/*
-		 * In order to determine completion of our CMD_SYNC, we must
-		 * ensure that the queue can't wrap twice without us noticing.
-		 * We achieve that by taking the cmdq lock as shared before
-		 * marking our slot as valid.
-		 */
-		arm_smmu_cmdq_shared_lock(cmdq);
-	}
+	prod = queue_inc_prod_n(&llq, n);
+	arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
+	queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
+
+	/*
+	 * In order to determine completion of our CMD_SYNC, we must ensure
+	 * that the queue can't wrap twice without us noticing. We achieve that
+	 * by taking the cmdq lock as shared before marking our slot as valid.
+	 */
+	arm_smmu_cmdq_shared_lock(cmdq);
 
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
 	dma_wmb();
@@ -1468,26 +1470,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		atomic_set_release(&cmdq->owner_prod, prod);
 	}
 
-	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
-	if (sync) {
-		llq.prod = queue_inc_prod_n(&llq, n);
-		ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
-		if (ret) {
-			dev_err_ratelimited(smmu->dev,
-					    "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
-					    llq.prod,
-					    readl_relaxed(cmdq->q.prod_reg),
-					    readl_relaxed(cmdq->q.cons_reg));
-		}
+	/* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
+	llq.prod = queue_inc_prod_n(&llq, n);
+	ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
+	if (ret)
+		dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
+				    llq.prod, readl_relaxed(cmdq->q.prod_reg),
+				    readl_relaxed(cmdq->q.cons_reg));
 
-		/*
-		 * Try to unlock the cmdq lock. This will fail if we're the last
-		 * reader, in which case we can safely update cmdq->q.llq.cons
-		 */
-		if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
-			WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
-			arm_smmu_cmdq_shared_unlock(cmdq);
-		}
+	/*
+	 * Try to unlock the cmdq lock. This will fail if we're the last reader,
+	 * in which case we can safely update cmdq->q.llq.cons
+	 */
+	if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
+		WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
+		arm_smmu_cmdq_shared_unlock(cmdq);
 	}
 
 	local_irq_restore(flags);
@@ -1505,12 +1502,7 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
 		return -EINVAL;
 	}
 
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
-}
-
-static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
-{
-	return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
+	return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1);
 }
 
 static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
@@ -1521,7 +1513,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 	struct arm_smmu_ll_queue *llq = &q->q.llq;
 
 	if (cmds->num == llq->max_cmd_per_batch) {
-		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
+		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
 		cmds->num = 0;
 	}
 	arm_smmu_cmdq_build_cmd(&cmds->cmds[cmds->num * CMDQ_ENT_DWORDS], cmd);
@@ -1531,7 +1523,7 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 				      struct arm_smmu_cmdq_batch *cmds)
 {
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, true);
+	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num);
 }
 
 /* Context descriptor manipulation functions */
@@ -1803,7 +1795,6 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	};
 
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -2197,17 +2188,21 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 
 static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
 {
-	int i;
+	int i, ret = 0;
 	struct arm_smmu_cmdq_ent cmd;
 
 	arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
 
 	for (i = 0; i < master->num_sids; i++) {
+		int rc;
+
 		cmd.atc.sid = master->sids[i];
-		arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+		rc = arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+		if (rc)
+			ret = rc;
 	}
 
-	return arm_smmu_cmdq_issue_sync(master->smmu);
+	return ret;
 }
 
 static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
@@ -2280,7 +2275,6 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	 * careful, 007.
 	 */
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
 }
 
@@ -3667,7 +3661,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	/* Invalidate any cached configuration */
 	cmd.opcode = CMDQ_OP_CFGI_ALL;
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 
 	/* Invalidate any stale TLB entries */
 	if (smmu->features & ARM_SMMU_FEAT_HYP) {
@@ -3677,7 +3670,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 	cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
 	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
 
 	/* Event queue */
 	writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
-- 
2.26.2
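The flush-when-full behaviour of arm_smmu_cmdq_batch_add() in patch 3 - where every issued cmdlist now implicitly carries a CMD_SYNC - can be modelled with a toy accumulator. This is a simplified sketch, not the driver code; `MAX_CMD_PER_BATCH` is set artificially small and `issue_cmdlist` merely counts flushes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ENT_DWORDS 2
#define MAX_CMD_PER_BATCH 3     /* assumed tiny value for illustration */

struct batch {
        uint64_t cmds[MAX_CMD_PER_BATCH * ENT_DWORDS];
        int num;
        int flushes;    /* each flush stands for one cmdlist + CMD_SYNC */
};

/* stand-in for arm_smmu_cmdq_issue_cmdlist(): every issue implies a sync */
static void issue_cmdlist(struct batch *b)
{
        b->flushes++;
        b->num = 0;
}

/* mirror of the batch_add flush logic: flush first when the batch is full */
static void batch_add(struct batch *b, const uint64_t cmd[ENT_DWORDS])
{
        if (b->num == MAX_CMD_PER_BATCH)
                issue_cmdlist(b);
        memcpy(&b->cmds[b->num * ENT_DWORDS], cmd,
               ENT_DWORDS * sizeof(uint64_t));
        b->num++;
}
```

Because the batch size is capped at max_cmd_per_batch and every flush waits on its CMD_SYNC, no CPU can have more than one batch's worth of commands outstanding - the invariant patch 4 relies on to bound prod-pointer progression.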
[PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
From: John Garry <john.garry@huawei.com>
Date: 2020-06-22 17:28 UTC
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry

It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- with more CPUs contending the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of 2 related changes:

- Update prod and cmdq owner in a single atomic add operation. For this,
  we count the prod and owner in separate regions of the prod memory.

  As with simple binary counting, once the prod+wrap fields overflow,
  they will wrap to zero. They should never overflow into the "owner"
  region, and we zero the non-owner, prod region for each owner. This
  maintains the prod pointer.

  As for the "owner", we now count this value, instead of setting a
  flag. Similar to before, once the owner has finished gathering, it
  will clear a mask. As such, a CPU declares itself the "owner" when it
  reads zero for this region. This zeroing will also clear possible
  overflow in the wrap+prod region, above.

  The owner is now responsible for all cmdq locking to avoid possible
  deadlock. The owner will lock the cmdq for all non-owners it has
  gathered when they have space in the queue and have written their
  entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the
  prod pointer, as this would be racy. Since the prod pointer is updated
  unconditionally, it would be common for no space to be available in
  the cmdq when prod is assigned - that is, according to the
  software-maintained prod and cons pointers. So it must now be ensured
  that the entries are not written until there is space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect
  this condition, and report no space here. However, a prod pointer that
  has progressed twice past the cons pointer cannot be detected. But it
  can be ensured that this scenario does not occur, as we limit the
  number of commands any CPU can issue at any given time, such that the
  prod pointer cannot progress that far.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 45a39ccaf455..54d056f10bea 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	prod = Q_IDX(q, q->prod);
 	cons = Q_IDX(q, q->cons);
 
-	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+		/* check if we have wrapped, meaning definitely no space */
+		if (cons > prod)
+			return false;
+
 		space = (1 << q->max_n_shift) - (prod - cons);
-	else
+	} else {
+		/* similar check to above */
+		if (prod > cons)
+			return false;
+
 		space = cons - prod;
+	}
 
 	return space >= n;
 }
@@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  * fails if the caller appears to be the last lock holder (yes, this is
  * racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
 	int val;
 
@@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 	 * to INT_MIN so these increments won't hurt as the value will remain
 	 * negative.
 	 */
-	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+	if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
 		return;
 
 	do {
 		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  *     insert their own list of commands then all of the commands from one
  *     CPU will appear before any of the commands from the other CPU.
  *
- *   - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
- *     more than the permitted amount commands at once.
+ *   - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
+ *     for when the cmdq is full, such that we don't wrap more than twice.
+ *     It also makes it easy for the owner to know by how many to increment the
+ *     cmdq lock.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 				       u64 *cmds, int n)
@@ -1388,39 +1399,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
-	}, head = llq;
+	}, head = llq, space = llq;
+	u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
+	u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
+	u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
 	int ret = 0;
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
-	do {
-		u64 old;
 
-		while (!queue_has_space(&llq, n + sync)) {
-			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
-			local_irq_save(flags);
-		}
+	prod = atomic_fetch_add(n + sync + owner_val,
+				&cmdq->q.llq.atomic.prod);
 
-		head.cons = llq.cons;
-		head.prod = queue_inc_prod_n(&llq, n + sync) |
-					     CMDQ_PROD_OWNED_FLAG;
+	owner = !(prod & owner_mask);
+	llq.prod = prod_mask & prod;
+	head.prod = queue_inc_prod_n(&llq, n + sync);
 
-		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
-			break;
+	/*
+	 * Ensure it's safe to write the entries. For this, we need to ensure
+	 * that there is space in the queue from our prod pointer.
+	 */
+	space.cons = READ_ONCE(cmdq->q.llq.cons);
+	space.prod = llq.prod;
+	while (!queue_has_space(&space, n + sync)) {
+		if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
+			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 
-		llq.val = old;
-	} while (1);
-	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
-	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
-	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+		space.prod = llq.prod;
+	}
 
 	/*
 	 * 2. Write our commands into the queue
-	 * Dependency ordering from the cmpxchg() loop above.
+	 * Dependency ordering from the space-checking while loop, above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
 
@@ -1428,26 +1438,24 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
 	queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
 
-	/*
-	 * In order to determine completion of our CMD_SYNC, we must ensure
-	 * that the queue can't wrap twice without us noticing. We achieve that
-	 * by taking the cmdq lock as shared before marking our slot as valid.
-	 */
-	arm_smmu_cmdq_shared_lock(cmdq);
-
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
 	dma_wmb();
 	arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
 
 	/* 4. If we are the owner, take control of the SMMU hardware */
 	if (owner) {
+		int owner_count;
+		u32 prod_tmp;
+
 		/* a. Wait for previous owner to finish */
 		atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
 
-		/* b. Stop gathering work by clearing the owned flag */
-		prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG,
-						   &cmdq->q.llq.atomic.prod);
-		prod &= ~CMDQ_PROD_OWNED_FLAG;
+		/* b. Stop gathering work by clearing the owned mask */
+		prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
+						       &cmdq->q.llq.atomic.prod);
+		prod = prod_tmp & prod_mask;
+		owner_count = prod_tmp & owner_mask;
+		owner_count >>= cmdq->q.llq.owner_count_shift;
 
 		/*
 		 * c. Wait for any gathered work to be written to the queue.
@@ -1456,6 +1464,19 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		 */
 		arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
 
+		/*
+		 * In order to determine completion of the CMD_SYNCs, we must
+		 * ensure that the queue can't wrap twice without us noticing.
+		 * We achieve that by taking the cmdq lock as shared before
+		 * progressing the prod pointer.
+		 * The owner does this for all the non-owners it has gathered.
+		 * Otherwise, some non-owner(s) may lock the cmdq, blocking
+		 * cons being updating. This could be when the cmdq has just
+		 * become full. In this case, other sibling non-owners could be
+		 * blocked from progressing, leading to deadlock.
+		 */
+		arm_smmu_cmdq_shared_lock(cmdq, owner_count);
+
 		/*
 		 * d. Advance the hardware prod pointer
 		 * Control dependency ordering from the entries becoming valid.
-- 
2.26.2
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-22 17:28 ` John Garry (?) @ 2020-06-23 1:07 ` kernel test robot -1 siblings, 0 replies; 69+ messages in thread From: kernel test robot @ 2020-06-23 1:07 UTC (permalink / raw) To: John Garry, will, robin.murphy Cc: kbuild-all, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz, John Garry [-- Attachment #1: Type: text/plain, Size: 8647 bytes --] Hi John, I love your patch! Perhaps something to improve: [auto build test WARNING on iommu/next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next config: arm64-randconfig-c024-20200622 (attached as .config) compiler: aarch64-linux-gcc (GCC) 9.3.0 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>, old ones prefixed by <<): In file included from include/linux/bits.h:23, from include/linux/ioport.h:15, from include/linux/acpi.h:12, from drivers/iommu/arm-smmu-v3.c:12: drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) | ^ include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) | ^ include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) | ^~~~~~~~~~~~~~~~~~~ >> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' 1404 | u32 prod_mask = 
GENMASK(cmdq->q.llq.max_n_shift, 0); | ^~~~~~~ include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) | ^ include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) | ^ include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) | ^~~~~~~~~~~~~~~~~~~ >> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); | ^~~~~~~ vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c 1369 1370 /* 1371 * This is the actual insertion function, and provides the following 1372 * ordering guarantees to callers: 1373 * 1374 * - There is a dma_wmb() before publishing any commands to the queue. 1375 * This can be relied upon to order prior writes to data structures 1376 * in memory (such as a CD or an STE) before the command. 1377 * 1378 * - On completion of a CMD_SYNC, there is a control dependency. 1379 * This can be relied upon to order subsequent writes to memory (e.g. 1380 * freeing an IOVA) after completion of the CMD_SYNC. 1381 * 1382 * - Command insertion is totally ordered, so if two CPUs each race to 1383 * insert their own list of commands then all of the commands from one 1384 * CPU will appear before any of the commands from the other CPU. 1385 * 1386 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer 1387 * for when the cmdq is full, such that we don't wrap more than twice. 1388 * It also makes it easy for the owner to know by how many to increment the 1389 * cmdq lock. 
1390 */ 1391 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, 1392 u64 *cmds, int n) 1393 { 1394 u64 cmd_sync[CMDQ_ENT_DWORDS]; 1395 const int sync = 1; 1396 u32 prod; 1397 unsigned long flags; 1398 bool owner; 1399 struct arm_smmu_cmdq *cmdq = &smmu->cmdq; 1400 struct arm_smmu_ll_queue llq = { 1401 .max_n_shift = cmdq->q.llq.max_n_shift, 1402 }, head = llq, space = llq; 1403 u32 owner_val = 1 << cmdq->q.llq.owner_count_shift; > 1404 u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); 1405 u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift); 1406 int ret = 0; 1407 1408 /* 1. Allocate some space in the queue */ 1409 local_irq_save(flags); 1410 1411 prod = atomic_fetch_add(n + sync + owner_val, 1412 &cmdq->q.llq.atomic.prod); 1413 1414 owner = !(prod & owner_mask); 1415 llq.prod = prod_mask & prod; 1416 head.prod = queue_inc_prod_n(&llq, n + sync); 1417 1418 /* 1419 * Ensure it's safe to write the entries. For this, we need to ensure 1420 * that there is space in the queue from our prod pointer. 1421 */ 1422 space.cons = READ_ONCE(cmdq->q.llq.cons); 1423 space.prod = llq.prod; 1424 while (!queue_has_space(&space, n + sync)) { 1425 if (arm_smmu_cmdq_poll_until_not_full(smmu, &space)) 1426 dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); 1427 1428 space.prod = llq.prod; 1429 } 1430 1431 /* 1432 * 2. Write our commands into the queue 1433 * Dependency ordering from the space-checking while loop, above. 1434 */ 1435 arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n); 1436 1437 prod = queue_inc_prod_n(&llq, n); 1438 arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); 1439 queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); 1440 1441 /* 3. Mark our slots as valid, ensuring commands are visible first */ 1442 dma_wmb(); 1443 arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod); 1444 1445 /* 4. If we are the owner, take control of the SMMU hardware */ 1446 if (owner) { 1447 int owner_count; 1448 u32 prod_tmp; 1449 1450 /* a. 
Wait for previous owner to finish */ 1451 atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod); 1452 1453 /* b. Stop gathering work by clearing the owned mask */ 1454 prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask, 1455 &cmdq->q.llq.atomic.prod); 1456 prod = prod_tmp & prod_mask; 1457 owner_count = prod_tmp & owner_mask; 1458 owner_count >>= cmdq->q.llq.owner_count_shift; 1459 1460 /* 1461 * c. Wait for any gathered work to be written to the queue. 1462 * Note that we read our own entries so that we have the control 1463 * dependency required by (d). 1464 */ 1465 arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod); 1466 1467 /* 1468 * In order to determine completion of the CMD_SYNCs, we must 1469 * ensure that the queue can't wrap twice without us noticing. 1470 * We achieve that by taking the cmdq lock as shared before 1471 * progressing the prod pointer. 1472 * The owner does this for all the non-owners it has gathered. 1473 * Otherwise, some non-owner(s) may lock the cmdq, blocking 1474 * cons being updating. This could be when the cmdq has just 1475 * become full. In this case, other sibling non-owners could be 1476 * blocked from progressing, leading to deadlock. 1477 */ 1478 arm_smmu_cmdq_shared_lock(cmdq, owner_count); 1479 1480 /* 1481 * d. Advance the hardware prod pointer 1482 * Control dependency ordering from the entries becoming valid. 1483 */ 1484 writel_relaxed(prod, cmdq->q.prod_reg); 1485 1486 /* 1487 * e. Tell the next owner we're done 1488 * Make sure we've updated the hardware first, so that we don't 1489 * race to update prod and potentially move it backwards. 1490 */ 1491 atomic_set_release(&cmdq->owner_prod, prod); 1492 } 1493 1494 /* 5. 
Since we always insert a CMD_SYNC, we must wait for it to complete */ 1495 llq.prod = queue_inc_prod_n(&llq, n); 1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq); 1497 if (ret) 1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n", 1499 llq.prod, readl_relaxed(cmdq->q.prod_reg), 1500 readl_relaxed(cmdq->q.cons_reg)); 1501 1502 /* 1503 * Try to unlock the cmdq lock. This will fail if we're the last reader, 1504 * in which case we can safely update cmdq->q.llq.cons 1505 */ 1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { 1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons); 1508 arm_smmu_cmdq_shared_unlock(cmdq); 1509 } 1510 1511 local_irq_restore(flags); 1512 return ret; 1513 } 1514 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 30901 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
Since we always insert a CMD_SYNC, we must wait for it to complete */ 1495 llq.prod = queue_inc_prod_n(&llq, n); 1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq); 1497 if (ret) 1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n", 1499 llq.prod, readl_relaxed(cmdq->q.prod_reg), 1500 readl_relaxed(cmdq->q.cons_reg)); 1501 1502 /* 1503 * Try to unlock the cmdq lock. This will fail if we're the last reader, 1504 * in which case we can safely update cmdq->q.llq.cons 1505 */ 1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { 1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons); 1508 arm_smmu_cmdq_shared_unlock(cmdq); 1509 } 1510 1511 local_irq_restore(flags); 1512 return ret; 1513 } 1514 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 30901 bytes --] [-- Attachment #3: Type: text/plain, Size: 156 bytes --] _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-23 1:07 ` kernel test robot (?) @ 2020-06-23 9:21 ` John Garry -1 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-06-23 9:21 UTC (permalink / raw) To: kernel test robot, will, robin.murphy, rikard.falkeborn Cc: kbuild-all, joro, trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz On 23/06/2020 02:07, kernel test robot wrote: + Rikard, as the GENMASK compile-time sanity checks were added recently > Hi John, > > I love your patch! Perhaps something to improve: > > [auto build test WARNING on iommu/next] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use as documented in > https://git-scm.com/docs/git-format-patch] > > url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 > base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next > config: arm64-randconfig-c024-20200622 (attached as .config) > compiler: aarch64-linux-gcc (GCC) 9.3.0 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot <lkp@intel.com> > > All warnings (new ones prefixed by >>, old ones prefixed by <<): > > In file included from include/linux/bits.h:23, > from include/linux/ioport.h:15, > from include/linux/acpi.h:12, > from drivers/iommu/arm-smmu-v3.c:12: > drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': > include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and h=unsigned value, so I doubt this warn. Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks like GENMASK_INPUT_CHECK() could be improved. 
> | ^ > include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) > | ^ > include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > | ^~~~~~~~~~~~~~~~~~~ >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > | ^~~~~~~ > include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) > | ^ > include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) > | ^ > include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > | ^~~~~~~~~~~~~~~~~~~ >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > | ^~~~~~~ > > vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-23 9:21 ` John Garry (?) (?) @ 2020-06-23 9:35 ` Rikard Falkeborn 2020-06-23 10:19 ` John Garry -1 siblings, 1 reply; 69+ messages in thread From: Rikard Falkeborn @ 2020-06-23 9:35 UTC (permalink / raw) To: John Garry Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu, maz, kbuild-all, robin.murphy, linux-arm-kernel [-- Attachment #1.1: Type: text/plain, Size: 3214 bytes --] Hi Den tis 23 juni 2020 11:22John Garry <john.garry@huawei.com> skrev: > On 23/06/2020 02:07, kernel test robot wrote: > > + Rikard, as the GENMASK compile-time sanity checks were added recently > > > Hi John, > > > > I love your patch! Perhaps something to improve: > > > > [auto build test WARNING on iommu/next] > > [If your patch is applied to the wrong git tree, kindly drop us a note. > > And when submitting patch, we suggest to use as documented in > > https://git-scm.com/docs/git-format-patch] > > > > url: > https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 > > base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git > next > > config: arm64-randconfig-c024-20200622 (attached as .config) > > compiler: aarch64-linux-gcc (GCC) 9.3.0 > > > > If you fix the issue, kindly add following tag as appropriate > > Reported-by: kernel test robot <lkp@intel.com> > > > > All warnings (new ones prefixed by >>, old ones prefixed by <<): > > > > In file included from include/linux/bits.h:23, > > from include/linux/ioport.h:15, > > from include/linux/acpi.h:12, > > from drivers/iommu/arm-smmu-v3.c:12: > > drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': > > include/linux/bits.h:26:28: warning: comparison of unsigned expression < > 0 is always false [-Wtype-limits] > > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and > 
h=unsigned value, so I doubt this warn. > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks > like GENMASK_INPUT_CHECK() could be improved. > Indeed it could, it is fixed in -next. Rikard > | ^ > > include/linux/build_bug.h:16:62: note: in definition of macro > 'BUILD_BUG_ON_ZERO' > > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); > }))) > > | ^ > > include/linux/bits.h:39:3: note: in expansion of macro > 'GENMASK_INPUT_CHECK' > > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > > | ^~~~~~~~~~~~~~~~~~~ > >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro > 'GENMASK' > > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > > | ^~~~~~~ > > include/linux/bits.h:26:40: warning: comparison of unsigned expression < > 0 is always false [-Wtype-limits] > > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) > > | ^ > > include/linux/build_bug.h:16:62: note: in definition of macro > 'BUILD_BUG_ON_ZERO' > > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); > }))) > > | ^ > > include/linux/bits.h:39:3: note: in expansion of macro > 'GENMASK_INPUT_CHECK' > > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > > | ^~~~~~~~~~~~~~~~~~~ > >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro > 'GENMASK' > > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > > | ^~~~~~~ > > > > vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c > > > [-- Attachment #1.2: Type: text/html, Size: 4817 bytes --] [-- Attachment #2: Type: text/plain, Size: 156 bytes --] _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-23 9:35 ` Rikard Falkeborn 2020-06-23 10:19 ` John Garry @ 2020-06-23 10:19 ` John Garry 0 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-06-23 10:19 UTC (permalink / raw) To: Rikard Falkeborn, joro Cc: kernel test robot, will, robin.murphy, kbuild-all, trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz On 23/06/2020 10:35, Rikard Falkeborn wrote: > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and > h=unsigned value, so I doubt this warn. > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it > looks > like GENMASK_INPUT_CHECK() could be improved. > > > Indeed it could, it is fixed in -next. ok, thanks for the pointer, but I still see this on today's -next with this patch: make W=1 drivers/iommu/arm-smmu-v3.o In file included from ./include/linux/bits.h:23:0, from ./include/linux/ioport.h:15, from ./include/linux/acpi.h:12, from drivers/iommu/arm-smmu-v3.c:12: drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’: ./include/linux/bits.h:27:7: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] (l) > (h), 0))) ^ ./include/linux/build_bug.h:16:62: note: in definition of macro ‘BUILD_BUG_ON_ZERO’ #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) ^ ./include/linux/bits.h:40:3: note: in expansion of macro ‘GENMASK_INPUT_CHECK’ (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) ^~~~~~~~~~~~~~~~~~~ drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’ u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); That's gcc 7.5.0 . Cheers, John ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-23 10:19 ` John Garry (?) (?) @ 2020-06-23 13:55 ` Rikard Falkeborn 2020-06-26 10:05 ` John Garry -1 siblings, 1 reply; 69+ messages in thread From: Rikard Falkeborn @ 2020-06-23 13:55 UTC (permalink / raw) To: John Garry Cc: trivial, kernel test robot, will, linux-kernel, Linuxarm, iommu, maz, kbuild-all, robin.murphy, linux-arm-kernel [-- Attachment #1.1: Type: text/plain, Size: 1974 bytes --] Den tis 23 juni 2020 12:21John Garry <john.garry@huawei.com> skrev: > On 23/06/2020 10:35, Rikard Falkeborn wrote: > > > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and > > h=unsigned value, so I doubt this warn. > > > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it > > looks > > like GENMASK_INPUT_CHECK() could be improved. > > > > > > Indeed it could, it is fixed in -next. > > ok, thanks for the pointer, but I still see this on today's -next with > this patch: > > make W=1 drivers/iommu/arm-smmu-v3.o > > Oh, ok thanks for reporting. I guess different gcc versions have different behaviour. I guess we'll have to change the comparison to (!((h) == (l) || (h) > (l))) instead (not sure I got all parenthesis and logic correct but you get the idea). I'm travelling and wont have time to look at this until next week though. 
Rikard In file included from ./include/linux/bits.h:23:0, > from ./include/linux/ioport.h:15, > from ./include/linux/acpi.h:12, > from drivers/iommu/arm-smmu-v3.c:12: > drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’: > ./include/linux/bits.h:27:7: warning: comparison of unsigned expression > < 0 is always false [-Wtype-limits] > (l) > (h), 0))) > ^ > ./include/linux/build_bug.h:16:62: note: in definition of macro > ‘BUILD_BUG_ON_ZERO’ > #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) > ^ > ./include/linux/bits.h:40:3: note: in expansion of macro > ‘GENMASK_INPUT_CHECK’ > (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > ^~~~~~~~~~~~~~~~~~~ > drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’ > u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > > That's gcc 7.5.0 . > > Cheers, > John > > [-- Attachment #1.2: Type: text/html, Size: 2883 bytes --] [-- Attachment #2: Type: text/plain, Size: 156 bytes --] _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() 2020-06-23 13:55 ` Rikard Falkeborn 2020-06-26 10:05 ` John Garry (?) @ 2020-06-26 10:05 ` John Garry 0 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-06-26 10:05 UTC (permalink / raw) To: Rikard Falkeborn Cc: Joerg Roedel, kernel test robot, will, robin.murphy, kbuild-all, trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz On 23/06/2020 14:55, Rikard Falkeborn wrote: > Den tis 23 juni 2020 12:21John Garry <john.garry@huawei.com > <mailto:john.garry@huawei.com>> skrev: > > On 23/06/2020 10:35, Rikard Falkeborn wrote: > > > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a > l=0 and > > h=unsigned value, so I doubt this warn. > > > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, > but it > > looks > > like GENMASK_INPUT_CHECK() could be improved. > > > > > > Indeed it could, it is fixed in -next. > > ok, thanks for the pointer, but I still see this on today's -next with > this patch: > > make W=1 drivers/iommu/arm-smmu-v3.o > > > Oh, ok thanks for reporting. I guess different gcc versions have > different behaviour. I guess we'll have to change the comparison to > (!((h) == (l) || (h) > (l))) instead (not sure I got all parenthesis and > logic correct but you get the idea). > Yeah, so this looks to fix it: --- a/include/linux/bits.h +++ b/include/linux/bits.h @@ -23,7 +23,8 @@ #include <linux/build_bug.h> #define GENMASK_INPUT_CHECK(h, l) \ (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \ - __builtin_constant_p((l) > (h)), (l) > (h), 0))) + __builtin_constant_p(!((h) == (l) ||(h) > (l))), !((h) == (l) ||(h) > (l)), 0))) + We may be able to just use (h) == (l) as the const expr to make it more concise, but that may be confusing. I only tested with my toolchain based on 7.5.0 Thanks, John ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23  9:21 ` John Garry
@ 2020-06-23 16:22 ` Robin Murphy
  0 siblings, 0 replies; 69+ messages in thread
From: Robin Murphy @ 2020-06-23 16:22 UTC (permalink / raw)
To: John Garry, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all, linux-arm-kernel

On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
>
> + Rikard, as the GENMASK compile-time sanity checks were added recently
>
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>>    In file included from include/linux/bits.h:23,
>>                     from include/linux/ioport.h:15,
>>                     from include/linux/acpi.h:12,
>>                     from drivers/iommu/arm-smmu-v3.c:12:
>>    drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>>    include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
>>       26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
>
> I'd say that GENMASK_INPUT_CHECK() should be able to handle l=0 and
> h=<unsigned value>, so I doubt this warning.
>
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
> like GENMASK_INPUT_CHECK() could be improved.

That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...

Robin.

----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations

Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:

add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function                                     old     new   delta
arm_smmu_init_one_queue                      360     364      +4
arm_smmu_priq_thread                         512     508      -4
arm_smmu_evtq_thread                         300     292      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
queue_remove_raw                             180     164     -16
arm_smmu_gerror_handler                      732     716     -16
arm_smmu_device_probe                       4312    4284     -28
arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
Total: Before=20135, After=20019, chg -0.58%

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
 #define ARM_SMMU_MEMATTR_DEVICE_nGnRE	0x1
 #define ARM_SMMU_MEMATTR_OIWB		0xf
 
-#define Q_IDX(llq, p)			((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p)			((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p)			((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p)			((p) & (llq)->max_n)
 #define Q_OVERFLOW_FLAG			(1U << 31)
 #define Q_OVF(p)			((p) & Q_OVERFLOW_FLAG)
 #define Q_ENT(q, p)			((q)->base +			\
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
 		} atomic;
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
-	u32				max_n_shift;
+	u32				max_n;
 };
 
 struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	cons = Q_IDX(q, q->cons);
 
 	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
-		space = (1 << q->max_n_shift) - (prod - cons);
+		space = q->max_n - (prod - cons);
 	else
 		space = cons - prod;
 
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
 {
 	u32 swidx, sbidx, ewidx, ebidx;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= sprod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= sprod,
 	};
 
 	ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
 {
 	int i;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= prod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= prod,
 	};
 
 	for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	bool owner;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift = cmdq->q.llq.max_n_shift,
+		.max_n = cmdq->q.llq.max_n,
 	}, head = llq;
 	int ret = 0;
 
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	size_t qsz;
 
 	do {
-		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+		qsz = (q->llq.max_n * dwords) << 3;
 		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
 					      GFP_KERNEL);
 		if (q->base || qsz < PAGE_SIZE)
 			break;
 
-		q->llq.max_n_shift--;
+		q->llq.max_n >>= 1;
 	} while (1);
 
 	if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	if (!WARN_ON(q->base_dma & (qsz - 1))) {
 		dev_info(smmu->dev, "allocated %u entries for %s\n",
-			 1 << q->llq.max_n_shift, name);
+			 q->llq.max_n, name);
 	}
 
 	q->prod_reg	= arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	q->q_base  = Q_BASE_RWA;
 	q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
-	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
 
 	q->llq.prod = q->llq.cons = 0;
 	return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 {
 	int ret = 0;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
-	unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
 	atomic_long_t *bitmap;
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
 
-	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+	bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
 	if (!bitmap) {
 		dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
 		ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
-	u32 reg;
+	u32 reg, max_n_shift;
 	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
 
 	/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	}
 
 	/* Queue sizes, capped to ensure natural alignment */
-	smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_CMDQS, reg));
-	if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+	max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+	smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+	if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
 		/*
 		 * We don't support splitting up batches, so one batch of
 		 * commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		return -ENXIO;
 	}
 
-	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_EVTQS, reg));
-	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_PRIQS, reg));
+	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+	smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+	smmu->priq.q.llq.max_n = 1 << max_n_shift;
 
 	/* SID/SSID sizes */
 	smmu->ssid_bits	= FIELD_GET(IDR1_SSIDSIZE, reg);
-- 
2.23.0.dirty

^ permalink raw reply related	[flat|nested] 69+ messages in thread
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-23 16:22 ` Robin Murphy
  (?)
  (?)
@ 2020-06-24  8:15 ` John Garry
  -1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-06-24  8:15 UTC (permalink / raw)
To: Robin Murphy, kernel test robot, will, rikard.falkeborn
Cc: trivial, maz, linux-kernel, Linuxarm, iommu, kbuild-all,
	linux-arm-kernel

>>
>> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
>> h=unsigned value, so I doubt this warn.
>>
>> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
>> like GENMASK_INPUT_CHECK() could be improved.
>
> That said, I think this particular case might be even better off dodging
> GENMASK() entirely, by doing something like this first. Untested...
>
> Robin.
>
> ----->8-----
> Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
>
> Beyond the initial queue setup based on the log2 values from ID
> registers, the log2 queue size is only ever used in the form of
> (1 << max_n_shift) to repeatedly recalculate the number of queue
> elements. Simply storing it in that form leads to slightly more
> efficient code, particularly in the low-level queue accessors
> where it counts most:
>
> add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
> Function                                     old     new   delta
> arm_smmu_init_one_queue                      360     364      +4
> arm_smmu_priq_thread                         512     508      -4
> arm_smmu_evtq_thread                         300     292      -8
> __arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
> queue_remove_raw                             180     164     -16
> arm_smmu_gerror_handler                      732     716     -16
> arm_smmu_device_probe                       4312    4284     -28
> arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
> Total: Before=20135, After=20019, chg -0.58%
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---

[...]

> 	}
>
> -	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_EVTQS, reg));
> -	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_PRIQS, reg));
> +	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
> +	smmu->evtq.q.llq.max_n = 1 << max_n_shift;

So I require the bitmask of this for the prod, which would be
(max_n << 1) - 1.

I don't feel too strongly either way, and the other big changes in this
series need to be considered first...

Thanks,
John

> +
> +	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
> +	smmu->priq.q.llq.max_n = 1 << max_n_shift;
>
> 	/* SID/SSID sizes */
> 	smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
>

^ permalink raw reply	[flat|nested] 69+ messages in thread
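[Editorial sketch] John's point about the prod bitmask can be illustrated with a small standalone sketch. The helper names are illustrative only, not the driver's API: for a queue of 2^max_n_shift entries, the prod value carries the index bits plus one wrap bit, so storing max_n directly makes the mask (max_n << 1) - 1 rather than max_n - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers only -- not the driver's API. */
static uint32_t queue_max_n(uint32_t max_n_shift)
{
	/* Number of entries for a queue of 2^max_n_shift slots. */
	return UINT32_C(1) << max_n_shift;
}

static uint32_t prod_mask(uint32_t max_n)
{
	/* Index bits plus the single wrap bit above them. */
	return (max_n << 1) - 1;
}
```

So with max_n stored, the driver would still recover the prod mask in one shift and subtract, which is the trade-off John weighs above.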
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-06-22 17:28 ` John Garry
  (?)
@ 2020-07-16 10:20 ` Will Deacon
  -1 siblings, 0 replies; 69+ messages in thread
From: Will Deacon @ 2020-07-16 10:20 UTC (permalink / raw)
To: John Garry
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
	linuxarm, maz

On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
> It has been shown that the cmpxchg() for finding space in the cmdq can
> be a bottleneck:
> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
> - since the software-maintained cons pointer is updated on the same 64b
>   memory region, the chance of cmpxchg() failure increases again
>
> The cmpxchg() is removed as part of 2 related changes:
>
> - Update prod and cmdq owner in a single atomic add operation. For this,
>   we count the prod and owner in separate regions of the prod memory.
>
>   As with simple binary counting, once the prod+wrap fields overflow, they
>   will zero. They should never overflow into the "owner" region, and we
>   zero the non-owner, prod region for each owner. This maintains the prod
>   pointer.
>
>   As for the "owner", we now count this value instead of setting a flag.
>   Similar to before, once the owner has finished gathering, it will clear
>   a mask. As such, a CPU declares itself as the "owner" when it reads zero
>   for this region. This zeroing will also clear possible overflow in the
>   wrap+prod region, above.
>
>   The owner is now responsible for all cmdq locking to avoid possible
>   deadlock. The owner will lock the cmdq for all non-owners it has
>   gathered when they have space in the queue and have written their
>   entries.
>
> - Check for space in the cmdq after the prod pointer has been assigned.
>
>   We don't bother checking for space in the cmdq before assigning the
>   prod pointer, as this would be racy.
>
>   So since the prod pointer is updated unconditionally, it would be common
>   for no space to be available in the cmdq when prod is assigned - that
>   is, according to the software-maintained prod and cons pointers. So it
>   must now be ensured that the entries are not written until there is
>   space.
>
>   How the prod pointer is maintained also leads to a strange condition
>   where the prod pointer can wrap past the cons pointer. We can detect
>   this condition, and report no space here. However, a prod pointer
>   progressed twice past the cons pointer cannot be detected. But it can
>   be ensured that this scenario does not occur, as we limit the amount of
>   commands any CPU can issue at any given time, such that we cannot
>   progress the prod pointer further.
>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>  1 file changed, 61 insertions(+), 40 deletions(-)

I must admit, you made me smile putting trivial@kernel.org on cc for this ;)

Will

^ permalink raw reply	[flat|nested] 69+ messages in thread
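[Editorial sketch] The single-atomic-add scheme described in the commit message above can be sketched with C11 atomics. The field widths, constants, and helper names here are hypothetical — the real driver derives the prod/wrap widths from the queue size and clears the prod overflow together with the owner field — but the core idea is the same: one fetch-add both reserves queue slots and joins the owner count, and reading a zero owner field makes that CPU the batch owner.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits hold prod+wrap, the bits above them
 * count gathering CPUs ("owners"). Illustrative values only. */
#define PROD_BITS	16
#define PROD_MASK	((UINT32_C(1) << PROD_BITS) - 1)
#define OWNER_UNIT	(UINT32_C(1) << PROD_BITS)
#define OWNER_MASK	(~PROD_MASK)

static atomic_uint cmdq_prod;

/* Reserve n slots and join the owner count in one atomic add. The CPU
 * that reads a zero owner field declares itself the batch owner. */
static uint32_t cmdq_reserve(unsigned int n, int *owner)
{
	uint32_t old = atomic_fetch_add(&cmdq_prod, n + OWNER_UNIT);

	*owner = !(old & OWNER_MASK);
	return old & PROD_MASK;
}

/* Once the owner has finished gathering, it clears the owner field (a
 * simplification of the mask-clearing described above), letting a later
 * CPU read zero and become the next owner. */
static void cmdq_owner_release(void)
{
	atomic_fetch_and(&cmdq_prod, PROD_MASK);
}
```

A CPU calling cmdq_reserve() while another owner is still gathering sees a non-zero owner field and simply joins that owner's batch — no retry loop is needed.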
* Re: [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
  2020-07-16 10:20 ` Will Deacon
  (?)
@ 2020-07-16 10:26 ` John Garry
  -1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 10:26 UTC (permalink / raw)
To: Will Deacon
Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
	linuxarm, maz

On 16/07/2020 11:20, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
>> It has been shown that the cmpxchg() for finding space in the cmdq can
>> be a bottleneck:
>> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
>> - since the software-maintained cons pointer is updated on the same 64b
>>   memory region, the chance of cmpxchg() failure increases again
>>
>> The cmpxchg() is removed as part of 2 related changes:
>>
>> - Update prod and cmdq owner in a single atomic add operation. For this,
>>   we count the prod and owner in separate regions of the prod memory.
>>
>>   As with simple binary counting, once the prod+wrap fields overflow,
>>   they will zero. They should never overflow into the "owner" region,
>>   and we zero the non-owner, prod region for each owner. This maintains
>>   the prod pointer.
>>
>>   As for the "owner", we now count this value instead of setting a flag.
>>   Similar to before, once the owner has finished gathering, it will
>>   clear a mask. As such, a CPU declares itself as the "owner" when it
>>   reads zero for this region. This zeroing will also clear possible
>>   overflow in the wrap+prod region, above.
>>
>>   The owner is now responsible for all cmdq locking to avoid possible
>>   deadlock. The owner will lock the cmdq for all non-owners it has
>>   gathered when they have space in the queue and have written their
>>   entries.
>>
>> - Check for space in the cmdq after the prod pointer has been assigned.
>>
>>   We don't bother checking for space in the cmdq before assigning the
>>   prod pointer, as this would be racy.
>>
>>   So since the prod pointer is updated unconditionally, it would be
>>   common for no space to be available in the cmdq when prod is
>>   assigned - that is, according to the software-maintained prod and cons
>>   pointers. So it must now be ensured that the entries are not written
>>   until there is space.
>>
>>   How the prod pointer is maintained also leads to a strange condition
>>   where the prod pointer can wrap past the cons pointer. We can detect
>>   this condition, and report no space here. However, a prod pointer
>>   progressed twice past the cons pointer cannot be detected. But it can
>>   be ensured that this scenario does not occur, as we limit the amount
>>   of commands any CPU can issue at any given time, such that we cannot
>>   progress the prod pointer further.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>>  1 file changed, 61 insertions(+), 40 deletions(-)
>
> I must admit, you made me smile putting trivial@kernel.org on cc for this ;)
>

Yes, quite ironic :)

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-06-22 17:28 ` John Garry
  (?)
@ 2020-07-08 13:00 ` John Garry
  -1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-08 13:00 UTC (permalink / raw)
To: will, robin.murphy
Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz

On 22/06/2020 18:28, John Garry wrote:

Hi,

Can you guys let me know if this is on the radar at all?

I have been talking about this performance issue since Jan, and not
getting anything really.

thanks

> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes approx 25% of the cycles for this function.
>
> This series removes that cmpxchg().
>
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1310 IOPs
> After: 1630 IOPs
>
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
>
> CPU count	32	64	128
> Before:	63187	19418	10169
> After:	93287	44789	15862
>
> (unit is map+unmaps per CPU per second)
>
> [0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
>
> John Garry (4):
>   iommu/arm-smmu-v3: Fix trivial typo
>   iommu/arm-smmu-v3: Calculate bits for prod and owner
>   iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
>
>  drivers/iommu/arm-smmu-v3.c | 233 +++++++++++++++++++++++-------------
>  1 file changed, 151 insertions(+), 82 deletions(-)
>

^ permalink raw reply	[flat|nested] 69+ messages in thread
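[Editorial sketch] The contended cmpxchg() loop the cover letter refers to can be reduced, very roughly, to the following user-space sketch. This is a hypothetical illustration, not the driver's code: the space check against the cons pointer is elided, and only the retry behaviour that hurts under contention is shown — each failing compare-exchange forces a re-read and another attempt, so the failure rate grows with the number of contending CPUs.

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint llq_prod;

/* Reserve n_ents slots by advancing prod with a compare-exchange loop.
 * On failure, atomic_compare_exchange_weak() refreshes 'old' with the
 * current value, and the loop retries -- this retrying is the contended
 * path the series removes. */
static uint32_t reserve_space(unsigned int n_ents)
{
	uint32_t old = atomic_load(&llq_prod);

	while (!atomic_compare_exchange_weak(&llq_prod, &old,
					     old + n_ents))
		;	/* contention: retry with the re-read 'old' */
	return old;
}
```

With many CPUs calling reserve_space() at once, only one compare-exchange can win per round; the rest burn cycles re-reading and retrying, which matches the ~25% profile cost quoted above.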
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-06-22 17:28 ` John Garry (?) @ 2020-07-16 10:19 ` Will Deacon -1 siblings, 0 replies; 69+ messages in thread From: Will Deacon @ 2020-07-16 10:19 UTC (permalink / raw) To: John Garry Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote: > As mentioned in [0], the CPU may consume many cycles processing > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to > get space on the queue takes approx 25% of the cycles for this function. > > This series removes that cmpxchg(). How about something much simpler like the diff below? Will --->8 diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f578677a5c41..705d99ec8219 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -560,6 +560,7 @@ struct arm_smmu_cmdq { atomic_long_t *valid_map; atomic_t owner_prod; atomic_t lock; + spinlock_t slock; }; struct arm_smmu_cmdq_batch { @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, u64 cmd_sync[CMDQ_ENT_DWORDS]; u32 prod; unsigned long flags; - bool owner; + bool owner, cas_failed = false; struct arm_smmu_cmdq *cmdq = &smmu->cmdq; struct arm_smmu_ll_queue llq = { .max_n_shift = cmdq->q.llq.max_n_shift, @@ -1387,27 +1388,35 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, /* 1. 
Allocate some space in the queue */ local_irq_save(flags); - llq.val = READ_ONCE(cmdq->q.llq.val); do { u64 old; + llq.val = READ_ONCE(cmdq->q.llq.val); - while (!queue_has_space(&llq, n + sync)) { + if (queue_has_space(&llq, n + sync)) + goto try_cas; + + if (cas_failed) + spin_unlock(&cmdq->slock); + + do { local_irq_restore(flags); if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq)) dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); local_irq_save(flags); - } + } while (!queue_has_space(&llq, n + sync)); +try_cas: head.cons = llq.cons; head.prod = queue_inc_prod_n(&llq, n + sync) | CMDQ_PROD_OWNED_FLAG; old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val); - if (old == llq.val) - break; + cas_failed = old != llq.val; + + if (cas_failed) + spin_lock(&cmdq->slock); + } while (cas_failed); - llq.val = old; - } while (1); owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG); head.prod &= ~CMDQ_PROD_OWNED_FLAG; llq.prod &= ~CMDQ_PROD_OWNED_FLAG; @@ -3192,6 +3201,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu) atomic_set(&cmdq->owner_prod, 0); atomic_set(&cmdq->lock, 0); + spin_lock_init(&cmdq->slock); bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL); if (!bitmap) { ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 10:19 ` Will Deacon (?) @ 2020-07-16 10:22 ` Will Deacon -1 siblings, 0 replies; 69+ messages in thread From: Will Deacon @ 2020-07-16 10:22 UTC (permalink / raw) To: John Garry Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote: > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote: > > As mentioned in [0], the CPU may consume many cycles processing > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to > > get space on the queue takes approx 25% of the cycles for this function. > > > > This series removes that cmpxchg(). > > How about something much simpler like the diff below? Ah, scratch that, I don't drop the lock if we fail the cas with it held. Let me hack it some more (I have no hardware so I can only build-test this). Will ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 10:22 ` Will Deacon (?) @ 2020-07-16 10:28 ` Will Deacon -1 siblings, 0 replies; 69+ messages in thread From: Will Deacon @ 2020-07-16 10:28 UTC (permalink / raw) To: John Garry Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote: > On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote: > > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote: > > > As mentioned in [0], the CPU may consume many cycles processing > > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to > > > get space on the queue takes approx 25% of the cycles for this function. > > > > > > This series removes that cmpxchg(). > > > > How about something much simpler like the diff below? > > Ah, scratch that, I don't drop the lock if we fail the cas with it held. > Let me hack it some more (I have no hardware so I can only build-test this). Right, second attempt... Will --->8 diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f578677a5c41..e6bcddd6ef69 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -560,6 +560,7 @@ struct arm_smmu_cmdq { atomic_long_t *valid_map; atomic_t owner_prod; atomic_t lock; + spinlock_t slock; }; struct arm_smmu_cmdq_batch { @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, u64 cmd_sync[CMDQ_ENT_DWORDS]; u32 prod; unsigned long flags; - bool owner; + bool owner, locked = false; struct arm_smmu_cmdq *cmdq = &smmu->cmdq; struct arm_smmu_ll_queue llq = { .max_n_shift = cmdq->q.llq.max_n_shift, @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, /* 1. 
Allocate some space in the queue */ local_irq_save(flags); - llq.val = READ_ONCE(cmdq->q.llq.val); do { u64 old; + llq.val = READ_ONCE(cmdq->q.llq.val); - while (!queue_has_space(&llq, n + sync)) { + if (queue_has_space(&llq, n + sync)) + goto try_cas; + + if (locked) + spin_unlock(&cmdq->slock); + + do { local_irq_restore(flags); if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq)) dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); local_irq_save(flags); - } + } while (!queue_has_space(&llq, n + sync)); +try_cas: head.cons = llq.cons; head.prod = queue_inc_prod_n(&llq, n + sync) | CMDQ_PROD_OWNED_FLAG; old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val); - if (old == llq.val) + if (old != llq.val) break; - llq.val = old; + if (!locked) { + spin_lock(&cmdq->slock); + locked = true; + } } while (1); + owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG); head.prod &= ~CMDQ_PROD_OWNED_FLAG; llq.prod &= ~CMDQ_PROD_OWNED_FLAG; @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu) atomic_set(&cmdq->owner_prod, 0); atomic_set(&cmdq->lock, 0); + spin_lock_init(&cmdq->slock); bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL); if (!bitmap) { ^ permalink raw reply related [flat|nested] 69+ messages in thread
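The intent of Will's diff (which he notes is build-tested only, and which the rest of the thread picks apart) can be modelled in isolation: keep the single cmpxchg() fast path, but after the first failure make the loser take a spinlock, so contending CPUs queue up behind one retrier instead of all hammering the CAS. The sketch below is a standalone C11 model of that intent, not the driver code — names are mine, the space check against cons is elided, and unlike the diff it releases the lock on exit:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

static _Atomic uint64_t llq;                 /* packed prod/cons word */
static atomic_flag slock = ATOMIC_FLAG_INIT; /* stand-in for cmdq->slock */

static void spin_lock(void)   { while (atomic_flag_test_and_set(&slock)) ; }
static void spin_unlock(void) { atomic_flag_clear(&slock); }

/* Claim n slots; returns the old packed value (old prod in high bits). */
static uint64_t claim(uint32_t n)
{
	bool locked = false;
	uint64_t cur, next;

	for (;;) {
		cur  = atomic_load_explicit(&llq, memory_order_relaxed);
		next = cur + ((uint64_t)n << 32);    /* advance the prod half */

		if (atomic_compare_exchange_strong_explicit(&llq, &cur, next,
				memory_order_relaxed, memory_order_relaxed))
			break;

		/* First failure: fall back to the lock, so at most one
		 * previously-failed CPU retries the CAS at a time. The lock
		 * holder can still lose to a lock-free first attempt. */
		if (!locked) {
			spin_lock();
			locked = true;
		}
	}
	if (locked)
		spin_unlock();
	return cur;
}
```

Note the trade-off this models: the uncontended path is still a single CAS, but fairness among losers now comes from the spinlock rather than from CAS retry luck.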
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 10:28 ` Will Deacon (?) @ 2020-07-16 10:56 ` John Garry -1 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-07-16 10:56 UTC (permalink / raw) To: Will Deacon Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz On 16/07/2020 11:28, Will Deacon wrote: > On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote: >> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote: >>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote: >>>> As mentioned in [0], the CPU may consume many cycles processing >>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to >>>> get space on the queue takes approx 25% of the cycles for this function. >>>> >>>> This series removes that cmpxchg(). >>> >>> How about something much simpler like the diff below? >> >> Ah, scratch that, I don't drop the lock if we fail the cas with it held. >> Let me hack it some more (I have no hardware so I can only build-test this). > > Right, second attempt... I can try it, but if performance if not as good, then please check mine further (patch 4/4 specifically) - performance is really good, IMHO. 
Thanks, > > Will > > --->8 > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > index f578677a5c41..e6bcddd6ef69 100644 > --- a/drivers/iommu/arm-smmu-v3.c > +++ b/drivers/iommu/arm-smmu-v3.c > @@ -560,6 +560,7 @@ struct arm_smmu_cmdq { > atomic_long_t *valid_map; > atomic_t owner_prod; > atomic_t lock; > + spinlock_t slock; > }; > > struct arm_smmu_cmdq_batch { > @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, > u64 cmd_sync[CMDQ_ENT_DWORDS]; > u32 prod; > unsigned long flags; > - bool owner; > + bool owner, locked = false; > struct arm_smmu_cmdq *cmdq = &smmu->cmdq; > struct arm_smmu_ll_queue llq = { > .max_n_shift = cmdq->q.llq.max_n_shift, > @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, > > /* 1. Allocate some space in the queue */ > local_irq_save(flags); > - llq.val = READ_ONCE(cmdq->q.llq.val); > do { > u64 old; > + llq.val = READ_ONCE(cmdq->q.llq.val); > > - while (!queue_has_space(&llq, n + sync)) { > + if (queue_has_space(&llq, n + sync)) > + goto try_cas; > + > + if (locked) > + spin_unlock(&cmdq->slock); > + > + do { > local_irq_restore(flags); > if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq)) > dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); > local_irq_save(flags); > - } > + } while (!queue_has_space(&llq, n + sync)); > > +try_cas: > head.cons = llq.cons; > head.prod = queue_inc_prod_n(&llq, n + sync) | > CMDQ_PROD_OWNED_FLAG; > > old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val); > - if (old == llq.val) > + if (old != llq.val) > break; > > - llq.val = old; > + if (!locked) { > + spin_lock(&cmdq->slock); > + locked = true; > + } > } while (1); > + > owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG); > head.prod &= ~CMDQ_PROD_OWNED_FLAG; > llq.prod &= ~CMDQ_PROD_OWNED_FLAG; > @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu) > > atomic_set(&cmdq->owner_prod, 0); > atomic_set(&cmdq->lock, 0); > + 
spin_lock_init(&cmdq->slock); > > bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL); > if (!bitmap) { > . > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 10:56 ` John Garry (?) @ 2020-07-16 11:22 ` Robin Murphy -1 siblings, 0 replies; 69+ messages in thread From: Robin Murphy @ 2020-07-16 11:22 UTC (permalink / raw) To: John Garry, Will Deacon Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz

On 2020-07-16 11:56, John Garry wrote:
> On 16/07/2020 11:28, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>>>> loop to get space on the queue takes approx 25% of the cycles for
>>>>> this function.
>>>>>
>>>>> This series removes that cmpxchg().
>>>>
>>>> How about something much simpler like the diff below?
>>>
>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>> Let me hack it some more (I have no hardware so I can only build-test
>>> this).
>>
>> Right, second attempt...
>
> I can try it, but if performance is not as good, then please check mine
> further (patch 4/4 specifically) - performance is really good, IMHO.

Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly back up to speed on this), but couldn't this be done without cmpxchg anyway? Instinctively it feels like instead of maintaining a literal software copy of the prod value, we could resolve the "claim my slot in the queue" part with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then whoever updates the hardware deals with the truncation and wrap bit to convert it to an actual register value.

Robin.
>
> Thanks,
>
>>
>> Will
>>
>> --->8
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index f578677a5c41..e6bcddd6ef69 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>>  	atomic_long_t *valid_map;
>>  	atomic_t owner_prod;
>>  	atomic_t lock;
>> +	spinlock_t slock;
>>  };
>>
>>  struct arm_smmu_cmdq_batch {
>> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>>  	u64 cmd_sync[CMDQ_ENT_DWORDS];
>>  	u32 prod;
>>  	unsigned long flags;
>> -	bool owner;
>> +	bool owner, locked = false;
>>  	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>>  	struct arm_smmu_ll_queue llq = {
>>  		.max_n_shift = cmdq->q.llq.max_n_shift,
>> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>>
>>  	/* 1. Allocate some space in the queue */
>>  	local_irq_save(flags);
>> -	llq.val = READ_ONCE(cmdq->q.llq.val);
>>  	do {
>>  		u64 old;
>> +		llq.val = READ_ONCE(cmdq->q.llq.val);
>>
>> -		while (!queue_has_space(&llq, n + sync)) {
>> +		if (queue_has_space(&llq, n + sync))
>> +			goto try_cas;
>> +
>> +		if (locked)
>> +			spin_unlock(&cmdq->slock);
>> +
>> +		do {
>>  			local_irq_restore(flags);
>>  			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>>  				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>>  			local_irq_save(flags);
>> -		}
>> +		} while (!queue_has_space(&llq, n + sync));
>>
>> +try_cas:
>>  		head.cons = llq.cons;
>>  		head.prod = queue_inc_prod_n(&llq, n + sync) |
>>  					     CMDQ_PROD_OWNED_FLAG;
>>
>>  		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
>> -		if (old == llq.val)
>> +		if (old != llq.val)
>>  			break;
>>
>> -		llq.val = old;
>> +		if (!locked) {
>> +			spin_lock(&cmdq->slock);
>> +			locked = true;
>> +		}
>>  	} while (1);
>> +
>>  	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>>  	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>>  	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
>> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>>
>>  	atomic_set(&cmdq->owner_prod, 0);
>>  	atomic_set(&cmdq->lock, 0);
>> +	spin_lock_init(&cmdq->slock);
>>
>>  	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>>  	if (!bitmap) {
>> .
>

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 11:22 ` Robin Murphy (?) @ 2020-07-16 11:30 ` John Garry -1 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-07-16 11:30 UTC (permalink / raw) To: Robin Murphy, Will Deacon Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, Linuxarm, maz

On 16/07/2020 12:22, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
>> On 16/07/2020 11:28, Will Deacon wrote:
>>> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>>>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>>>>> loop to get space on the queue takes approx 25% of the cycles for
>>>>>> this function.
>>>>>>
>>>>>> This series removes that cmpxchg().
>>>>>
>>>>> How about something much simpler like the diff below?
>>>>
>>>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>>>> Let me hack it some more (I have no hardware so I can only build-test
>>>> this).
>>>
>>> Right, second attempt...
>>
>> I can try it, but if performance is not as good, then please check mine
>> further (patch 4/4 specifically) - performance is really good, IMHO.
>
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get
> properly back up to speed on this), but couldn't this be done without
> cmpxchg anyway? Instinctively it feels like instead of maintaining a
> literal software copy of the prod value, we could resolve the "claim my
> slot in the queue" part with atomic_fetch_add on a free-running 32-bit
> "pseudo-prod" index, then whoever updates the hardware deals with the
> truncation and wrap bit to convert it to an actual register value.

That's what mine does. But I also need to take care of cmdq locking and how we unconditionally provide space.

Cheers,
John

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 11:22 ` Robin Murphy (?) @ 2020-07-16 11:32 ` Will Deacon -1 siblings, 0 replies; 69+ messages in thread From: Will Deacon @ 2020-07-16 11:32 UTC (permalink / raw) To: Robin Murphy Cc: John Garry, joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz

On Thu, Jul 16, 2020 at 12:22:05PM +0100, Robin Murphy wrote:
> On 2020-07-16 11:56, John Garry wrote:
> > On 16/07/2020 11:28, Will Deacon wrote:
> > > On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
> > > > On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
> > > > > On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
> > > > > > As mentioned in [0], the CPU may consume many cycles processing
> > > > > > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the
> > > > > > cmpxchg() loop to get space on the queue takes approx 25%
> > > > > > of the cycles for this function.
> > > > > >
> > > > > > This series removes that cmpxchg().
> > > > >
> > > > > How about something much simpler like the diff below?
> > > >
> > > > Ah, scratch that, I don't drop the lock if we fail the cas with it held.
> > > > Let me hack it some more (I have no hardware so I can only
> > > > build-test this).
> > >
> > > Right, second attempt...
> >
> > I can try it, but if performance is not as good, then please check mine
> > further (patch 4/4 specifically) - performance is really good, IMHO.
>
> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
> back up to speed on this), but couldn't this be done without cmpxchg anyway?
> Instinctively it feels like instead of maintaining a literal software copy
> of the prod value, we could resolve the "claim my slot in the queue" part
> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
> whoever updates the hardware deals with the truncation and wrap bit to
> convert it to an actual register value.

Maybe, but it's easier said than done. The hard part is figuring out that you have space, and there's no way I'm touching that logic without a way to test this.

I'm also not keen to complicate the code because of systems that don't scale well with contended CAS [1]. If you put down loads of cores, you need an interconnect/coherence protocol that can handle it.

Will

[1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency 2020-07-16 11:32 ` Will Deacon (?) @ 2020-07-16 16:50 ` John Garry -1 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-07-16 16:50 UTC (permalink / raw) To: Will Deacon, Robin Murphy Cc: joro, trivial, linux-arm-kernel, iommu, linux-kernel, linuxarm, maz

>>
>> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly
>> back up to speed on this), but couldn't this be done without cmpxchg anyway?
>> Instinctively it feels like instead of maintaining a literal software copy
>> of the prod value, we could resolve the "claim my slot in the queue" part
>> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then
>> whoever updates the hardware deals with the truncation and wrap bit to
>> convert it to an actual register value.
>
> Maybe, but it's easier said than done. The hard part is figuring out that
> you have space, and there's no way I'm touching that logic without a way to
> test this.
>
> I'm also not keen to complicate the code because of systems that don't scale
> well with contended CAS [1]. If you put down loads of cores, you need an
> interconnect/coherence protocol that can handle it.

JFYI, I added some debug to the driver to get the cmpxchg() attempt rate while running a test harness (below):

cpus	rate
2	1.1
4	1.1
8	1.3
16	3.6
32	8.1
64	12.6
96	14.7

Ideal rate is 1. So it's not crazy high for many CPUs, but still drifting away from 1.

John

>
> Will
>
> [1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c
> .
//copied from Barry, thanks :)

static int ways = 64;
module_param(ways, int, S_IRUGO);

static int seconds = 20;
module_param(seconds, int, S_IRUGO);

int mappings[NR_CPUS];
struct semaphore sem[NR_CPUS];

#define COMPLETIONS_SIZE 50

static noinline dma_addr_t test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	return dma_addr;
}

static noinline void test_unmapsingle(struct device *dev, void *buf, int size, dma_addr_t dma_addr)
{
	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

//just a hack to get a dev h behind a SMMU
extern struct device *hisi_dev;

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = hisi_dev;
	char *inputs[COMPLETIONS_SIZE];
	char *outputs[COMPLETIONS_SIZE];
	dma_addr_t dma_addr[COMPLETIONS_SIZE];
	int i, cpu = smp_processor_id();
	struct semaphore *sem = data;

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		inputs[i] = kzalloc(4096, GFP_KERNEL);
		if (!inputs[i])
			return -ENOMEM;
	}

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		outputs[i] = kzalloc(4096, GFP_KERNEL);
		if (!outputs[i])
			return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		for (i = 0; i < COMPLETIONS_SIZE; i++) {
			dma_addr[i] = test_mapsingle(dev, inputs[i], 4096);
			test_memcpy(outputs[i], inputs[i], 4096);
		}
		for (i = 0; i < COMPLETIONS_SIZE; i++)
			test_unmapsingle(dev, inputs[i], 4096, dma_addr[i]);
		mappings[cpu] += COMPLETIONS_SIZE;
	}

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		kfree(outputs[i]);
		kfree(inputs[i]);
	}

	up(sem);
	return 0;
}

void smmu_test_core(int cpus)
{
	struct task_struct *tsk;
	int i;
	int total_mappings = 0;

	ways = cpus;

	for (i = 0; i < ways; i++) {
		mappings[i] = 0;
		tsk = kthread_create_on_cpu(testthread, &sem[i], i, "map_test");
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
		wake_up_process(tsk);
	}

	for (i = 0; i < ways; i++) {
		down(&sem[i]);
		total_mappings += mappings[i];
	}

	printk(KERN_ERR "finished total_mappings=%d (per way=%d) (rate=%d per second per cpu) ways=%d\n",
	       total_mappings, total_mappings / ways,
	       total_mappings / (seconds * ways), ways);
}
EXPORT_SYMBOL(smmu_test_core);

static int __init test_init(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sema_init(&sem[i], 0);

	return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency @ 2020-07-16 16:50 ` John Garry 0 siblings, 0 replies; 69+ messages in thread From: John Garry @ 2020-07-16 16:50 UTC (permalink / raw) To: Will Deacon, Robin Murphy Cc: trivial, maz, joro, linuxarm, linux-kernel, iommu, linux-arm-kernel >> >> Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly >> back up to speed on this), but couldn't this be done without cmpxchg anyway? >> Instinctively it feels like instead of maintaining a literal software copy >> of the prod value, we could resolve the "claim my slot in the queue" part >> with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then >> whoever updates the hardware deals with the truncation and wrap bit to >> convert it to an actual register value. > > Maybe, but it's easier said than done. The hard part is figuring how that > you have space and there's no way I'm touching that logic without a way to > test this. > > I'm also not keen to complicate the code because of systems that don't scale > well with contended CAS [1]. If you put down loads of cores, you need an > interconnect/coherence protocol that can handle it. JFYI, I added some debug to the driver to get the cmpxchg() attempt rate while running a testharness (below): cpus rate 2 1.1 4 1.1 8 1.3 16 3.6 32 8.1 64 12.6 96 14.7 Ideal rate is 1. So it's not crazy high for many CPUs, but still drifting away from 1. John > > Will > > [1] https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m0d00f107c29223045933292a8b5b90d2ca9b7e5c > . 
//copied from Barry, thanks :)
static int ways = 64;
module_param(ways, int, S_IRUGO);
static int seconds = 20;
module_param(seconds, int, S_IRUGO);

int mappings[NR_CPUS];
struct semaphore sem[NR_CPUS];

#define COMPLETIONS_SIZE 50

static noinline dma_addr_t test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	return dma_addr;
}

static noinline void test_unmapsingle(struct device *dev, void *buf, int size, dma_addr_t dma_addr)
{
	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

//just a hack to get a dev handle behind an SMMU
extern struct device *hisi_dev;

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = hisi_dev;
	char *inputs[COMPLETIONS_SIZE];
	char *outputs[COMPLETIONS_SIZE];
	dma_addr_t dma_addr[COMPLETIONS_SIZE];
	int i, cpu = smp_processor_id();
	struct semaphore *sem = data;

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		inputs[i] = kzalloc(4096, GFP_KERNEL);
		if (!inputs[i])
			return -ENOMEM;
	}

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		outputs[i] = kzalloc(4096, GFP_KERNEL);
		if (!outputs[i])
			return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		for (i = 0; i < COMPLETIONS_SIZE; i++) {
			dma_addr[i] = test_mapsingle(dev, inputs[i], 4096);
			test_memcpy(outputs[i], inputs[i], 4096);
		}
		for (i = 0; i < COMPLETIONS_SIZE; i++)
			test_unmapsingle(dev, inputs[i], 4096, dma_addr[i]);
		mappings[cpu] += COMPLETIONS_SIZE;
	}

	for (i = 0; i < COMPLETIONS_SIZE; i++) {
		kfree(outputs[i]);
		kfree(inputs[i]);
	}

	up(sem);
	return 0;
}

void smmu_test_core(int cpus)
{
	struct task_struct *tsk;
	int i;
	int total_mappings = 0;

	ways = cpus;

	for (i = 0; i < ways; i++) {
		mappings[i] = 0;
		tsk = kthread_create_on_cpu(testthread, &sem[i], i, "map_test");
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
		wake_up_process(tsk);
	}

	for (i = 0; i < ways; i++) {
		down(&sem[i]);
		total_mappings += mappings[i];
	}

	printk(KERN_ERR "finished total_mappings=%d (per way=%d) (rate=%d per second per cpu) ways=%d\n",
	       total_mappings, total_mappings / ways,
	       total_mappings / (seconds * ways), ways);
}
EXPORT_SYMBOL(smmu_test_core);

static int __init test_init(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sema_init(&sem[i], 0);

	return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency
  2020-07-16 10:28     ` Will Deacon
@ 2020-07-16 13:31       ` John Garry
  -1 siblings, 0 replies; 69+ messages in thread
From: John Garry @ 2020-07-16 13:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: robin.murphy, joro, trivial, linux-arm-kernel, iommu, linux-kernel,
      linuxarm, maz

On 16/07/2020 11:28, Will Deacon wrote:
> On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
>> On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
>>> On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
>>>> As mentioned in [0], the CPU may consume many cycles processing
>>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>>> get space on the queue takes approx 25% of the cycles for this function.
>>>>
>>>> This series removes that cmpxchg().
>>>
>>> How about something much simpler like the diff below?
>>
>> Ah, scratch that, I don't drop the lock if we fail the cas with it held.
>> Let me hack it some more (I have no hardware so I can only build-test this).
>
> Right, second attempt...
>
> Will

Unfortunately that hangs my machine during boot:

[10.902893] 00:01: ttyS0 at MMIO 0x3f00003f8 (irq = 6, base_baud = 115200) is a 16550A
[10.912048] SuperH (H)SCI(F) driver initialized
[10.916811] msm_serial: driver initialized
[10.921371] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[10.926946] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00000fef)
[10.935374] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 65536 entries for cmdq
[10.942522] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 32768 entries for evtq

>
> --->8
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f578677a5c41..e6bcddd6ef69 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
>  	atomic_long_t		*valid_map;
>  	atomic_t		owner_prod;
>  	atomic_t		lock;
> +	spinlock_t		slock;
>  };
>  
>  struct arm_smmu_cmdq_batch {
> @@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>  	u64 cmd_sync[CMDQ_ENT_DWORDS];
>  	u32 prod;
>  	unsigned long flags;
> -	bool owner;
> +	bool owner, locked = false;
>  	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>  	struct arm_smmu_ll_queue llq = {
>  		.max_n_shift = cmdq->q.llq.max_n_shift,
> @@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>  
>  	/* 1. Allocate some space in the queue */
>  	local_irq_save(flags);
> -	llq.val = READ_ONCE(cmdq->q.llq.val);
>  	do {
>  		u64 old;
> +		llq.val = READ_ONCE(cmdq->q.llq.val);
>  
> -		while (!queue_has_space(&llq, n + sync)) {
> +		if (queue_has_space(&llq, n + sync))
> +			goto try_cas;
> +
> +		if (locked)
> +			spin_unlock(&cmdq->slock);
> +
> +		do {
>  			local_irq_restore(flags);
>  			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
>  				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>  			local_irq_save(flags);
> -		}
> +		} while (!queue_has_space(&llq, n + sync));
>  
> +try_cas:
>  		head.cons = llq.cons;
>  		head.prod = queue_inc_prod_n(&llq, n + sync) |
>  					     CMDQ_PROD_OWNED_FLAG;
>  
>  		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> -		if (old == llq.val)
> +		if (old != llq.val)

Not sure why you changed this. And if I change it back, it seems that we
could drop out of the loop with cmdq->slock held, so we need to drop the
lock also. I tried that and it stops my machine hanging. Let me know if
that was the intention, so I can test.

Thanks,
John

>  			break;
>  
> -		llq.val = old;
> +		if (!locked) {
> +			spin_lock(&cmdq->slock);
> +			locked = true;
> +		}
>  	} while (1);
> +
>  	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
>  	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
>  	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
> @@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
>  
>  	atomic_set(&cmdq->owner_prod, 0);
>  	atomic_set(&cmdq->lock, 0);
> +	spin_lock_init(&cmdq->slock);
>  
>  	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
>  	if (!bitmap) {
> .
>

^ permalink raw reply	[flat|nested] 69+ messages in thread
end of thread, other threads:[~2020-07-16 16:57 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22 17:28 [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-06-22 17:28 ` [PATCH 1/4] iommu/arm-smmu-v3: Fix trivial typo John Garry
2020-06-22 17:28 ` [PATCH 2/4] iommu/arm-smmu-v3: Calculate bits for prod and owner John Garry
2020-06-22 17:28 ` [PATCH 3/4] iommu/arm-smmu-v3: Always issue a CMD_SYNC per batch John Garry
2020-06-22 17:28 ` [PATCH 4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist() John Garry
2020-06-23  1:07   ` kernel test robot
2020-06-23  9:21     ` John Garry
2020-06-23  9:35   ` Rikard Falkeborn
2020-06-23 10:19     ` John Garry
2020-06-23 13:55       ` Rikard Falkeborn
2020-06-26 10:05         ` John Garry
2020-06-23 16:22   ` Robin Murphy
2020-06-24  8:15     ` John Garry
2020-07-16 10:20       ` Will Deacon
2020-07-16 10:26         ` John Garry
2020-07-08 13:00 ` [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency John Garry
2020-07-16 10:19   ` Will Deacon
2020-07-16 10:22     ` Will Deacon
2020-07-16 10:28       ` Will Deacon
2020-07-16 10:56         ` John Garry
2020-07-16 11:22           ` Robin Murphy
2020-07-16 11:30             ` John Garry
2020-07-16 11:32             ` Will Deacon
2020-07-16 16:50               ` John Garry
2020-07-16 13:31         ` John Garry