Re: [PATCH v2] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support

From: Rob Herring <robh@kernel.org>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Linux IOMMU <iommu@lists.linux-foundation.org>,
	Will Deacon <will@kernel.org>,
	"moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v2] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
Date: Fri, 21 Feb 2020 10:47:23 -0600	[thread overview]
Message-ID: <CAL_JsqJam8qO-XAfQTXNZ0y_gLdOx0LAX28XnKjXd2yh7y3_Uw@mail.gmail.com> (raw)
In-Reply-To: <49847ca9-3ee8-4c08-a772-a93f10aa817f@arm.com>

On Fri, Feb 21, 2020 at 10:19 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 20/02/2020 5:54 pm, Rob Herring wrote:
> > On Mon, Feb 17, 2020 at 1:17 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 13/02/2020 9:49 pm, Rob Herring wrote:
> >>> On Thu, Jan 30, 2020 at 11:34 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>>>
> >>>> On 30/01/2020 3:06 pm, Auger Eric wrote:
> >>>>> Hi Rob,
> >>>>> On 1/17/20 10:16 PM, Rob Herring wrote:
> >>>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >>>>>> Support for range invalidate is determined by the RIL bit in the IDR3
> >>>>>> register.
> >>>>>>
> >>>>>> The range invalidate is in units of the leaf page size and operates on
> >>>>>> 1-32 chunks of a power of 2 multiple pages. First, we determine from the
> >>>>>> size what power of 2 multiple we can use. Then we calculate how many
> >>>>>> chunks (1-31) of the power of 2 size for the range on the iteration. On
> >>>>>> each iteration, we move up in size by at least 5 bits.
> >>>>>>
> >>>>>> Cc: Eric Auger <eric.auger@redhat.com>
> >>>>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >>>>>> Cc: Will Deacon <will@kernel.org>
> >>>>>> Cc: Robin Murphy <robin.murphy@arm.com>
> >>>>>> Cc: Joerg Roedel <joro@8bytes.org>
> >>>>>> Signed-off-by: Rob Herring <robh@kernel.org>
> >>>
> >>>
> >>>>>> @@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>>>>>     {
> >>>>>>        u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
> >>>>>>        struct arm_smmu_device *smmu = smmu_domain->smmu;
> >>>>>> -    unsigned long start = iova, end = iova + size;
> >>>>>> +    unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
> >>>>>>        int i = 0;
> >>>>>>        struct arm_smmu_cmdq_ent cmd = {
> >>>>>>                .tlbi = {
> >>>>>> @@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>>>>>                cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> >>>>>>        }
> >>>>>>
> >>>>>> +    if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >>>>>> +            /* Get the leaf page size */
> >>>>>> +            tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >>>>>> +
> >>>>>> +            /* Convert page size of 12,14,16 (log2) to 1,2,3 */
> >>>>>> +            cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
> >>>>
> >>>> Given the comment, I think "(tg - 10) / 2" would suffice ;)
> >>>
> >>> Well, duh...
> >>>
> >>>>
> >>>>>> +
> >>>>>> +            /* Determine what level the granule is at */
> >>>>>> +            cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
> >>>>
> >>>> Is it possible to rephrase that with logs and shifts to avoid the division?
> >>>
> >>> Well, with a loop is the only other way I came up with:
> >>>
> >>> bpl = tg - 3;
> >>> ttl = 3;
> >>> mask = BIT(bpl) - 1;
> >>> while ((granule & (mask << ((4 - ttl) * bpl + 3))) == 0)
> >>>       ttl--;
> >>>
> >>> Doesn't seem like much improvement to me given we have h/w divide...
> >>
> >> Sure, it doesn't take too many extra instructions to start matching
> >> typical IDIV latency, so there's no point being silly if there really
> >> isn't a clean alternative.
> >>
> >> Some quick scribbling suggests "4 - ilog2(granule) / 10" might actually
> >> be close enough, but perhaps that's a bit too cheeky.
> >
> > How does divide by 10 save a divide?
>
> Well, by that point I was more just thinking about the smallest
> expression, since *some* division seems unavoidable. Although GCC does
> apparently still think that transforming constant division is a win ;)

Okay. Still, divide by 10 happens to work, but it is very much not
obvious. It also doesn't work for level 1 and 16 or 64KB pages (though
we'll never see that granule). Subtracting 3 is not that obvious
either, but is at least in the spec for calculating the bits per
level.

I think we've spent enough time micro-optimizing this and have better
things to worry about.

Rob
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu