From: Rob Herring <robh@kernel.org>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Linux IOMMU <iommu@lists.linux-foundation.org>,
	Will Deacon <will@kernel.org>,
	"moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v2] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
Date: Thu, 13 Feb 2020 15:49:51 -0600
Message-ID: <CAL_JsqJRSD-7U8UH1tevBdD2P6qPWzApQLpXU-SVBmZ8=Yiy0A@mail.gmail.com>
In-Reply-To: <7ac3f864-6c39-76e9-ee4a-21be03abc044@arm.com>

On Thu, Jan 30, 2020 at 11:34 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 30/01/2020 3:06 pm, Auger Eric wrote:
> > Hi Rob,
> > On 1/17/20 10:16 PM, Rob Herring wrote:
> >> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >> Support for range invalidate is determined by the RIL bit in the IDR3
> >> register.
> >>
> >> The range invalidate is in units of the leaf page size and operates on
> >> 1-32 chunks of a power of 2 multiple pages. First, we determine from the
> >> size what power of 2 multiple we can use. Then we calculate how many
> >> chunks (1-31) of the power of 2 size for the range on the iteration. On
> >> each iteration, we move up in size by at least 5 bits.
> >>
> >> Cc: Eric Auger <eric.auger@redhat.com>
> >> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Robin Murphy <robin.murphy@arm.com>
> >> Cc: Joerg Roedel <joro@8bytes.org>
> >> Signed-off-by: Rob Herring <robh@kernel.org>


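To make the decomposition above concrete before the code: take a made-up
range of 515 leaf pages, i.e. 0b1000000011. The first command gets
scale = __ffs(515) = 0 and num = 515 & 0x1f = 3, invalidating 3 pages;
the remaining 512 pages then give scale = 9, num = 1, so the whole range
is covered in two commands, with scale moving up 9 bits between them.
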
> >> @@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>   {
> >>      u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
> >>      struct arm_smmu_device *smmu = smmu_domain->smmu;
> >> -    unsigned long start = iova, end = iova + size;
> >> +    unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
> >>      int i = 0;
> >>      struct arm_smmu_cmdq_ent cmd = {
> >>              .tlbi = {
> >> @@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>              cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> >>      }
> >>
> >> +    if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >> +            /* Get the leaf page size */
> >> +            tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >> +
> >> +            /* Convert page size of 12,14,16 (log2) to 1,2,3 */
> >> +            cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
>
> Given the comment, I think "(tg - 10) / 2" would suffice ;)

Well, duh...
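(And the two forms agree: (12 - 10) / 2 = 1, (14 - 10) / 2 = 2,
(16 - 10) / 2 = 3.)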

>
> >> +
> >> +            /* Determine what level the granule is at */
> >> +            cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
>
> Is it possible to rephrase that with logs and shifts to avoid the division?

Well, a loop is the only other way I came up with:

bpl = tg - 3;            /* bits of IA resolved per table level */
ttl = 3;                 /* start from the leaf level */
mask = BIT(bpl) - 1;
/* step up a level while granule has no bits in that level's index range */
while ((granule & (mask << ((4 - ttl) * bpl + 3))) == 0)
    ttl--;
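
(Quick sanity check with a 4K leaf, so tg = 12 and bpl = 9: a 2M granule
has only bit 21 set, which mask << 12 misses, so ttl drops to 2 and
mask << 21 then matches; a 4K granule matches mask << 12 straight away
and stays at ttl = 3.)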

Doesn't seem like much improvement to me given we have h/w divide...

>
> >> +
> >> +            num_pages = size / (1UL << tg);
>
> Similarly, in my experience GCC has always seemed too cautious to elide
> non-constant division even in a seemingly-obvious case like this, so
> explicit "size >> tg" might be preferable.

My experience has been that GCC is smart enough. I checked, and there's
only one divide, and it's the prior one. But I'll change it anyway.
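That is, the line above simply becomes:

    num_pages = size >> tg;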

>
> >> +    }
> >> +
> >>      while (iova < end) {
> >>              if (i == CMDQ_BATCH_ENTRIES) {
> >>                      arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, false);
> >>                      i = 0;
> >>              }
> >>
> >> +            if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >> +                    /*
> >> +                     * On each iteration of the loop, the range is 5 bits
> >> +                     * worth of the aligned size remaining.
> >> +                     * The range in pages is:
> >> +                     *
> >> +                     * range = (num_pages & (0x1f << __ffs(num_pages)))
> >> +                     */
> >> +                    unsigned long scale, num;
> >> +
> >> +                    /* Determine the power of 2 multiple number of pages */
> >> +                    scale = __ffs(num_pages);
> >> +                    cmd.tlbi.scale = scale;
> >> +
> >> +                    /* Determine how many chunks of 2^scale size we have */
> >> +                    num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
> >> +                    cmd.tlbi.num = num - 1;
> >> +
> >> +                    /* range is num * 2^scale * pgsize */
> >> +                    granule = num << (scale + tg);
> >> +
> >> +                    /* Clear out the lower order bits for the next iteration */
> >> +                    num_pages -= num << scale;
> > Regarding the 2 options given in
> > https://lore.kernel.org/linux-arm-kernel/CAL_JsqKABoE+0crGwyZdNogNgEoG=MOOpf6deQgH6s73c0UNdA@mail.gmail.com/raw,
> >
> > I understand you implemented 2), but I still do not understand why you
> > preferred it over 1).
> >
> > In your case of 1023 4K pages, this will invalidate with 31 commands of
> > 32*2^0*4K pages plus one of 31*2^0*4K pages, whereas you could achieve
> > that with 10 invalidations with the 1st algo. I do not see the case
> > where it is more efficient; can you give details?
>
> I guess essentially we want to solve for two variables to get a range as
> close to size as possible. There might be a more elegant numerical
> method, but for the numbers involved brute force is probably good enough
> for the real world. 5 minutes alone with the architecture spec and a
> blank editor begat this pseudo-implementation:
>
>         size_t npages = size >> pgshift;
>         while (npages) {
>                 unsigned long range;
>                 unsigned int delta, best = UINT_MAX;
>                 int scale = min(31, __ffs(npages));
>                 int num = min(32, npages >> scale); /* in case scale is already 0 */
>
>                 while (scale) {
>                         num = min(32, npages >> scale);
>                         if (num == 32)
>                                 break;
>
>                         delta = npages & ((1 << scale) - 1);
>                         if (!delta || delta > best)
>                                 break;
>
>                         best = delta;
>                         scale--;
>                 }
>
>                 /* invalidate num chunks of 2^scale pages at iova */
>
>                 range = num << scale;
>                 npages -= range;
>                 iova += range << pgshift;
>         }
>
> Modulo any obvious thinkos, what do you reckon?

I don't think this is an improvement. See my other reply.
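
To spell out the 1023-page example here as well (assuming
CMDQ_TLBI_RANGE_NUM_MAX is 0x1f, per the comment in the patch): the
first pass of the loop above gets scale = __ffs(1023) = 0 and
num = 1023 & 0x1f = 31, invalidating 31 pages; that leaves
992 = 31 << 5, which the second pass covers with scale = 5, num = 31.
Two commands total, not 32.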

Rob
