Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support

From: Rob Herring <robh@kernel.org>
To: Auger Eric <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Joerg Roedel <joro@8bytes.org>,
	Linux IOMMU <iommu@lists.linux-foundation.org>,
	Will Deacon <will@kernel.org>,
	"moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
Date: Thu, 16 Jan 2020 17:09:06 -0600
Message-ID: <CAL_JsqKABoE+0crGwyZdNogNgEoG=MOOpf6deQgH6s73c0UNdA@mail.gmail.com>
In-Reply-To: <4e56aa27-37f0-d8d9-46fd-871055abcb49@redhat.com>

On Thu, Jan 16, 2020 at 3:23 PM Auger Eric <eric.auger@redhat.com> wrote:
>
> Hi Rob,
>
> On 1/16/20 5:57 PM, Rob Herring wrote:
> > On Wed, Jan 15, 2020 at 10:33 AM Auger Eric <eric.auger@redhat.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 1/15/20 3:02 PM, Rob Herring wrote:
> >>> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
> >>>>
> >>>> Hi Rob,
> >>>>
> >>>> On 1/13/20 3:39 PM, Rob Herring wrote:
> >>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >>>>> Support for range invalidate is determined by the RIL bit in the IDR3
> >>>>> register.
> >>>>>
> >>>>> The range invalidate is in units of the leaf page size and operates on
> >>>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
> >>>>> size what power of 2 multiple we can use and then adjust the granule to
> >>>>> 32x that size.
> >
> >>>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>>>>               cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> >>>>>       }
> >>>>>
> >>>>> +     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >>>>> +             unsigned long tg, scale;
> >>>>> +
> >>>>> +             /* Get the leaf page size */
> >>>>> +             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >>>> it is unclear to me why you can't set tg with the granule parameter.
> >>>
> >>> granule could be 2MB sections if THP is enabled, right?
> >>
> >> Ah OK I thought it was a page size and not a block size.
> >>
> >> I requested this feature a long time ago for virtual SMMUv3. With
> >> DPDK/VFIO the guest was sending page TLB invalidation for each page
> >> (granule=4K or 64K) part of the hugepage buffer and those were trapped
> >> by the VMM. This stalled qemu.
> >
> > I did some more testing to make sure THP is enabled, but haven't been
> > able to get granule to be anything but 4K. I only have the Fast Model
> > with AHCI on PCI to test this with. Maybe I'm hitting some place where
> > THPs aren't supported yet.
> >
> >>>>> +             /* Determine the power of 2 multiple number of pages */
> >>>>> +             scale = __ffs(size / (1UL << tg));
> >>>>> +             cmd.tlbi.scale = scale;
> >>>>> +
> >>>>> +             cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
> >>>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
> >>>
> >>> How's this:
> >>> /* The invalidation loop defaults to the maximum range */
> >> I would have expected num=0 directly. Don't we invalidate the &size in
> >> one shot as 2^scale * pages of granularity @tg? I fail to understand
> >> when NUM > 0.
> >
> > NUM is > 0 anytime size is not a power of 2. For example, if size is
> > 33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
> > size is 34 pages, then NUM is (17-1) and SCALE is 1.
> OK I get it now. I misread the scale computation as log2() :-(.
>
> I still have a doubt about the scale choice. What if you invalidate a
> large number of pages such as 1025 pages. scale is 0 and you end up with
> 32 * 32 * 2^0 + 1 * 2 * 2^0  invalidations (33). Whereas you could
> invalidate the whole range with 2 invalidation commands: 1 x 2^10 +
> 1*1^1 (packing the invalidations by largest scale). Am I correct or do I
> still miss something?

No, that's correct. 33 is a lot better than 1025 though. :) 1023 pages
is about the worst case if we assume we get 2MB blocks, but maybe not
a good assumption given our testing so far...

So thinking out loud, I guess we could iterate on power of 2 chunks of
size (in units of pages) like this:

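while (size) {
  scale = __fls(size);    /* index of the highest set bit; fls() is 1-based */
  range = 1UL << scale;
  size &= ~range;

  iova += range << tg;    /* range is in pages, iova is in bytes */
}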
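
To put a number on that, here is a minimal userspace sketch of the loop
above that only counts how many commands it would issue. msb_index() is a
hypothetical stand-in for the kernel's __fls(), and count_pow2_cmds() is a
name made up for this illustration; neither is part of the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's __fls(): 0-based index of the
 * most significant set bit. x must be non-zero. */
static unsigned int msb_index(unsigned long x)
{
	unsigned int i = 0;

	assert(x != 0);
	while (x >>= 1)
		i++;
	return i;
}

/* Count the commands needed when every command covers exactly one
 * power-of-2 chunk of pages (SCALE = chunk order, NUM always 0). */
static unsigned int count_pow2_cmds(unsigned long num_pages)
{
	unsigned int cmds = 0;

	while (num_pages) {
		unsigned long range = 1UL << msb_index(num_pages);

		num_pages &= ~range;	/* drop the chunk just covered */
		cmds++;
	}
	return cmds;
}
```

This is one command per set bit in the page count, so 1025 pages takes 2
commands but 1023 pages takes 10.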

But that means NUM is always 0, so also not ideal. So we need to
extract 5 bits from size for NUM on each iteration:

while (size) {
  scale = __ffs(size);
  num = (size >> scale) & 0x1f;    /* number of 2^scale page chunks, 1-31 */
  size -= num << scale;            /* the NUM field itself would be num - 1 */

  ...
}

So worst case, we'd have 4 invalidates for up to 4G.
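
As a sanity check on that count, a rough userspace sketch of the 5-bit
extraction follows. struct range_cmd, lsb_index() and split_range() are
hypothetical names for illustration only (lsb_index() stands in for the
kernel's __ffs()), and the NUM encoding of "chunks minus one" is the
assumption used earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: the SCALE and NUM values one command would carry. */
struct range_cmd {
	unsigned int scale;	/* chunk size is 2^scale pages */
	unsigned int num;	/* encoded as (number of chunks - 1) */
};

/* Hypothetical stand-in for the kernel's __ffs(): 0-based index of the
 * least significant set bit. x must be non-zero. */
static unsigned int lsb_index(unsigned long x)
{
	unsigned int i = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* Split num_pages into range commands, taking up to 5 bits per command
 * starting at the lowest set bit. If out is non-NULL, at most max entries
 * are filled in. Returns the number of commands needed. */
static unsigned int split_range(unsigned long num_pages,
				struct range_cmd *out, unsigned int max)
{
	unsigned int n = 0;

	while (num_pages) {
		unsigned int scale = lsb_index(num_pages);
		/* Bit 0 of the shifted value is set, so chunks is 1..31 */
		unsigned long chunks = (num_pages >> scale) & 0x1f;

		if (out) {
			assert(n < max);
			out[n].scale = scale;
			out[n].num = (unsigned int)(chunks - 1);
		}
		n++;
		/* Clear the bits just covered for the next iteration */
		num_pages -= chunks << scale;
	}
	return n;
}
```

With this, 34 pages is a single command (SCALE 1, NUM 16), 33 and 1025
pages are 2 commands each, and 2^20 - 1 pages (just under 4G with 4K
pages) is 4.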

> Besides in the patch I think in the while loop the iova should be
> incremented with the actual number of invalidated bytes and not the max
> sized granule variable.

Ok.

Rob
