* Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
@ 2020-01-15 16:32 ` Auger Eric
0 siblings, 0 replies; 16+ messages in thread
From: Auger Eric @ 2020-01-15 16:32 UTC (permalink / raw)
To: Rob Herring
Cc: Jean-Philippe Brucker, Will Deacon, Joerg Roedel, Linux IOMMU,
Robin Murphy,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
Hi Rob,
On 1/15/20 3:02 PM, Rob Herring wrote:
> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
>>
>> Hi Rob,
>>
>> On 1/13/20 3:39 PM, Rob Herring wrote:
>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
>>> Support for range invalidate is determined by the RIL bit in the IDR3
>>> register.
>>>
>>> The range invalidate is in units of the leaf page size and operates on
>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
>>> size what power of 2 multiple we can use and then adjust the granule to
>>> 32x that size.
>>>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>> Cc: Joerg Roedel <joro@8bytes.org>
>>> Signed-off-by: Rob Herring <robh@kernel.org>
>>> ---
>>> drivers/iommu/arm-smmu-v3.c | 53 +++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>> index e91b4a098215..8b6b3e2aa383 100644
>>> --- a/drivers/iommu/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>> @@ -70,6 +70,9 @@
>>> #define IDR1_SSIDSIZE GENMASK(10, 6)
>>> #define IDR1_SIDSIZE GENMASK(5, 0)
>>>
>>> +#define ARM_SMMU_IDR3 0xc
>>> +#define IDR3_RIL (1 << 10)
>>> +
>>> #define ARM_SMMU_IDR5 0x14
>>> #define IDR5_STALL_MAX GENMASK(31, 16)
>>> #define IDR5_GRAN64K (1 << 6)
>>> @@ -327,9 +330,14 @@
>>> #define CMDQ_CFGI_1_LEAF (1UL << 0)
>>> #define CMDQ_CFGI_1_RANGE GENMASK_ULL(4, 0)
>>>
>>> +#define CMDQ_TLBI_0_NUM GENMASK_ULL(16, 12)
>>> +#define CMDQ_TLBI_RANGE_NUM_MAX 32
>>> +#define CMDQ_TLBI_0_SCALE GENMASK_ULL(24, 20)
>>> #define CMDQ_TLBI_0_VMID GENMASK_ULL(47, 32)
>>> #define CMDQ_TLBI_0_ASID GENMASK_ULL(63, 48)
>>> #define CMDQ_TLBI_1_LEAF (1UL << 0)
>>> +#define CMDQ_TLBI_1_TTL GENMASK_ULL(9, 8)
>>> +#define CMDQ_TLBI_1_TG GENMASK_ULL(11, 10)
>>> #define CMDQ_TLBI_1_VA_MASK GENMASK_ULL(63, 12)
>>> #define CMDQ_TLBI_1_IPA_MASK GENMASK_ULL(51, 12)
>>>
>>> @@ -455,9 +463,13 @@ struct arm_smmu_cmdq_ent {
>>> #define CMDQ_OP_TLBI_S2_IPA 0x2a
>>> #define CMDQ_OP_TLBI_NSNH_ALL 0x30
>>> struct {
>>> + u8 num;
>>> + u8 scale;
>>> u16 asid;
>>> u16 vmid;
>>> bool leaf;
>>> + u8 ttl;
>>> + u8 tg;
>>> u64 addr;
>>> } tlbi;
>>>
>>> @@ -595,6 +607,7 @@ struct arm_smmu_device {
>>> #define ARM_SMMU_FEAT_HYP (1 << 12)
>>> #define ARM_SMMU_FEAT_STALL_FORCE (1 << 13)
>>> #define ARM_SMMU_FEAT_VAX (1 << 14)
>>> +#define ARM_SMMU_FEAT_RANGE_INV (1 << 15)
>>> u32 features;
>>>
>>> #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
>>> @@ -856,13 +869,21 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>> cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
>>> break;
>>> case CMDQ_OP_TLBI_NH_VA:
>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
>>> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
>>> cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
>>> cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
>>> break;
>>> case CMDQ_OP_TLBI_S2_IPA:
>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
>>> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
>>> cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
>>> cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
>>> break;
>>> case CMDQ_OP_TLBI_NH_ASID:
>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
>>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
>>> }
>>>
>>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>> + unsigned long tg, scale;
>>> +
>>> + /* Get the leaf page size */
>>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>> it is unclear to me why you can't set tg with the granule parameter.
>
> granule could be 2MB sections if THP is enabled, right?
Ah OK I thought it was a page size and not a block size.
I requested this feature a long time ago for virtual SMMUv3. With
DPDK/VFIO the guest was sending page TLB invalidation for each page
(granule=4K or 64K) part of the hugepage buffer and those were trapped
by the VMM. This stalled qemu.
>
>>> +
>>> + /* Determine the power of 2 multiple number of pages */
>>> + scale = __ffs(size / (1UL << tg));
>>> + cmd.tlbi.scale = scale;
>>> +
>>> + cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
>
> How's this:
> /* The invalidation loop defaults to the maximum range */
I would have expected num=0 directly. Don't we invalidate the @size in
one shot as 2^scale * pages of granularity @tg? I fail to understand
when NUM > 0.
Thanks
Eric
>
> And perhaps I'll move it next to setting granule.
>
>>> +
>>> + /* Convert page size of 12,14,16 (log2) to 1,2,3 */
>>> + cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
>>> +
>>> + /* Determine what level the granule is at */
>>> + cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
>>> +
>>> + /* Adjust granule to the maximum range */
>>> + granule = CMDQ_TLBI_RANGE_NUM_MAX * (1 << scale) * (1UL << tg);
>> spec says
>> Range = ((NUM+1)*2 ^ SCALE )*Translation_Granule_Size
>
> (NUM+1) can be 1-32. I went with the logical max for
> CMDQ_TLBI_RANGE_NUM_MAX rather than the NUM field value max.
>
> Rob
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
2020-01-15 16:32 ` Auger Eric
@ 2020-01-16 12:14 ` Robin Murphy
-1 siblings, 0 replies; 16+ messages in thread
From: Robin Murphy @ 2020-01-16 12:14 UTC (permalink / raw)
To: Auger Eric, Rob Herring
Cc: Jean-Philippe Brucker, Will Deacon, Linux IOMMU,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
On 2020-01-15 4:32 pm, Auger Eric wrote:
> Hi Rob,
>
> On 1/15/20 3:02 PM, Rob Herring wrote:
>> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
>>>
>>> Hi Rob,
>>>
>>> On 1/13/20 3:39 PM, Rob Herring wrote:
>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
>>>> Support for range invalidate is determined by the RIL bit in the IDR3
>>>> register.
>>>>
>>>> The range invalidate is in units of the leaf page size and operates on
>>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
>>>> size what power of 2 multiple we can use and then adjust the granule to
>>>> 32x that size.
>>>>
>>>> Cc: Eric Auger <eric.auger@redhat.com>
>>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>> Cc: Will Deacon <will@kernel.org>
>>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>>> Cc: Joerg Roedel <joro@8bytes.org>
>>>> Signed-off-by: Rob Herring <robh@kernel.org>
>>>> ---
>>>> drivers/iommu/arm-smmu-v3.c | 53 +++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>> index e91b4a098215..8b6b3e2aa383 100644
>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>> @@ -70,6 +70,9 @@
>>>> #define IDR1_SSIDSIZE GENMASK(10, 6)
>>>> #define IDR1_SIDSIZE GENMASK(5, 0)
>>>>
>>>> +#define ARM_SMMU_IDR3 0xc
>>>> +#define IDR3_RIL (1 << 10)
>>>> +
>>>> #define ARM_SMMU_IDR5 0x14
>>>> #define IDR5_STALL_MAX GENMASK(31, 16)
>>>> #define IDR5_GRAN64K (1 << 6)
>>>> @@ -327,9 +330,14 @@
>>>> #define CMDQ_CFGI_1_LEAF (1UL << 0)
>>>> #define CMDQ_CFGI_1_RANGE GENMASK_ULL(4, 0)
>>>>
>>>> +#define CMDQ_TLBI_0_NUM GENMASK_ULL(16, 12)
>>>> +#define CMDQ_TLBI_RANGE_NUM_MAX 32
>>>> +#define CMDQ_TLBI_0_SCALE GENMASK_ULL(24, 20)
>>>> #define CMDQ_TLBI_0_VMID GENMASK_ULL(47, 32)
>>>> #define CMDQ_TLBI_0_ASID GENMASK_ULL(63, 48)
>>>> #define CMDQ_TLBI_1_LEAF (1UL << 0)
>>>> +#define CMDQ_TLBI_1_TTL GENMASK_ULL(9, 8)
>>>> +#define CMDQ_TLBI_1_TG GENMASK_ULL(11, 10)
>>>> #define CMDQ_TLBI_1_VA_MASK GENMASK_ULL(63, 12)
>>>> #define CMDQ_TLBI_1_IPA_MASK GENMASK_ULL(51, 12)
>>>>
>>>> @@ -455,9 +463,13 @@ struct arm_smmu_cmdq_ent {
>>>> #define CMDQ_OP_TLBI_S2_IPA 0x2a
>>>> #define CMDQ_OP_TLBI_NSNH_ALL 0x30
>>>> struct {
>>>> + u8 num;
>>>> + u8 scale;
>>>> u16 asid;
>>>> u16 vmid;
>>>> bool leaf;
>>>> + u8 ttl;
>>>> + u8 tg;
>>>> u64 addr;
>>>> } tlbi;
>>>>
>>>> @@ -595,6 +607,7 @@ struct arm_smmu_device {
>>>> #define ARM_SMMU_FEAT_HYP (1 << 12)
>>>> #define ARM_SMMU_FEAT_STALL_FORCE (1 << 13)
>>>> #define ARM_SMMU_FEAT_VAX (1 << 14)
>>>> +#define ARM_SMMU_FEAT_RANGE_INV (1 << 15)
>>>> u32 features;
>>>>
>>>> #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
>>>> @@ -856,13 +869,21 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>>> cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
>>>> break;
>>>> case CMDQ_OP_TLBI_NH_VA:
>>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
>>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
>>>> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
>>>> cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
>>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
>>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
>>>> cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
>>>> break;
>>>> case CMDQ_OP_TLBI_S2_IPA:
>>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
>>>> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
>>>> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
>>>> cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
>>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
>>>> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
>>>> cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
>>>> break;
>>>> case CMDQ_OP_TLBI_NH_ASID:
>>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
>>>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
>>>> }
>>>>
>>>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>>> + unsigned long tg, scale;
>>>> +
>>>> + /* Get the leaf page size */
>>>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>>> it is unclear to me why you can't set tg with the granule parameter.
>>
>> granule could be 2MB sections if THP is enabled, right?
>
> Ah OK I thought it was a page size and not a block size.
In hindsight, @granule might be more accurately called @leaf_size - for
a non-leaf invalidate, it should always be the actual granule (i.e. page
size), per __arm_lpae_unmap(). Even if we're knocking out a level 1
table, we don't walk the whole thing to find leaves at level 2 and/or
level 1 to invalidate, we just knock out the range at page granularity
to be safe. However for leaf invalidations we know exactly what we're
taking out, so @granule may be a block size if appropriate (that
definitely used to be the case, and I don't *think* the gather ops
changed it).
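The TG and TTL encodings in the quoted hunk can be checked in isolation.
What follows is only a minimal userspace sketch, not part of the patch:
log2_pow2(), tlbi_tg() and tlbi_ttl() are hypothetical helper names, and
the leaf size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two size; stands in for the kernel's ilog2(). */
static unsigned int log2_pow2(uint64_t size)
{
	unsigned int shift = 0;

	assert(size != 0 && (size & (size - 1)) == 0);
	while (size > 1) {
		size >>= 1;
		shift++;
	}
	return shift;
}

/* TG field: page shift 12, 14, 16 (4K, 16K, 64K) encodes as 1, 2, 3. */
static unsigned int tlbi_tg(unsigned int page_shift)
{
	assert(page_shift == 12 || page_shift == 14 || page_shift == 16);
	return (page_shift - 12) / 2 + 1;
}

/* TTL field: level of the leaf entry, using the formula from the patch. */
static unsigned int tlbi_ttl(uint64_t leaf_size, unsigned int page_shift)
{
	return 4 - (log2_pow2(leaf_size) - 3) / (page_shift - 3);
}
```

With 4K pages (page_shift 12), a 4K leaf gives TTL 3 and a 2MB block
gives TTL 2, which is why the level has to come from @granule while TG
comes from the smallest page size in pgsize_bitmap.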
> I requested this feature a long time ago for virtual SMMUv3. With
> DPDK/VFIO the guest was sending page TLB invalidation for each page
> (granule=4K or 64K) part of the hugepage buffer and those were trapped
> by the VMM. This stalled qemu.
Heh, I remember that being awkward to comment on at the time since we
were already speccing out 3.2 internally :)
Robin.
>>>> +
>>>> + /* Determine the power of 2 multiple number of pages */
>>>> + scale = __ffs(size / (1UL << tg));
>>>> + cmd.tlbi.scale = scale;
>>>> +
>>>> + cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
>>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
>>
>> How's this:
>> /* The invalidation loop defaults to the maximum range */
> I would have expected num=0 directly. Don't we invalidate the &size in
> one shot as 2^scale * pages of granularity @tg? I fail to understand
> when NUM > 0.
>
>
> Thanks
>
> Eric
>>
>> And perhaps I'll move it next to setting granule.
>>
>>>> +
>>>> + /* Convert page size of 12,14,16 (log2) to 1,2,3 */
>>>> + cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
>>>> +
>>>> + /* Determine what level the granule is at */
>>>> + cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
>>>> +
>>>> + /* Adjust granule to the maximum range */
>>>> + granule = CMDQ_TLBI_RANGE_NUM_MAX * (1 << scale) * (1UL << tg);
>>> spec says
>>> Range = ((NUM+1)*2 ^ SCALE )*Translation_Granule_Size
>>
>> (NUM+1) can be 1-32. I went with the logical max for
>> CMDQ_TLBI_RANGE_NUM_MAX rather than the NUM field value max.
>>
>> Rob
>>
>>
>
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
2020-01-15 16:32 ` Auger Eric
@ 2020-01-16 16:57 ` Rob Herring
-1 siblings, 0 replies; 16+ messages in thread
From: Rob Herring @ 2020-01-16 16:57 UTC (permalink / raw)
To: Auger Eric
Cc: Jean-Philippe Brucker, Will Deacon, Linux IOMMU, Robin Murphy,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
On Wed, Jan 15, 2020 at 10:33 AM Auger Eric <eric.auger@redhat.com> wrote:
>
> Hi Rob,
>
> On 1/15/20 3:02 PM, Rob Herring wrote:
> > On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 1/13/20 3:39 PM, Rob Herring wrote:
> >>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >>> Support for range invalidate is determined by the RIL bit in the IDR3
> >>> register.
> >>>
> >>> The range invalidate is in units of the leaf page size and operates on
> >>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
> >>> size what power of 2 multiple we can use and then adjust the granule to
> >>> 32x that size.
> >>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
> >>> }
> >>>
> >>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >>> + unsigned long tg, scale;
> >>> +
> >>> + /* Get the leaf page size */
> >>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >> it is unclear to me why you can't set tg with the granule parameter.
> >
> > granule could be 2MB sections if THP is enabled, right?
>
> Ah OK I thought it was a page size and not a block size.
>
> I requested this feature a long time ago for virtual SMMUv3. With
> DPDK/VFIO the guest was sending page TLB invalidation for each page
> (granule=4K or 64K) part of the hugepage buffer and those were trapped
> by the VMM. This stalled qemu.
I did some more testing to make sure THP is enabled, but haven't been
able to get granule to be anything but 4K. I only have the Fast Model
with AHCI on PCI to test this with. Maybe I'm hitting some place where
THPs aren't supported yet.
> >>> + /* Determine the power of 2 multiple number of pages */
> >>> + scale = __ffs(size / (1UL << tg));
> >>> + cmd.tlbi.scale = scale;
> >>> +
> >>> + cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
> >> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
> >
> > How's this:
> > /* The invalidation loop defaults to the maximum range */
> I would have expected num=0 directly. Don't we invalidate the &size in
> one shot as 2^scale * pages of granularity @tg? I fail to understand
> when NUM > 0.
NUM is > 0 anytime size is not a power of 2. For example, if size is
33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
size is 34 pages, then NUM is (17-1) and SCALE is 1.
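The two examples above can be reproduced with a small userspace sketch.
This is only an illustration of the arithmetic as described in this
mail, not the loop from the patch: struct range_cmd, lowest_set_bit()
and split_fixed_scale() are hypothetical names, and the page count is
assumed to fit in 32 bits so that SCALE stays within its 5-bit field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_NUM_MAX 32 /* largest (NUM + 1), as in the patch */

struct range_cmd {
	unsigned int scale; /* SCALE field */
	unsigned int num;   /* NUM field: chunk count minus one */
};

/* Index of the lowest set bit; stands in for the kernel's __ffs(). */
static unsigned int lowest_set_bit(uint64_t x)
{
	unsigned int bit = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Split `pages` leaf pages into range commands that all share one SCALE,
 * chosen from the lowest set bit of the page count. Each command covers
 * (num + 1) << scale pages. Returns the number of commands written.
 */
static size_t split_fixed_scale(uint64_t pages, struct range_cmd *cmds,
				size_t max_cmds)
{
	unsigned int scale;
	uint64_t max_chunk;
	size_t n = 0;

	assert(pages != 0 && pages <= UINT32_MAX);
	scale = lowest_set_bit(pages);
	max_chunk = (uint64_t)RANGE_NUM_MAX << scale;

	while (pages) {
		uint64_t chunk = pages < max_chunk ? pages : max_chunk;

		assert(n < max_cmds);
		cmds[n].scale = scale;
		cmds[n].num = (unsigned int)(chunk >> scale) - 1;
		n++;
		pages -= chunk;
	}
	return n;
}
```

For 33 pages this yields two commands (SCALE 0 with NUM 31, then SCALE 0
with NUM 0); for 34 pages it yields one command with SCALE 1 and NUM 16.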
Rob
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
2020-01-16 16:57 ` Rob Herring
@ 2020-01-16 21:23 ` Auger Eric
-1 siblings, 0 replies; 16+ messages in thread
From: Auger Eric @ 2020-01-16 21:23 UTC (permalink / raw)
To: Rob Herring
Cc: Jean-Philippe Brucker, Robin Murphy, Linux IOMMU, Will Deacon,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
Hi Rob,
On 1/16/20 5:57 PM, Rob Herring wrote:
> On Wed, Jan 15, 2020 at 10:33 AM Auger Eric <eric.auger@redhat.com> wrote:
>>
>> Hi Rob,
>>
>> On 1/15/20 3:02 PM, Rob Herring wrote:
>>> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
>>>>
>>>> Hi Rob,
>>>>
>>>> On 1/13/20 3:39 PM, Rob Herring wrote:
>>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
>>>>> Support for range invalidate is determined by the RIL bit in the IDR3
>>>>> register.
>>>>>
>>>>> The range invalidate is in units of the leaf page size and operates on
>>>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
>>>>> size what power of 2 multiple we can use and then adjust the granule to
>>>>> 32x that size.
>
>>>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
>>>>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
>>>>> }
>>>>>
>>>>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>>>> + unsigned long tg, scale;
>>>>> +
>>>>> + /* Get the leaf page size */
>>>>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>>>> it is unclear to me why you can't set tg with the granule parameter.
>>>
>>> granule could be 2MB sections if THP is enabled, right?
>>
>> Ah OK I thought it was a page size and not a block size.
>>
>> I requested this feature a long time ago for virtual SMMUv3. With
>> DPDK/VFIO the guest was sending page TLB invalidation for each page
>> (granule=4K or 64K) part of the hugepage buffer and those were trapped
>> by the VMM. This stalled qemu.
>
> I did some more testing to make sure THP is enabled, but haven't been
> able to get granule to be anything but 4K. I only have the Fast Model
> with AHCI on PCI to test this with. Maybe I'm hitting some place where
> THPs aren't supported yet.
>
>>>>> + /* Determine the power of 2 multiple number of pages */
>>>>> + scale = __ffs(size / (1UL << tg));
>>>>> + cmd.tlbi.scale = scale;
>>>>> +
>>>>> + cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
>>>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
>>>
>>> How's this:
>>> /* The invalidation loop defaults to the maximum range */
>> I would have expected num=0 directly. Don't we invalidate the &size in
>> one shot as 2^scale * pages of granularity @tg? I fail to understand
>> when NUM > 0.
>
> NUM is > 0 anytime size is not a power of 2. For example, if size is
> 33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
> size is 34 pages, then NUM is (17-1) and SCALE is 1.
OK I get it now. I misread the scale computation as log2() :-(.
I still have a doubt about the scale choice. What if you invalidate a
large number of pages, such as 1025? scale is 0 and you end up
invalidating 32 * 32 * 2^0 + 1 * 1 * 2^0 pages, i.e. 33 invalidation
commands, whereas you could invalidate the whole range with 2 commands:
1 * 2^10 + 1 * 2^0 pages (packing the invalidations by largest scale).
Am I correct, or am I still missing something?
Besides, I think the while loop in the patch should increment the iova
by the actual number of bytes invalidated, not by the max-sized granule
variable.
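To make the packing idea concrete, here is a minimal userspace sketch.
It is not the patch's loop: struct range_cmd, lowest_set_bit() and
pack_by_scale() are hypothetical names, the page count is assumed to fit
in 32 bits, and each command takes at most 31 chunks so that the
remaining count always has a higher lowest set bit. The address advances
by the bytes each command actually covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range_cmd {
	uint64_t addr;      /* start address covered by this command */
	unsigned int scale; /* SCALE field */
	unsigned int num;   /* NUM field: chunk count minus one */
};

/* Index of the lowest set bit; stands in for the kernel's __ffs(). */
static unsigned int lowest_set_bit(uint64_t x)
{
	unsigned int bit = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Emit one command per 5-bit group of the page count, starting from the
 * lowest set bit, so SCALE grows as the remaining count shrinks.
 * Returns the number of commands written.
 */
static size_t pack_by_scale(uint64_t iova, uint64_t pages,
			    unsigned int page_shift,
			    struct range_cmd *cmds, size_t max_cmds)
{
	size_t n = 0;

	assert(pages <= UINT32_MAX);
	while (pages) {
		unsigned int scale = lowest_set_bit(pages);
		uint64_t chunks = (pages >> scale) & 31; /* 1..31 */

		assert(n < max_cmds);
		cmds[n].addr = iova;
		cmds[n].scale = scale;
		cmds[n].num = (unsigned int)chunks - 1;
		n++;

		/* Advance by what this command covered, not a fixed maximum. */
		iova += (chunks << scale) << page_shift;
		pages -= chunks << scale;
	}
	return n;
}
```

For 1025 pages of 4K this produces two commands: one page at SCALE 0,
then 1024 pages at SCALE 10 starting 4K further on.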
Thanks
Eric
>
> Rob
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support
2020-01-16 21:23 ` Auger Eric
@ 2020-01-16 23:09 ` Rob Herring
-1 siblings, 0 replies; 16+ messages in thread
From: Rob Herring @ 2020-01-16 23:09 UTC (permalink / raw)
To: Auger Eric
Cc: Jean-Philippe Brucker, Robin Murphy, Linux IOMMU, Will Deacon,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
On Thu, Jan 16, 2020 at 3:23 PM Auger Eric <eric.auger@redhat.com> wrote:
>
> Hi Rob,
>
> On 1/16/20 5:57 PM, Rob Herring wrote:
> > On Wed, Jan 15, 2020 at 10:33 AM Auger Eric <eric.auger@redhat.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 1/15/20 3:02 PM, Rob Herring wrote:
> >>> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger@redhat.com> wrote:
> >>>>
> >>>> Hi Rob,
> >>>>
> >>>> On 1/13/20 3:39 PM, Rob Herring wrote:
> >>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >>>>> Support for range invalidate is determined by the RIL bit in the IDR3
> >>>>> register.
> >>>>>
> >>>>> The range invalidate is in units of the leaf page size and operates on
> >>>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
> >>>>> size what power of 2 multiple we can use and then adjust the granule to
> >>>>> 32x that size.
> >
> >>>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>>>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
> >>>>> }
> >>>>>
> >>>>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >>>>> + unsigned long tg, scale;
> >>>>> +
> >>>>> + /* Get the leaf page size */
> >>>>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >>>> it is unclear to me why you can't set tg with the granule parameter.
> >>>
> >>> granule could be 2MB sections if THP is enabled, right?
> >>
> >> Ah OK I thought it was a page size and not a block size.
> >>
> >> I requested this feature a long time ago for virtual SMMUv3. With
> >> DPDK/VFIO the guest was sending page TLB invalidation for each page
> >> (granule=4K or 64K) part of the hugepage buffer and those were trapped
> >> by the VMM. This stalled qemu.
> >
> > I did some more testing to make sure THP is enabled, but haven't been
> > able to get granule to be anything but 4K. I only have the Fast Model
> > with AHCI on PCI to test this with. Maybe I'm hitting some place where
> > THPs aren't supported yet.
> >
> >>>>> + /* Determine the power of 2 multiple number of pages */
> >>>>> + scale = __ffs(size / (1UL << tg));
> >>>>> + cmd.tlbi.scale = scale;
> >>>>> +
> >>>>> + cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
> >>>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
> >>>
> >>> How's this:
> >>> /* The invalidation loop defaults to the maximum range */
> >> I would have expected num=0 directly. Don't we invalidate the @size in
> >> one shot as 2^scale * pages of granularity @tg? I fail to understand
> >> when NUM > 0.
> >
> > NUM is > 0 anytime size is not a power of 2. For example, if size is
> > 33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
> > size is 34 pages, then NUM is (17-1) and SCALE is 1.
> OK I get it now. I misread the scale computation as log2() :-(.
>
> I still have a doubt about the scale choice. What if you invalidate a
> large number of pages, such as 1025 pages? scale is 0 and you end up with
> 32 * 32 * 2^0 + 1 * 1 * 2^0 pages, i.e. 33 invalidation commands, whereas
> you could invalidate the whole range with 2 invalidation commands:
> 1 x 2^10 + 1 x 2^0 (packing the invalidations by largest scale). Am I
> correct or am I still missing something?
No, you're not missing anything; that's correct. 33 is a lot better than
1025 though. :) 1023 pages is about the worst case if we assume we get
2MB blocks, but maybe that's not a good assumption given our testing so
far...
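So thinking out loud, I guess we could iterate on power of 2 chunks of
size (in units of pages) like this:

while (size) {
	/* __fls() gives the 0-based index of the highest set bit */
	scale = __fls(size);
	range = 1UL << scale;
	size &= ~range;
	iova += range;	/* in pages; shift by tg to get bytes */
}
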
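But that means NUM is always 0, so also not ideal. So we need to
extract 5 bits from size for NUM on each iteration:

while (size) {
	scale = __ffs(size);
	/* 1..31 chunks of 2^scale pages; NUM encodes chunks - 1 */
	num = ((size >> scale) & 0x1f) - 1;
	size -= (num + 1) * (1UL << scale);
	...
}

So worst case, we'd have 4 invalidates for up to 4G (with 4K pages that
is 20 bits of page count, consumed 5 bits at a time).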
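As a rough, self-contained sketch of the 5-bits-per-iteration idea above (not the driver code), the loop below splits a page count into range commands and records SCALE, NUM and the starting page offset of each. The struct range_cmd type and the helpers lowest_set_bit() and split_range() are hypothetical names introduced for illustration. NUM is stored as the number of chunks minus one, following the "NUM is (17-1)" example earlier in the thread.

```c
#include <stddef.h>

/* Hypothetical record of one range command; not the driver's structure. */
struct range_cmd {
	unsigned int scale;		/* chunk size is 2^scale pages */
	unsigned int num;		/* number of chunks minus one, 0..30 here */
	unsigned long first_page;	/* offset from the range start, in pages */
};

/* Hypothetical stand-in for the kernel's __ffs(): index of the lowest set bit. */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int i = 0;

	/* Caller guarantees x != 0. */
	while (!(x & 1UL)) {
		x >>= 1;
		i++;
	}
	return i;
}

/*
 * Split 'pages' into range commands, consuming the 5 lowest significant
 * bits of the remaining page count on each iteration. Writes at most
 * 'max' entries to 'out' and returns how many were written.
 */
static size_t split_range(unsigned long pages, struct range_cmd *out, size_t max)
{
	unsigned long done = 0;
	size_t n = 0;

	while (pages && n < max) {
		unsigned int scale = lowest_set_bit(pages);
		/* Bit 'scale' is set, so this is always in 1..31. */
		unsigned long chunks = (pages >> scale) & 0x1f;

		out[n].scale = scale;
		out[n].num = (unsigned int)(chunks - 1);
		out[n].first_page = done;
		n++;

		/* Advance by what was actually covered, not by the maximum. */
		done += chunks << scale;
		pages -= chunks << scale;
	}
	return n;
}
```

For 1025 pages this produces two commands (SCALE 0 and then SCALE 10, both with NUM 0), and for 34 pages a single command with SCALE 1 and NUM 16. A page count of 2^20 - 1 takes four commands, in line with the worst case estimated above for 4K pages.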
> Besides in the patch I think in the while loop the iova should be
> incremented with the actual number of invalidated bytes and not the max
> sized granule variable.
Ok.
Rob
^ permalink raw reply [flat|nested] 16+ messages in thread