linux-arm-kernel.lists.infradead.org archive mirror
* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found] <20210609145315.25750-1-saiprakash.ranjan@codeaurora.org>
@ 2021-06-09 18:44 ` Robin Murphy
       [not found]   ` <c600e9b2534d54082a5272b508a7985f@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2021-06-09 18:44 UTC (permalink / raw)
  To: Sai Prakash Ranjan, Will Deacon, Joerg Roedel
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-arm-msm

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
> Currently, for iommu_unmap() of a large scatter-gather list with
> page-size elements, the majority of time is spent flushing partial
> walks in __arm_lpae_unmap(), which is a VA-based TLB invalidation
> (TLBIVA for arm-smmu).
> 
> For example: to unmap a 32MB scatter-gather list with page-size
> elements (8192 entries), there are 16 2MB buffer unmaps based on the
> pgsize (2MB for a 4K granule), and each 2MB unmap further results in
> 512 TLBIVAs (2MB/4K), for a total of 8192 TLBIVAs (512*16) across the
> 16 chunks, causing huge overhead.
> 
> So instead use io_pgtable_tlb_flush_all() to invalidate the entire
> context if size (pgsize) is greater than the granule size (4K, 16K,
> 64K). For this example of a 32MB scatter-gather list unmap, that
> results in just 16 ASID-based TLB invalidations via the
> tlb_flush_all() callback (TLBIASID in the case of arm-smmu), as
> opposed to 8192 TLBIVAs, thereby drastically improving unmap
> performance.
> 
> The condition (size > granule size) is chosen for
> io_pgtable_tlb_flush_all() because, for any granule with supported
> pgsizes, we will have at least 512 TLB invalidations, for which
> tlb_flush_all() is already recommended. For example, with a 4K granule
> and a 2MB pgsize, a partial walk flush results in 512 TLBIVAs.
> 
> Tested on a QTI SM8150 SoC for 10 iterations of iommu_map_sg/unmap
> (average over 10 iterations):
> 
> Before this optimization:
> 
>      size        iommu_map_sg      iommu_unmap
>        4K            2.067 us         1.854 us
>       64K            9.598 us         8.802 us
>        1M          148.890 us       130.718 us
>        2M          305.864 us        67.291 us
>       12M         1793.604 us       390.838 us
>       16M         2386.848 us       518.187 us
>       24M         3563.296 us       775.989 us
>       32M         4747.171 us      1033.364 us
> 
> After this optimization:
> 
>      size        iommu_map_sg      iommu_unmap
>        4K            1.723 us         1.765 us
>       64K            9.880 us         8.869 us
>        1M          155.364 us       135.223 us
>        2M          303.906 us         5.385 us
>       12M         1786.557 us        21.250 us
>       16M         2391.890 us        27.437 us
>       24M         3570.895 us        39.937 us
>       32M         4755.234 us        51.797 us
> 
> This is further reduced once map/unmap_pages() support lands, which
> will result in just 1 tlb_flush_all() as opposed to 16.
> 
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>   drivers/iommu/io-pgtable-arm.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 87def58e79b5..c3cb9add3179 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   
>   		if (!iopte_leaf(pte, lvl, iop->fmt)) {
>   			/* Also flush any partial walks */
> -			io_pgtable_tlb_flush_walk(iop, iova, size,
> -						  ARM_LPAE_GRANULE(data));
> +			if (size > ARM_LPAE_GRANULE(data))
> +				io_pgtable_tlb_flush_all(iop);
> +			else

Erm, when will the above condition ever not be true? ;)
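
For reference: this branch in __arm_lpae_unmap() is only reached for a
non-leaf (table) entry, where size covers everything that one table
entry maps, i.e. at least 512 granules (2MB for a 4K granule, 32MB for
16K, 512MB for 64K), so the size check can never be false. A minimal
sketch of what the posted change then effectively collapses to (an
illustration of the point above, not the eventual patch):

		if (!iopte_leaf(pte, lvl, iop->fmt)) {
			/*
			 * Also flush any partial walks: size spans a whole
			 * next-level table here, so this always asks for a
			 * full-context invalidation.
			 */
			io_pgtable_tlb_flush_all(iop);
			ptep = iopte_deref(pte, data);
			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
		}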

Taking a step back, though, what about the impact to drivers other than 
SMMUv2? In particular I'm thinking of SMMUv3.2 where the whole range can 
be invalidated by VA in a single command anyway, so the additional 
penalties of TLBIALL are undesirable.

Robin.

> +				io_pgtable_tlb_flush_walk(iop, iova, size,
> +							  ARM_LPAE_GRANULE(data));
>   			ptep = iopte_deref(pte, data);
>   			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
>   		} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
> 


* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]   ` <c600e9b2534d54082a5272b508a7985f@codeaurora.org>
@ 2021-06-10  9:08     ` Robin Murphy
       [not found]       ` <12067ffb8243b220cf03e83aaac3e823@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2021-06-10  9:08 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: linux-arm-msm, linux-kernel, iommu, Will Deacon, linux-arm-kernel

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:
> Hi Robin,
> 
> On 2021-06-10 00:14, Robin Murphy wrote:
>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
>>> [...]
>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct 
>>> arm_lpae_io_pgtable *data,
>>>             if (!iopte_leaf(pte, lvl, iop->fmt)) {
>>>               /* Also flush any partial walks */
>>> -            io_pgtable_tlb_flush_walk(iop, iova, size,
>>> -                          ARM_LPAE_GRANULE(data));
>>> +            if (size > ARM_LPAE_GRANULE(data))
>>> +                io_pgtable_tlb_flush_all(iop);
>>> +            else
>>
>> Erm, when will the above condition ever not be true? ;)
>>
> 
> Ah right, silly me :)
> 
>> Taking a step back, though, what about the impact to drivers other
>> than SMMUv2?
> 
> Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
> thing as arm-smmu-v2 (page-based invalidations); then there is
> ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.
> 
>> In particular I'm thinking of SMMUv3.2 where the whole
>> range can be invalidated by VA in a single command anyway, so the
>> additional penalties of TLBIALL are undesirable.
>>
> 
> Right, so I am thinking we can have a new generic quirk
> IO_PGTABLE_QUIRK_RANGE_INV to choose between range-based invalidations
> (tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
> can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and have something like
> below; thoughts?
> 
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
>          io_pgtable_tlb_flush_walk(iop, iova, size,
>                                    ARM_LPAE_GRANULE(data));
> else
>          io_pgtable_tlb_flush_all(iop);

The design here has always been that io-pgtable says *what* needs 
invalidating, and we left it up to the drivers to decide exactly *how*. 
Even though things have evolved a bit I don't think that has 
fundamentally changed - tlb_flush_walk is now only used in this one 
place (technically I suppose it could be renamed tlb_flush_table but 
it's not worth the churn), so drivers can implement their own preferred 
table-invalidating behaviour even more easily than choosing whether to 
bounce a quirk through the common code or not. Consider what you've 
already seen for the Renesas IPMMU, or SMMUv1 stage 2...
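
As a concrete sketch of that driver-side freedom, roughly following the
shape of the arm-smmu stage-1 callbacks of this era (illustrative, not
a posted patch):

static void arm_smmu_tlb_inv_walk_s1(unsigned long iova, size_t size,
				     size_t granule, void *cookie)
{
	/*
	 * A whole table worth of mappings is going away; on an
	 * implementation where that would mean hundreds of TLBIVAs,
	 * one TLBIASID (context invalidation) is the cheaper choice.
	 */
	arm_smmu_tlb_inv_context_s1(cookie);
}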

I'm instinctively a little twitchy about making this a blanket 
optimisation for SMMUv2 since I still remember the palaver with our 
display and MMU-500 integrations, where it had to implement the dodgy 
"prefetch" register to trigger translations before scanning out a frame 
since it couldn't ever afford a TLB miss, thus TLBIALL when freeing an 
old buffer would be a dangerous hammer to swing. However IIRC it also 
had to ensure everything was mapped as 2MB blocks to guarantee fitting 
everything in the TLBs in the first place, so I guess it would still 
work out OK due to never realistically unmapping a whole table at once 
anyway.

Cheers,
Robin.


* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]       ` <12067ffb8243b220cf03e83aaac3e823@codeaurora.org>
@ 2021-06-10 11:33         ` Robin Murphy
  2021-06-10 12:03           ` Thierry Reding
       [not found]           ` <dfdabcdec99a4c6e3bf2b3c5eebe067f@codeaurora.org>
  0 siblings, 2 replies; 10+ messages in thread
From: Robin Murphy @ 2021-06-10 11:33 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: linux-arm-msm, linux-kernel, iommu, Thierry Reding, Will Deacon,
	linux-arm-kernel

On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
> Hi Robin,
> 
> On 2021-06-10 14:38, Robin Murphy wrote:
>> On 2021-06-10 06:24, Sai Prakash Ranjan wrote:
>>> Hi Robin,
>>>
>>> On 2021-06-10 00:14, Robin Murphy wrote:
>>> [...]
>>
>> The design here has always been that io-pgtable says *what* needs
>> invalidating, and we left it up to the drivers to decide exactly
>> *how*. Even though things have evolved a bit I don't think that has
>> fundamentally changed - tlb_flush_walk is now only used in this one
>> place (technically I suppose it could be renamed tlb_flush_table but
>> it's not worth the churn), so drivers can implement their own
>> preferred table-invalidating behaviour even more easily than choosing
>> whether to bounce a quirk through the common code or not. Consider
>> what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2...
>>
> 
> Thanks for the explanation, makes sense. If I am not mistaken, I see
> that you are suggesting to move this logic based on size and
> granule-size to the arm-smmu-v2 driver, and one more thing below...

Simpler than that - following on from my original comment above, 
tlb_flush_walk already knows it's invalidating at least one full level 
of table so there's nothing it even needs to check. Adding a size-based 
heuristic to arm_smmu_inv_range_* for leaf invalidations would be a 
separate concern (note that changing the non-leaf behaviour might allow 
cleaning up the "reg" indirection there too).
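
A leaf-side heuristic of the kind distinguished here might look roughly
like the following; this is purely illustrative, the threshold is
arbitrary, and tlb_inv_context()/tlb_inv_va() are placeholder names,
not the real driver API:

static void tlb_inv_leaf_heuristic(unsigned long iova, size_t size,
				   size_t granule, void *cookie)
{
	unsigned long addr;

	if (size / granule > 128) {		/* arbitrary cut-off */
		tlb_inv_context(cookie);	/* e.g. TLBIASID */
		return;
	}
	for (addr = iova; addr < iova + size; addr += granule)
		tlb_inv_va(addr, cookie);	/* e.g. TLBIVA */
}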

>> I'm instinctively a little twitchy about making this a blanket
>> optimisation for SMMUv2 since I still remember the palaver with our
>> display and MMU-500 integrations, where it had to implement the dodgy
>> "prefetch" register to trigger translations before scanning out a
>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when
>> freeing an old buffer would be a dangerous hammer to swing. However
>> IIRC it also had to ensure everything was mapped as 2MB blocks to
>> guarantee fitting everything in the TLBs in the first place, so I
>> guess it would still work out OK due to never realistically unmapping
>> a whole table at once anyway.
>>
> 
> You are also hinting to not do this for all SMMUv2 implementations and make
> it QCOM specific?

No, I'm really just wary that the performance implication is more 
complex than a simple unmap latency benefit, possibly even for QCOM. 
Consider the access latency, power and memory bandwidth hit from all the 
additional pagetable walks incurred by other ongoing traffic fighting 
against those 16 successive TLBIASIDs. Whether it's an overall win 
really depends on the specific workload and system conditions as much as 
the SMMU implementation. Thinking some more, I wonder if the Tegra folks 
might have an opinion to add here, given that their multiple-SMMU 
solution was seemingly about trying to get enough TLB and pagetable walk 
bandwidth in the first place?

Robin.


* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
  2021-06-10 11:33         ` Robin Murphy
@ 2021-06-10 12:03           ` Thierry Reding
       [not found]           ` <dfdabcdec99a4c6e3bf2b3c5eebe067f@codeaurora.org>
  1 sibling, 0 replies; 10+ messages in thread
From: Thierry Reding @ 2021-06-10 12:03 UTC (permalink / raw)
  To: Robin Murphy, Krishna Reddy
  Cc: Sai Prakash Ranjan, linux-arm-msm, linux-kernel, iommu,
	Will Deacon, linux-arm-kernel



On Thu, Jun 10, 2021 at 12:33:56PM +0100, Robin Murphy wrote:
> [...]
> No, I'm really just wary that the performance implication is more complex
> than a simple unmap latency benefit, possibly even for QCOM. Consider the
> access latency, power and memory bandwidth hit from all the additional
> pagetable walks incurred by other ongoing traffic fighting against those 16
> successive TLBIASIDs. Whether it's an overall win really depends on the
> specific workload and system conditions as much as the SMMU implementation.
> Thinking some more, I wonder if the Tegra folks might have an opinion to add
> here, given that their multiple-SMMU solution was seemingly about trying to
> get enough TLB and pagetable walk bandwidth in the first place?

Yes, so Tegra194 has three different instances of the SMMU. Two of them
are programmed in an interleaved mode to basically double the bandwidth
available. A third instance is specifically reserved for isochronous
memory clients and is used by the display controller to avoid potential
pressure on the dual-SMMU from interfering with display functionality.

I'm not sure if we've ever measured the impact of map/unmap operations
under heavy load. Krishna, do you know if this might be helpful for some
of the use-cases we have on Tegra? Or if it might negatively impact
performance under pressure?

Thierry


* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]           ` <dfdabcdec99a4c6e3bf2b3c5eebe067f@codeaurora.org>
@ 2021-06-10 15:29             ` Robin Murphy
  2021-06-11  0:37               ` Krishna Reddy
  0 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2021-06-10 15:29 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: linux-arm-msm, linux-kernel, iommu, Thierry Reding, Will Deacon,
	linux-arm-kernel

On 2021-06-10 12:54, Sai Prakash Ranjan wrote:
> Hi Robin,
> 
> On 2021-06-10 17:03, Robin Murphy wrote:
>> On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
>>> [...]
>>> Thanks for the explanation, makes sense. If I am not mistaken, I see
>>> that you are suggesting to move this logic based on size and
>>> granule-size to the arm-smmu-v2 driver, and one more thing below...
>>
>> Simpler than that - following on from my original comment above,
>> tlb_flush_walk already knows it's invalidating at least one full level
>> of table so there's nothing it even needs to check. Adding a
>> size-based heuristic to arm_smmu_inv_range_* for leaf invalidations
>> would be a separate concern (note that changing the non-leaf behaviour
>> might allow cleaning up the "reg" indirection there too).
> 
> Right, sorry I didn't mean to mention the size check as it was obvious
> from your first reply, but rather just calling impl->tlb_inv() in
> arm_smmu_tlb_inv_walk_s1().
> 
>>
>>>> I'm instinctively a little twitchy about making this a blanket
>>>> optimisation for SMMUv2 since I still remember the palaver with our
>>>> display and MMU-500 integrations, where it had to implement the dodgy
>>>> "prefetch" register to trigger translations before scanning out a
>>>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when
>>>> freeing an old buffer would be a dangerous hammer to swing. However
>>>> IIRC it also had to ensure everything was mapped as 2MB blocks to
>>>> guarantee fitting everything in the TLBs in the first place, so I
>>>> guess it would still work out OK due to never realistically unmapping
>>>> a whole table at once anyway.
>>>>
>>>
>>> You are also hinting to not do this for all SMMUv2 implementations 
>>> and make
>>> it QCOM specific?
>>
>> No, I'm really just wary that the performance implication is more
>> complex than a simple unmap latency benefit, possibly even for QCOM.
>> Consider the access latency, power and memory bandwidth hit from all
>> the additional pagetable walks incurred by other ongoing traffic
>> fighting against those 16 successive TLBIASIDs. Whether it's an
>> overall win really depends on the specific workload and system
>> conditions as much as the SMMU implementation.
> 
> No, the unmap latency is not just seen in some test case we wrote; the
> issue is very real, and we have workloads where the camera is reporting
> frame drops because of this unmap latency, on the order of 100s of
> milliseconds. And the hardware team recommends using ASID-based
> invalidations for anything larger than 128 TLB entries. So yes, we have
> taken note of the impacts here before going this way, and hence feel
> more inclined to make this qcom-specific if required.

OK, that's good to know. I never suggested that CPU unmap latency wasn't 
a valid concern in itself - obviously spending millions of cycles in, 
say, an interrupt handler doing pointless busy work has some serious 
downsides - just that it might not always be the most important concern 
for everyone, so I wanted to make sure this discussion was had in the open.

TBH I *am* inclined to make this a core SMMU driver change provided 
nobody pops up with a strong counter-argument.

>> Thinking some more, I
>> wonder if the Tegra folks might have an opinion to add here, given
>> that their multiple-SMMU solution was seemingly about trying to get
>> enough TLB and pagetable walk bandwidth in the first place?
>>
> 
> Sure but I do not see how that will help with the unmap latency?

It won't. However it implies a use-case which is already sensitive to 
translation bandwidth, and thus is somewhat more likely to be sensitive 
to over-invalidation. But even then they also have more to gain from 
reducing the number of MMIO writes that have to be duplicated :)

Thanks,
Robin.


* RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
  2021-06-10 15:29             ` Robin Murphy
@ 2021-06-11  0:37               ` Krishna Reddy
       [not found]                 ` <07001b4ed6c0a491eacce6e4dc13ab5e@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Krishna Reddy @ 2021-06-11  0:37 UTC (permalink / raw)
  To: Robin Murphy, Sai Prakash Ranjan
  Cc: linux-arm-msm, linux-kernel, iommu, Will Deacon,
	linux-arm-kernel, Thierry Reding

> > No, the unmap latency is not just seen in some test case we wrote;
> > the issue is very real, and we have workloads where the camera is
> > reporting frame drops because of this unmap latency, on the order of
> > 100s of milliseconds. And the hardware team recommends using
> > ASID-based invalidations for anything larger than 128 TLB entries. So
> > yes, we have taken note of the impacts here before going this way,
> > and hence feel more inclined to make this qcom-specific if required.

Seems like the real issue here is not the unmap API latency itself; it
should be the high number of back-to-back SMMU TLB invalidate register
writes that is resulting in lower ISO BW to the camera and overflow.
Isn't it?
Even the Tegra186 SoC has a similar issue, and the HW team recommended
rate-limiting the number of back-to-back SMMU TLB invalidate register
writes. The subsequent Tegra194 SoC has a dedicated SMMU for ISO
clients to avoid the impact of TLB invalidates from NISO clients on
ISO BW.

>> Thinking some more, I
>> wonder if the Tegra folks might have an opinion to add here, given 
>> that their multiple-SMMU solution was seemingly about trying to get 
>> enough TLB and pagetable walk bandwidth in the first place?

While it is good to reduce the number of TLB register writes,
arbitrarily flushing all TLB entries at context granularity can have a
negative impact on active traffic and BW. I don't have much data on the
possible impact at this point.
Can the flushing at context granularity be made a quirk rather than the
default?

-KR


* RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]                 ` <07001b4ed6c0a491eacce6e4dc13ab5e@codeaurora.org>
@ 2021-06-11 16:49                   ` Krishna Reddy
       [not found]                     ` <f749ba0957b516ab5f0ea57033d308c7@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Krishna Reddy @ 2021-06-11 16:49 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Robin Murphy, linux-arm-msm, linux-kernel, iommu, Will Deacon,
	linux-arm-kernel, Thierry Reding

Hi Sai,
> >> > No, the unmap latency is not just seen in some test case we wrote;
> >> > the issue is very real, and we have workloads where the camera is
> >> > reporting frame drops because of this unmap latency, on the order
> >> > of 100s of milliseconds.

> Not exactly; this issue is not specific to the camera. If you look at
> the numbers in the commit text, even for the test device it's the same
> observation. It depends on the buffer size we are unmapping, which
> affects the number of TLBIs issued. I am not aware of any such HW-side
> BW issues for the camera specifically on QCOM devices.

It is clear that reducing the number of TLBIs reduces the unmap API
latency. But it comes at the expense of throwing away valid TLB entries.
Quantifying the impact of arbitrarily invalidating valid TLB entries at
context level is not straightforward and is use-case dependent. The
side effects might be rare or won't be known until they are noticed.
Can you provide more details on how the unmap latency is causing the
camera to drop frames? Is unmap performed in the perf path?
If unmap is queued and performed on a background thread, would it
resolve the frame drops?

-KR



* RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]                     ` <f749ba0957b516ab5f0ea57033d308c7@codeaurora.org>
@ 2021-06-14 17:48                       ` Krishna Reddy
       [not found]                         ` <5eb5146ab51a8fe0b558680d479a26cd@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Krishna Reddy @ 2021-06-14 17:48 UTC (permalink / raw)
  To: Sai Prakash Ranjan, Robin Murphy
  Cc: linux-arm-msm, linux-kernel, iommu, Will Deacon,
	linux-arm-kernel, Thierry Reding

> Right, but we won't know until we profile the specific usecases or try
> them in a generic workload to see if they affect the performance. Sure,
> over-invalidation is a concern where multiple buffers can be mapped to
> the same context and the cache is not usable for lookups at the time,
> but we don't do it for small buffers, only for large buffers, which
> means thousands of TLB entry mappings, in which case TLBIASID is
> preferred (note: I mentioned the HW team recommendation to use it for
> anything greater than 128 TLB entries in my earlier reply). And also
> note that we do this only for the partial walk flush; we are not
> arbitrarily changing all the TLBIs to ASID-based.

Most of the heavy-BW use cases do involve processing larger buffers.
When the physical memory is allocated discontiguously at page_size
(let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
would involve performing a TLBIASID, as 2MB is not a leaf. Essentially,
it happens all the time during large buffer unmaps and potentially
impacts active traffic on other large buffers. Depending on how much
latency HW engines can absorb, the overflow/underflow issues for ISO
engines can be sporadic and vendor specific.
Performing TLBIASID as the default for all SoCs is not a safe
operation.


> I am no camera expert, but from what the camera team mentioned, there
> is a thread which frees memory (large unused memory buffers)
> periodically, which ends up taking around 100+ms and causing some
> camera test failures with frame drops. Parallel efforts are already
> being made to optimize this usage of the thread, but as I mentioned
> previously, this is *not camera specific*; let's say someone else
> invokes such large unmaps, they are going to face the same issue.

From the above, it doesn't look like the root cause of the frame drops
is fully understood. Why is a 100+ms delay causing camera frame drops?
Is the same thread submitting the buffers to the camera after unmap is
complete? If not, how is the unmap latency causing an issue here?
 

> > If unmap is queued and performed on a background thread, would it
> > resolve the frame drops?
> 
> Not sure I understand what you mean by queuing on a background thread,
> but with that or not, we still do the same number of TLBIs and hop
> through iommu->io-pgtable->arm-smmu to perform the unmap, so how will
> that help?

I mean adding the unmap requests into a queue and processing them from
a different thread. It is not to reduce the TLBIs, but to avoid
blocking subsequent buffer allocations and IOVA map requests, if they
are being requested from the same thread that is performing the unmap.
If unmap is already performed from a different thread, then the issue
still needs to be root-caused to understand it fully. Check for any
serialization issues.
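
For illustration, a minimal sketch of that queued-unmap idea; the
struct and function names are made up, and a real implementation would
also have to ensure the IOVA range is not reused until the deferred
unmap has completed:

#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_unmap {
	struct work_struct work;
	struct iommu_domain *domain;
	unsigned long iova;
	size_t size;
};

static void deferred_unmap_fn(struct work_struct *work)
{
	struct deferred_unmap *d = container_of(work, struct deferred_unmap,
						work);

	/* The TLBIs now happen off the caller's hot path. */
	iommu_unmap(d->domain, d->iova, d->size);
	kfree(d);
}

static int queue_unmap(struct iommu_domain *domain, unsigned long iova,
		       size_t size)
{
	struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_KERNEL);

	if (!d)
		return -ENOMEM;
	d->domain = domain;
	d->iova = iova;
	d->size = size;
	INIT_WORK(&d->work, deferred_unmap_fn);
	schedule_work(&d->work);
	return 0;
}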


-KR


* Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]                         ` <5eb5146ab51a8fe0b558680d479a26cd@codeaurora.org>
@ 2021-06-15 13:53                           ` Robin Murphy
       [not found]                             ` <8535b6c757a5584b495f135f4377053c@codeaurora.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2021-06-15 13:53 UTC (permalink / raw)
  To: Sai Prakash Ranjan, Krishna Reddy
  Cc: linux-arm-msm, linux-kernel, iommu, Will Deacon,
	linux-arm-kernel, Thierry Reding

On 2021-06-15 12:51, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> On 2021-06-14 23:18, Krishna Reddy wrote:
>>> [...]
>>
>> Most of the heavy-BW use cases do involve processing larger buffers.
>> When the physical memory is allocated discontiguously at page_size
>> (let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
>> would involve performing a TLBIASID, as 2MB is not a leaf.
>> Essentially, it happens all the time during large buffer unmaps and
>> potentially impacts active traffic on other large buffers. Depending
>> on how much latency HW engines can absorb, the overflow/underflow
>> issues for ISO engines can be sporadic and vendor specific.
>> Performing TLBIASID as the default for all SoCs is not a safe
>> operation.
>>
> 
> OK, so what I gather from this is that it's not easy to test for the
> negative impact, you don't have data on it yet, and the behaviour is
> very vendor specific. To add on the qcom impl, we have several
> performance improvements for TLB cache invalidations in HW, like
> wait-for-safe (for realtime clients such as camera and display) and a
> few others that allow cache lookups/updates while TLBI is in progress
> for the same context bank, so at least we are good here.
> 
>>
>>> [...]
>>
>> From the above, it doesn't look like the root cause of the frame
>> drops is fully understood. Why is a 100+ms delay causing camera frame
>> drops? Is the same thread submitting the buffers to the camera after
>> unmap is complete? If not, how is the unmap latency causing an issue
>> here?
>>
> 
> OK, since you are interested in the camera use case, I have requested
> more details from the camera team and will share them once they come
> back. However, I don't think it's good to have this unmap latency at
> all, and that is what is being addressed by this patch.
> 
>>
>>> [...]
>>
>> I mean adding the unmap requests into a queue and processing them
>> from a different thread. It is not to reduce the TLBIs, but to avoid
>> blocking subsequent buffer allocations and IOVA map requests, if they
>> are being requested from the same thread that is performing the
>> unmap. If unmap is already performed from a different thread, then
>> the issue still needs to be root-caused to understand it fully. Check
>> for any serialization issues.
>>
> 
> This patch is to optimize unmap latency because of the large number of
> MMIO writes (TLBIVAs) wasting CPU cycles, and not to fix the camera
> issue, which can probably be solved by parallelization. It seems to me
> like you are OK with the unmap latency in general, which we are not,
> and we want to avoid that latency.
> 
> Hi @Robin, from these discussions it seems they are not OK with the
> change for all SoC vendor implementations and do not have any data on
> such an impact. As I mentioned above, on QCOM platforms we do have
> several optimizations in HW for TLBIs and would like to make use of
> them to reduce the unmap latency. What do you think, should this be
> made implementation specific?

Yes, it sounds like there's enough uncertainty for now that this needs 
to be an opt-in feature. However, I still think that non-strict mode 
could use it generically, since that's all about over-invalidating to 
save time on individual unmaps - and relatively non-deterministic - already.

So maybe we have a second set of iommu_flush_ops, or just a flag 
somewhere to control the tlb_flush_walk functions internally, and the 
choice can be made in the iommu_get_dma_strict() test, but also forced 
on all the time by your init_context hook. What do you reckon?
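
As a sketch of the impl-side half of that idea (hypothetical: the quirk
name here is the one proposed in the reply below, not an existing
flag):

static int qcom_smmu_init_context(struct arm_smmu_domain *smmu_domain,
				  struct io_pgtable_cfg *pgtbl_cfg,
				  struct device *dev)
{
	/* Always prefer a full-context invalidation for table walks
	 * on this implementation. */
	pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_TLB_INV;
	return 0;
}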

Robin.


* RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
       [not found]                               ` <d9226f4349c445c6ca63dc632b29e3e0@codeaurora.org>
@ 2021-06-17 21:18                                 ` Krishna Reddy
  0 siblings, 0 replies; 10+ messages in thread
From: Krishna Reddy @ 2021-06-17 21:18 UTC (permalink / raw)
  To: Sai Prakash Ranjan, Robin Murphy
  Cc: linux-arm-msm, linux-kernel, iommu, Will Deacon,
	linux-arm-kernel, Thierry Reding

> Instead of flush_ops in the init_context hook, perhaps an io_pgtable
> quirk, since this is related to the TLB (probably a bad name, but
> IO_PGTABLE_QUIRK_TLB_INV), which will be set in the init_context impl
> hook, and the previous condition in io_pgtable_tlb_flush_walk() becomes
> something like below. Seems very minimal and neat instead of poking
> into the tlb_flush_walk functions or touching dma-strict with some
> flag?
> 
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
>      iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
>          iop->cfg.tlb->tlb_flush_all(iop->cookie);
>          return;
> }

Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID?

-KR

