Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

From: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
To: saiprakash.ranjan@codeaurora.org, vdumpa@nvidia.com
Cc: iommu@lists.linux-foundation.org,
	linux-arm-kernel@lists.infradead.org,
	linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org,
	robin.murphy@arm.com, treding@nvidia.com, will@kernel.org
Subject: Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
Date: Fri, 18 Jun 2021 09:34:44 +0530	[thread overview]
Message-ID: <20210618040444.17270-1-saiprakash.ranjan@codeaurora.org> (raw)
In-Reply-To: <5eb5146ab51a8fe0b558680d479a26cd@codeaurora.org>

On 2021-06-15 17:21, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> On 2021-06-14 23:18, Krishna Reddy wrote:
>>> Right but we won't know until we profile the specific usecases or try them in
>>> generic workload to see if they affect the performance. Sure, over invalidation is
>>> a concern where multiple buffers can be mapped to same context and the cache
>>> is not usable at the time for lookup and such but we don't do it for small buffers
>>> and only for large buffers which means thousands of TLB entry mappings in
>>> which case TLBIASID is preferred (note: I mentioned the HW team
>>> recommendation to use it for anything greater than 128 TLB entries) in my
>>> earlier reply. And also note that we do this only for partial walk flush, we are not
>>> arbitrarily changing all the TLBIs to ASID based.
>>
>> Most of the heavy bw use cases does involve processing larger buffers.
>> When the physical memory is allocated dis-contiguously at page_size
>> (let's use 4KB here)
>> granularity, each aligned 2MB chunks IOVA unmap would involve
>> performing a TLBIASID
>> as 2MB is not a leaf. Essentially, It happens all the time during
>> large buffer unmaps and
>> potentially impact active traffic on other large buffers. Depending on how much
>> latency HW engines can absorb, the overflow/underflow issues for ISO
>> engines can be
>> sporadic and vendor specific.
>> Performing TLBIASID as default for all SoCs is not a safe operation.
>>
> 
> Ok so from what I gather from this is that its not easy to test for the
> negative impact and you don't have data on such yet and the behaviour is
> very vendor specific. To add on qcom impl, we have several performance
> improvements for TLB cache invalidations in HW like wait-for-safe(for realtime
> clients such as camera and display) and few others to allow for cache
> lookups/updates when TLBI is in progress for the same context bank, so atleast
> we are good here.
> 
>>
>>> I am no camera expert but from what the camera team mentioned is that there
>>> is a thread which frees memory(large unused memory buffers) periodically which
>>> ends up taking around 100+ms and causing some camera test failures with
>>> frame drops. Parallel efforts are already being made to optimize this usage of
>>> thread but as I mentioned previously, this is *not a camera specific*, lets say
>>> someone else invokes such large unmaps, it's going to face the same issue.
>>
>> From the above, It doesn't look like the root cause of frame drops is
>> fully understood.
>> Why is 100+ms delay causing camera frame drop?  Is the same thread
>> submitting the buffers
>> to camera after unmap is complete? If not, how is the unmap latency
>> causing issue here?
>>
> 
> Ok since you are interested in camera usecase, I have requested for more details
> from the camera team and will give it once they comeback. However I don't think
> its good to have unmap latency at all and that is being addressed by this patch.
> 

As promised, here are some more details shared by camera team:

Mapping of a framework buffer happens at the time of process request and unmapping
of a framework buffer happens once the buffer is available from hardware and result
will be notified to camera framework.
 * When there is a delay in unmapping of a buffer, result notification to framework
   will be delayed and based on pipeline delay depth, new requests from framework
   will be delayed.
 * Camera stack uses internal buffer managers for internal and framework buffers.
   While mapping and unmapping these managers will be accessed, so uses common lock
   and hence is a blocking call. So unmapping delay will cause the delay for mapping
   of a new request and leads to framedrop.

Map and unmap happens in the camera service process context. There is no separate perf
path to perform unmapping.

In Camera stack along with map/unmap delay, additional delays are due to HW. So HW should
be able to get the requests in time from SW to avoid frame drops.

Thanks,
Sai
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation