From: Robin Murphy <robin.murphy@arm.com>
To: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>,
	Krishna Reddy <vdumpa@nvidia.com>
Cc: linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org,
	iommu@lists.linux-foundation.org, Will Deacon <will@kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	Thierry Reding <treding@nvidia.com>
Subject: Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
Date: Tue, 15 Jun 2021 14:53:45 +0100	[thread overview]
Message-ID: <da62ff1c-9b49-34d3-69a1-1a674e4a30f7@arm.com> (raw)
In-Reply-To: <5eb5146ab51a8fe0b558680d479a26cd@codeaurora.org>

On 2021-06-15 12:51, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> On 2021-06-14 23:18, Krishna Reddy wrote:
>>> Right, but we won't know until we profile the specific usecases or
>>> try them in a generic workload to see if they affect performance.
>>> Sure, over-invalidation is a concern where multiple buffers can be
>>> mapped to the same context and the cache is not usable for lookups
>>> in the meantime, but we don't do it for small buffers, only for
>>> large buffers, which means thousands of TLB entry mappings, in which
>>> case TLBIASID is preferred (note: I mentioned the HW team's
>>> recommendation to use it for anything greater than 128 TLB entries
>>> in my earlier reply). Also note that we do this only for the partial
>>> walk flush; we are not arbitrarily changing all the TLBIs to
>>> ASID-based ones.
>>
>> Most of the heavy-bandwidth use cases do involve processing larger
>> buffers. When the physical memory is allocated discontiguously at
>> page_size granularity (let's use 4KB here), the IOVA unmap of each
>> aligned 2MB chunk would involve performing a TLBIASID, as 2MB is not
>> a leaf. Essentially, it happens all the time during large buffer
>> unmaps and potentially impacts active traffic on other large buffers.
>> Depending on how much latency the HW engines can absorb, the
>> overflow/underflow issues for ISO engines can be sporadic and
>> vendor-specific. Performing TLBIASID as the default for all SoCs is
>> not a safe operation.
>>
> 
> OK, so what I gather from this is that it's not easy to test for the
> negative impact, you don't have data on it yet, and the behaviour is
> very vendor-specific. To add on the qcom implementation: we have
> several performance improvements for TLB cache invalidations in HW,
> like wait-for-safe (for realtime clients such as camera and display)
> and a few others that allow cache lookups/updates while a TLBI is in
> progress for the same context bank, so at least we are good here.
> 
>>
>>> I am no camera expert, but what the camera team mentioned is that
>>> there is a thread which periodically frees memory (large unused
>>> memory buffers), which ends up taking around 100+ms and causing some
>>> camera test failures with frame drops. Parallel efforts are already
>>> being made to optimize this use of the thread, but as I mentioned
>>> previously, this is *not camera specific*; if someone else invokes
>>> such large unmaps, they are going to face the same issue.
>>
>> From the above, it doesn't look like the root cause of the frame
>> drops is fully understood. Why is a 100+ms delay causing camera frame
>> drops? Is the same thread submitting the buffers to the camera after
>> the unmap is complete? If not, how is the unmap latency causing an
>> issue here?
>>
> 
> OK, since you are interested in the camera usecase, I have requested
> more details from the camera team and will share them once they come
> back. However, I don't think it's good to have the unmap latency at
> all, and that is what this patch addresses.
> 
>>
>>> > If unmap is queued and performed on a background thread, would it
>>> > resolve the frame drops?
>>>
>>> Not sure I understand what you mean by queuing on a background
>>> thread, but with that or not, we still do the same number of TLBIs
>>> and hop through iommu->io-pgtable->arm-smmu to perform the unmap, so
>>> how will that help?
>>
>> I mean adding the unmap requests to a queue and processing them from
>> a different thread. It is not to reduce the TLBIs, but to avoid
>> blocking subsequent buffer allocations and IOVA map requests when
>> they are issued from the same thread that is performing the unmap. If
>> the unmap is already performed from a different thread, then the
>> issue still needs to be root-caused to understand it fully; check for
>> any serialization issues.
>>
> 
> This patch is to optimize the unmap latency caused by the large
> number of MMIO writes (TLBIVAs) wasting CPU cycles, not to fix the
> camera issue, which can probably be solved by parallelization. It
> seems to me that you are OK with the unmap latency in general, which
> we are not, and we want to avoid that latency.
> 
> Hi @Robin, from these discussions it seems they are not OK with the
> change for all SoC vendor implementations and do not have any data on
> its impact. As I mentioned above, on QCOM platforms we do have
> several optimizations in HW for TLBIs and would like to make use of
> them to reduce the unmap latency. What do you think, should this be
> made implementation-specific?

Yes, it sounds like there's enough uncertainty for now that this needs
to be an opt-in feature. However, I still think that non-strict mode
could use it generically, since that mode is already all about
over-invalidating to save time on individual unmaps, and relatively
non-deterministic to boot.
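
For context, that's because non-strict mode doesn't invalidate
per-unmap at all: freed IOVAs sit on a flush queue, and the whole
domain gets invalidated in one go when the queue drains. A simplified
sketch of the dma-iommu flush-queue callback, from memory, so treat the
details as approximate:

	/* Called when the IOVA flush queue drains (timeout or full):
	 * one invalidation covers every deferred unmap, so per-unmap
	 * precision is already gone in non-strict mode. */
	static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
	{
		struct iommu_dma_cookie *cookie;
		struct iommu_domain *domain;

		cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
		domain = cookie->fq_domain;

		/* For an arm-smmu stage 1 context bank this amounts
		 * to a single TLBIASID. */
		domain->ops->flush_iotlb_all(domain);
	}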

So maybe we have a second set of iommu_flush_ops, or just a flag 
somewhere to control the tlb_flush_walk functions internally, and the 
choice can be made in the iommu_get_dma_strict() test, but also forced 
on all the time by your init_context hook. What do you reckon?
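
Concretely, the flag variant might look something like the sketch
below. Note that the field name flush_walk_prefer_tlbiasid and the
exact plumbing are made up here purely for illustration:

	/* Hypothetical per-domain flag: prefer a single ASID-wide
	 * invalidation over walking the range with per-granule TLBIVAs
	 * when tearing down a non-leaf (table) entry. */
	static void arm_smmu_tlb_inv_walk_s1(unsigned long iova, size_t size,
					     size_t granule, void *cookie)
	{
		struct arm_smmu_domain *smmu_domain = cookie;

		if (smmu_domain->flush_walk_prefer_tlbiasid)	/* hypothetical */
			arm_smmu_tlb_inv_context_s1(cookie);	/* TLBIASID */
		else
			arm_smmu_tlb_inv_range_s1(iova, size, granule, cookie,
						  ARM_SMMU_CB_S1_TLBIVA);
	}

	/* Decided once per domain: set for non-strict domains in the
	 * iommu_get_dma_strict() test, or forced on unconditionally by
	 * a vendor init_context hook. */
	smmu_domain->flush_walk_prefer_tlbiasid = !iommu_get_dma_strict(domain);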

Robin.

Thread overview:
2021-06-09 14:53 [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list Sai Prakash Ranjan
2021-06-09 18:44 ` Robin Murphy
2021-06-10  5:24   ` Sai Prakash Ranjan
2021-06-10  9:08     ` Robin Murphy
2021-06-10  9:36       ` Sai Prakash Ranjan
2021-06-10 11:33         ` Robin Murphy
2021-06-10 11:54           ` Sai Prakash Ranjan
2021-06-10 15:29             ` Robin Murphy
2021-06-10 15:51               ` Sai Prakash Ranjan
2021-06-11  0:37               ` Krishna Reddy
2021-06-11  0:54                 ` Sai Prakash Ranjan
2021-06-11 16:49                   ` Krishna Reddy
2021-06-12  2:46                     ` Sai Prakash Ranjan
2021-06-14 17:48                       ` Krishna Reddy
2021-06-15 11:51                         ` Sai Prakash Ranjan
2021-06-15 13:53                           ` Robin Murphy [this message]
2021-06-16  6:58                             ` Sai Prakash Ranjan
2021-06-16  9:03                               ` Sai Prakash Ranjan
2021-06-17 21:18                                 ` Krishna Reddy
2021-06-18  2:47                                   ` Sai Prakash Ranjan
2021-06-18  4:04                           ` Sai Prakash Ranjan
2021-06-10 12:03           ` Thierry Reding

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=da62ff1c-9b49-34d3-69a1-1a674e4a30f7@arm.com \
    --to=robin.murphy@arm.com \
    --cc=iommu@lists.linux-foundation.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=saiprakash.ranjan@codeaurora.org \
    --cc=treding@nvidia.com \
    --cc=vdumpa@nvidia.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

  Be sure your reply has a Subject: header at the top and a blank
  line before the message body.