* [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-04-19 17:12 ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 18+ messages in thread
From: Ganapatrao Kulkarni @ 2018-04-19 17:12 UTC (permalink / raw)
To: joro, iommu, linux-kernel, robin.murphy
Cc: tomasz.nowicki, jnair, Robert.Richter, Vadim.Lomovtsev, Jan.Glauber, gklkml16

The performance drop is observed with long-hours iperf testing using
40G cards. It is mainly due to long iterations in finding a free iova
range in the 32-bit address space.

In the current implementation, for 64-bit PCI devices there is always a
first attempt to allocate an iova from the 32-bit address range (SAC is
preferred over DAC). Once the 32-bit range runs out, allocation falls
back to the higher range; thanks to the cached32_node optimization this
is not supposed to be painful. cached32_node always points to the most
recently allocated 32-bit node. When the address range is full, it
points to the last allocated node (a leaf node), so walking the rbtree
to find an available range is not an expensive affair. However, this
optimization does not behave well when one of the middle nodes is
freed. In that case cached32_node is updated to point to the next iova
range. The next iova allocation will consume that free range and again
update cached32_node to itself. From then on, walking over the 32-bit
range is more expensive.

This patch updates the cached node to the leaf node when there is no
free iova range left, which avoids unnecessarily long iterations.
Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 drivers/iommu/iova.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..e6ee2ea 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	} while (curr && new_pfn <= curr_iova->pfn_hi);
 
 	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
+		/* No cached node points to a free hole; update it to the leaf node.
+		 */
+		struct iova *prev_iova;
+
+		prev_iova = rb_entry(prev, struct iova, node);
+		__cached_rbnode_insert_update(iovad, prev_iova);
 		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 		return -ENOMEM;
 	}
-- 
2.9.4
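To make the walk cost described in the commit message concrete, here is a small user-space model of top-down range allocation with a cached scan-start node, in the spirit of cached32_node. All names (`toy_alloc`, `toy_free`, `trange`, ...) are invented for this sketch; it is not the kernel allocator, just an illustration of the pathology: when the space fills up normally the cache ends at the leaf and failed allocations are cheap, but after a middle node is freed and refilled, the cache is stranded mid-tree and every failed allocation re-walks everything below it.

```c
/* Illustrative user-space model of top-down range allocation with a
 * cached scan-start node (in the spirit of cached32_node).  All names
 * here are invented for this sketch; this is NOT the kernel code. */
#include <string.h>

struct trange { long lo, hi; };          /* an allocated range, inclusive */

static struct trange r[64];              /* sorted descending by address */
static int n;                            /* number of allocated ranges   */
static int cached;                       /* scan resumes below r[cached] */
static int visited;                      /* nodes walked by last alloc   */

/* Allocate `size` pfns top-down below `limit`; returns the start pfn,
 * or -1 on failure. */
static long toy_alloc(long size, long limit)
{
    long ceil = limit;
    int i = 0;

    if (n) {                             /* resume just below the cached node */
        ceil = r[cached].lo - 1;
        i = cached + 1;
    }
    visited = 0;
    for (;; i++) {
        long floor = (i < n) ? r[i].hi + 1 : 0;

        visited++;
        if (ceil + 1 >= floor + size) {  /* hole big enough: take its top */
            long lo = ceil - size + 1;

            memmove(&r[i + 1], &r[i], (n - i) * sizeof(*r));
            r[i] = (struct trange){ lo, ceil };
            n++;
            cached = i;                  /* remember where we allocated */
            return lo;
        }
        if (i >= n)
            return -1;                   /* walked off the bottom: no space */
        ceil = r[i].lo - 1;
    }
}

/* Free the range at index j and, like the kernel's delete-update, leave
 * the cache positioned so the new hole is found again immediately. */
static void toy_free(int j)
{
    memmove(&r[j], &r[j + 1], (n - 1 - j) * sizeof(*r));
    n--;
    cached = j > 0 ? j - 1 : 0;
}
```

In this model, the patch above corresponds to setting `cached` to the last node on the failure path of `toy_alloc()`, so repeated failures stay O(1) instead of re-walking the tail of the tree each time.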
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-04-23 16:37 ` Robin Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Robin Murphy @ 2018-04-23 16:37 UTC (permalink / raw)
To: Ganapatrao Kulkarni, joro, iommu, linux-kernel
Cc: tomasz.nowicki, jnair, Robert.Richter, Vadim.Lomovtsev, Jan.Glauber, gklkml16

On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
> The performance drop is observed with long-hours iperf testing using
> 40G cards. It is mainly due to long iterations in finding a free iova
> range in the 32-bit address space.
>
> In the current implementation, for 64-bit PCI devices there is always a
> first attempt to allocate an iova from the 32-bit address range (SAC is
> preferred over DAC). Once the 32-bit range runs out, allocation falls
> back to the higher range; thanks to the cached32_node optimization this
> is not supposed to be painful. cached32_node always points to the most
> recently allocated 32-bit node. When the address range is full, it
> points to the last allocated node (a leaf node), so walking the rbtree
> to find an available range is not an expensive affair. However, this
> optimization does not behave well when one of the middle nodes is
> freed. In that case cached32_node is updated to point to the next iova
> range. The next iova allocation will consume that free range and again
> update cached32_node to itself. From then on, walking over the 32-bit
> range is more expensive.
>
> This patch updates the cached node to the leaf node when there is no
> free iova range left, which avoids unnecessarily long iterations.

The only trouble with this is that "allocation failed" doesn't uniquely
mean "space full". Say that after some time the 32-bit space ends up
empty except for one page at 0x1000 and one at 0x80000000, then
somebody tries to allocate 2GB. If we move the cached node down to the
leftmost entry when that fails, all subsequent allocation attempts are
now going to fail despite the space being 99.9999% free!
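That counterexample works out as follows in concrete numbers (a sketch assuming 4 KiB pages, so pfn = address >> 12; all names below are invented for illustration):

```c
/* Worked numbers for the counterexample above, assuming 4 KiB pages
 * (pfn = address >> 12).  Purely illustrative arithmetic; the names
 * are invented for this sketch. */
#define PFN(addr) ((long)((addr) >> 12))

static const long top_pfn  = 0xfffffL;          /* 32-bit space: pfns 0..0xfffff */
static const long low_pfn  = PFN(0x1000L);      /* pinned page at 0x1000     -> pfn 1 */
static const long high_pfn = PFN(0x80000000L);  /* pinned page at 0x80000000 -> pfn 0x80000 */
static const long request  = PFN(0x80000000L);  /* the 2 GiB request, in pfns */

/* The biggest hole lies between the two pinned pages: two pfns short
 * of the request, so the 2 GiB allocation must fail. */
static long biggest_hole(void) { return high_pfn - low_pfn - 1; }

/* Yet overall free space is everything except the two pinned pages. */
static long total_free(void) { return (top_pfn + 1) - 2; }

/* If the failure then drags the cached node down to the leftmost entry
 * (pfn 1), only pfn 0 remains visible below the cache, so from then on
 * even a 2-pfn allocation fails. */
static long visible_below_cache(void) { return low_pfn; }
```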
I can see a couple of ways to solve that general problem of free space
above the cached node getting lost, but neither of them helps with the
case where there is genuinely insufficient space (and if anything would
make it even slower). In terms of the optimisation you want here, i.e.
fail fast when an allocation cannot possibly succeed, the only reliable
idea which comes to mind is free-PFN accounting. I might give that a go
myself to see how ugly it looks.

Robin.

> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> ---
>  drivers/iommu/iova.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 83fe262..e6ee2ea 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>  	} while (curr && new_pfn <= curr_iova->pfn_hi);
>  
>  	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
> +		/* No cached node points to a free hole; update it to the leaf node.
> +		 */
> +		struct iova *prev_iova;
> +
> +		prev_iova = rb_entry(prev, struct iova, node);
> +		__cached_rbnode_insert_update(iovad, prev_iova);
>  		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>  		return -ENOMEM;
>  	}
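A minimal sketch of the free-PFN accounting idea floated above (all names hypothetical, not kernel API; the eventual implementation may differ): the domain keeps a running count of free pfns, so a request larger than the total free space can be rejected in O(1), before walking the tree at all.

```c
/* Minimal sketch of free-PFN accounting.  The names (toy_domain,
 * may_allocate, ...) are hypothetical and invented for this sketch. */
struct toy_domain {
    long start_pfn, end_pfn;         /* inclusive pfn bounds */
    long free_pfns;                  /* maintained on every alloc/free */
};

static void toy_init(struct toy_domain *d, long start, long end)
{
    d->start_pfn = start;
    d->end_pfn = end;
    d->free_pfns = end - start + 1;  /* everything starts free */
}

/* Cheap O(1) check before any rbtree walk: if fewer pfns are free than
 * requested, the walk cannot possibly succeed. */
static int may_allocate(const struct toy_domain *d, long size)
{
    return size <= d->free_pfns;
}

static void account_alloc(struct toy_domain *d, long size)
{
    d->free_pfns -= size;
}

static void account_free(struct toy_domain *d, long size)
{
    d->free_pfns += size;
}
```

Note the converse does not hold: `may_allocate()` returning true only means the walk is worth attempting, since fragmentation can still make a large request fail after a full walk. That is exactly why this is a fail-fast filter for the genuinely-insufficient-space case rather than a replacement for the search.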
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-04-23 17:41 ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 18+ messages in thread
From: Ganapatrao Kulkarni @ 2018-04-23 17:41 UTC (permalink / raw)
To: Robin Murphy
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, linux-kernel, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber

On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>
>> The performance drop is observed with long-hours iperf testing using
>> 40G cards. It is mainly due to long iterations in finding a free iova
>> range in the 32-bit address space.
>>
>> In the current implementation, for 64-bit PCI devices there is always a
>> first attempt to allocate an iova from the 32-bit address range (SAC is
>> preferred over DAC). Once the 32-bit range runs out, allocation falls
>> back to the higher range; thanks to the cached32_node optimization this
>> is not supposed to be painful. cached32_node always points to the most
>> recently allocated 32-bit node. When the address range is full, it
>> points to the last allocated node (a leaf node), so walking the rbtree
>> to find an available range is not an expensive affair. However, this
>> optimization does not behave well when one of the middle nodes is
>> freed. In that case cached32_node is updated to point to the next iova
>> range. The next iova allocation will consume that free range and again
>> update cached32_node to itself. From then on, walking over the 32-bit
>> range is more expensive.
>>
>> This patch updates the cached node to the leaf node when there is no
>> free iova range left, which avoids unnecessarily long iterations.
>
> The only trouble with this is that "allocation failed" doesn't uniquely
> mean "space full". Say that after some time the 32-bit space ends up
> empty except for one page at 0x1000 and one at 0x80000000, then
> somebody tries to allocate 2GB. If we move the cached node down to the
> leftmost entry when that fails, all subsequent allocation attempts are
> now going to fail despite the space being 99.9999% free!
>
> I can see a couple of ways to solve that general problem of free space
> above the cached node getting lost, but neither of them helps with the
> case where there is genuinely insufficient space (and if anything would
> make it even slower). In terms of the optimisation you want here, i.e.
> fail fast when an allocation cannot possibly succeed, the only reliable
> idea which comes to mind is free-PFN accounting. I might give that a go
> myself to see how ugly it looks.

I see two problems in the current implementation:
1. We don't replenish the 32-bit range until the first attempt at a
second (64-bit) allocation fails.
2. The per-CPU cache might not yield a good hit rate on platforms with
a larger number of CPUs.

However, irrespective of those issues, it makes sense to update the
cached node as done in this patch when the current cached pointer fails
to yield an iova range; otherwise that failure forces unnecessary,
time-consuming do-while iterations until some replenishment happens!

thanks
Ganapat

> Robin.
>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>> ---
>>  drivers/iommu/iova.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..e6ee2ea 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>  	} while (curr && new_pfn <= curr_iova->pfn_hi);
>>  
>>  	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>> +		/* No cached node points to a free hole; update it to the leaf node.
>> +		 */
>> +		struct iova *prev_iova;
>> +
>> +		prev_iova = rb_entry(prev, struct iova, node);
>> +		__cached_rbnode_insert_update(iovad, prev_iova);
>>  		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>  		return -ENOMEM;
>>  	}
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-04-26 9:45 ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 18+ messages in thread
From: Ganapatrao Kulkarni @ 2018-04-26 9:45 UTC (permalink / raw)
To: Robin Murphy
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, linux-kernel, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber

Hi Robin,

On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>>
>>> The performance drop is observed with long-hours iperf testing using
>>> 40G cards. It is mainly due to long iterations in finding a free iova
>>> range in the 32-bit address space.
>>>
>>> In the current implementation, for 64-bit PCI devices there is always a
>>> first attempt to allocate an iova from the 32-bit address range (SAC is
>>> preferred over DAC). Once the 32-bit range runs out, allocation falls
>>> back to the higher range; thanks to the cached32_node optimization this
>>> is not supposed to be painful. cached32_node always points to the most
>>> recently allocated 32-bit node. When the address range is full, it
>>> points to the last allocated node (a leaf node), so walking the rbtree
>>> to find an available range is not an expensive affair. However, this
>>> optimization does not behave well when one of the middle nodes is
>>> freed. In that case cached32_node is updated to point to the next iova
>>> range. The next iova allocation will consume that free range and again
>>> update cached32_node to itself. From then on, walking over the 32-bit
>>> range is more expensive.
>>>
>>> This patch updates the cached node to the leaf node when there is no
>>> free iova range left, which avoids unnecessarily long iterations.
>>
>> The only trouble with this is that "allocation failed" doesn't uniquely
>> mean "space full". Say that after some time the 32-bit space ends up
>> empty except for one page at 0x1000 and one at 0x80000000, then
>> somebody tries to allocate 2GB. If we move the cached node down to the
>> leftmost entry when that fails, all subsequent allocation attempts are
>> now going to fail despite the space being 99.9999% free!
>>
>> I can see a couple of ways to solve that general problem of free space
>> above the cached node getting lost, but neither of them helps with the
>> case where there is genuinely insufficient space (and if anything would
>> make it even slower). In terms of the optimisation you want here, i.e.
>> fail fast when an allocation cannot possibly succeed, the only reliable
>> idea which comes to mind is free-PFN accounting. I might give that a go
>> myself to see how ugly it looks.

For this testing, a dual-port Intel 40G card (XL710) was used, with
both ports connected in loopback. I ran iperf servers and clients on
both ports (using NAT to route packets out on the intended ports). Ten
iperf clients were invoked every 60 seconds in a loop, for hours, on
each port. Initially the performance on both ports was close to line
rate; however, after the test had run for about 4 to 6 hours,
throughput dropped to very low values (a few hundred Mbps) on both
connections.

IMO this is a common bug that should occur on other platforms too, and
it needs to be fixed at the earliest.
Please let me know if you have a better way to fix this; I am happy to
test your patch!

> I see two problems in the current implementation:
> 1. We don't replenish the 32-bit range until the first attempt at a
> second (64-bit) allocation fails.
> 2. The per-CPU cache might not yield a good hit rate on platforms with
> a larger number of CPUs.
>
> However, irrespective of those issues, it makes sense to update the
> cached node as done in this patch when the current cached pointer fails
> to yield an iova range; otherwise that failure forces unnecessary,
> time-consuming do-while iterations until some replenishment happens!
>
> thanks
> Ganapat
>
>> Robin.
>>
>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>>> ---
>>>  drivers/iommu/iova.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 83fe262..e6ee2ea 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>  	} while (curr && new_pfn <= curr_iova->pfn_hi);
>>>  
>>>  	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>> +		/* No cached node points to a free hole; update it to the leaf node.
>>> +		 */
>>> +		struct iova *prev_iova;
>>> +
>>> +		prev_iova = rb_entry(prev, struct iova, node);
>>> +		__cached_rbnode_insert_update(iovad, prev_iova);
>>>  		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>  		return -ENOMEM;
>>>  	}

thanks
Ganapat
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-05-21 1:15 ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 18+ messages in thread
From: Ganapatrao Kulkarni @ 2018-05-21 1:15 UTC (permalink / raw)
To: Robin Murphy
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber

On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
> Hi Robin,
>
> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni
> <gklkml16@gmail.com> wrote:
>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>>>
>>>> The performance drop is observed with long-hours iperf testing using
>>>> 40G cards. It is mainly due to long iterations in finding a free iova
>>>> range in the 32-bit address space.
>>>>
>>>> In the current implementation, for 64-bit PCI devices there is always a
>>>> first attempt to allocate an iova from the 32-bit address range (SAC is
>>>> preferred over DAC). Once the 32-bit range runs out, allocation falls
>>>> back to the higher range; thanks to the cached32_node optimization this
>>>> is not supposed to be painful. cached32_node always points to the most
>>>> recently allocated 32-bit node. When the address range is full, it
>>>> points to the last allocated node (a leaf node), so walking the rbtree
>>>> to find an available range is not an expensive affair. However, this
>>>> optimization does not behave well when one of the middle nodes is
>>>> freed. In that case cached32_node is updated to point to the next iova
>>>> range. The next iova allocation will consume that free range and again
>>>> update cached32_node to itself. From then on, walking over the 32-bit
>>>> range is more expensive.
>>>>
>>>> This patch updates the cached node to the leaf node when there is no
>>>> free iova range left, which avoids unnecessarily long iterations.
>>>
>>> The only trouble with this is that "allocation failed" doesn't uniquely
>>> mean "space full". Say that after some time the 32-bit space ends up
>>> empty except for one page at 0x1000 and one at 0x80000000, then
>>> somebody tries to allocate 2GB. If we move the cached node down to the
>>> leftmost entry when that fails, all subsequent allocation attempts are
>>> now going to fail despite the space being 99.9999% free!
>>>
>>> I can see a couple of ways to solve that general problem of free space
>>> above the cached node getting lost, but neither of them helps with the
>>> case where there is genuinely insufficient space (and if anything would
>>> make it even slower). In terms of the optimisation you want here, i.e.
>>> fail fast when an allocation cannot possibly succeed, the only reliable
>>> idea which comes to mind is free-PFN accounting. I might give that a go
>>> myself to see how ugly it looks.
>
> For this testing, a dual-port Intel 40G card (XL710) was used, with
> both ports connected in loopback. I ran iperf servers and clients on
> both ports (using NAT to route packets out on the intended ports). Ten
> iperf clients were invoked every 60 seconds in a loop, for hours, on
> each port. Initially the performance on both ports was close to line
> rate; however, after the test had run for about 4 to 6 hours,
> throughput dropped to very low values (a few hundred Mbps) on both
> connections.
>
> IMO this is a common bug that should occur on other platforms too, and
> it needs to be fixed at the earliest.
> Please let me know if you have a better way to fix this; I am happy to
> test your patch!

Any update on this issue?

>> I see two problems in the current implementation:
>> 1. We don't replenish the 32-bit range until the first attempt at a
>> second (64-bit) allocation fails.
>> 2. The per-CPU cache might not yield a good hit rate on platforms with
>> a larger number of CPUs.
>>
>> However, irrespective of those issues, it makes sense to update the
>> cached node as done in this patch when the current cached pointer fails
>> to yield an iova range; otherwise that failure forces unnecessary,
>> time-consuming do-while iterations until some replenishment happens!
>>
>> thanks
>> Ganapat
>>
>>> Robin.
>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>>>> ---
>>>>  drivers/iommu/iova.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>> index 83fe262..e6ee2ea 100644
>>>> --- a/drivers/iommu/iova.c
>>>> +++ b/drivers/iommu/iova.c
>>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>>  	} while (curr && new_pfn <= curr_iova->pfn_hi);
>>>>  
>>>>  	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>>> +		/* No cached node points to a free hole; update it to the leaf node.
>>>> +		 */
>>>> +		struct iova *prev_iova;
>>>> +
>>>> +		prev_iova = rb_entry(prev, struct iova, node);
>>>> +		__cached_rbnode_insert_update(iovad, prev_iova);
>>>>  		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>>  		return -ENOMEM;
>>>>  	}

> thanks
> Ganapat
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-05-21 1:15 ` Ganapatrao Kulkarni (?) @ 2018-06-04 4:06 ` Ganapatrao Kulkarni 2018-07-12 7:45 ` Ganapatrao Kulkarni -1 siblings, 1 reply; 18+ messages in thread From: Ganapatrao Kulkarni @ 2018-06-04 4:06 UTC (permalink / raw) To: Robin Murphy Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber ping?? On Mon, May 21, 2018 at 6:45 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: > On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >> Hi Robin, >> >> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni >> <gklkml16@gmail.com> wrote: >>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote: >>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote: >>>>> >>>>> The performance drop is observed with long hours iperf testing using 40G >>>>> cards. This is mainly due to long iterations in finding the free iova >>>>> range in 32bit address space. >>>>> >>>>> In current implementation for 64bit PCI devices, there is always first >>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address >>>>> range. Once we run out 32bit range, there is allocation from higher range, >>>>> however due to cached32_node optimization it does not suppose to be >>>>> painful. cached32_node always points to recently allocated 32-bit node. >>>>> When address range is full, it will be pointing to last allocated node >>>>> (leaf node), so walking rbtree to find the available range is not >>>>> expensive affair. However this optimization does not behave well when >>>>> one of the middle node is freed. In that case cached32_node is updated >>>>> to point to next iova range. The next iova allocation will consume free >>>>> range and again update cached32_node to itself. From now on, walking >>>>> over 32-bit range is more expensive. 
>>>>> >>>>> This patch adds fix to update cached node to leaf node when there are no >>>>> iova free range left, which avoids unnecessary long iterations. >>>> >>>> >>>> The only trouble with this is that "allocation failed" doesn't uniquely mean >>>> "space full". Say that after some time the 32-bit space ends up empty except >>>> for one page at 0x1000 and one at 0x80000000, then somebody tries to >>>> allocate 2GB. If we move the cached node down to the leftmost entry when >>>> that fails, all subsequent allocation attempts are now going to fail despite >>>> the space being 99.9999% free! >>>> >>>> I can see a couple of ways to solve that general problem of free space above >>>> the cached node getting lost, but neither of them helps with the case where >>>> there is genuinely insufficient space (and if anything would make it even >>>> slower). In terms of the optimisation you want here, i.e. fail fast when an >>>> allocation cannot possibly succeed, the only reliable idea which comes to >>>> mind is free-PFN accounting. I might give that a go myself to see how ugly >>>> it looks. >> >> For this testing, dual port intel 40G card(XL710) used and both ports >> were connected in loop-back. Ran iperf server and clients on both >> ports(used NAT to route packets out on intended ports).There were 10 >> iperf clients invoked every 60 seconds in loop for hours for each >> port. Initially the performance on both ports is seen close to line >> rate, however after test ran about 4 to 6 hours, the performance >> started dropping to very low (to few hundred Mbps) on both >> connections. >> >> IMO, this is common bug and should happen on any other platforms too >> and needs to be fixed at the earliest. >> Please let me know if you have better way to fix this, i am happy to >> test your patch! > > any update on this issue? >> >>> >>> i see 2 problems in current implementation, >>> 1. 
We don't replenish the 32 bits range, until first attempt of second >>> allocation(64 bit) fails. >>> 2. Having per cpu cache might not yield good hit on platforms with >>> more number of CPUs. >>> >>> however irrespective of current issues, It makes sense to update >>> cached node as done in this patch , when there is failure to get iova >>> range using current cached pointer which is forcing for the >>> unnecessary time consuming do-while iterations until any replenish >>> happens! >>> >>> thanks >>> Ganapat >>> >>>> >>>> Robin. >>>> >>>> >>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> >>>>> --- >>>>> drivers/iommu/iova.c | 6 ++++++ >>>>> 1 file changed, 6 insertions(+) >>>>> >>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c >>>>> index 83fe262..e6ee2ea 100644 >>>>> --- a/drivers/iommu/iova.c >>>>> +++ b/drivers/iommu/iova.c >>>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct >>>>> iova_domain *iovad, >>>>> } while (curr && new_pfn <= curr_iova->pfn_hi); >>>>> if (limit_pfn < size || new_pfn < iovad->start_pfn) { >>>>> + /* No more cached node points to free hole, update to leaf >>>>> node. >>>>> + */ >>>>> + struct iova *prev_iova; >>>>> + >>>>> + prev_iova = rb_entry(prev, struct iova, node); >>>>> + __cached_rbnode_insert_update(iovad, prev_iova); >>>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); >>>>> return -ENOMEM; >>>>> } >>>>> >>>> >> >> thanks >> Ganapat ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-06-04 4:06 ` Ganapatrao Kulkarni @ 2018-07-12 7:45 ` Ganapatrao Kulkarni 2018-07-25 14:20 ` Robin Murphy 0 siblings, 1 reply; 18+ messages in thread From: Ganapatrao Kulkarni @ 2018-07-12 7:45 UTC (permalink / raw) To: Robin Murphy Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber Hi Robin, On Mon, Jun 4, 2018 at 9:36 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: > ping?? > > On Mon, May 21, 2018 at 6:45 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >> On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >>> Hi Robin, >>> >>> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni >>> <gklkml16@gmail.com> wrote: >>>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote: >>>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote: >>>>>> >>>>>> The performance drop is observed with long hours iperf testing using 40G >>>>>> cards. This is mainly due to long iterations in finding the free iova >>>>>> range in 32bit address space. >>>>>> >>>>>> In current implementation for 64bit PCI devices, there is always first >>>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address >>>>>> range. Once we run out 32bit range, there is allocation from higher range, >>>>>> however due to cached32_node optimization it does not suppose to be >>>>>> painful. cached32_node always points to recently allocated 32-bit node. >>>>>> When address range is full, it will be pointing to last allocated node >>>>>> (leaf node), so walking rbtree to find the available range is not >>>>>> expensive affair. However this optimization does not behave well when >>>>>> one of the middle node is freed. In that case cached32_node is updated >>>>>> to point to next iova range. 
The next iova allocation will consume free >>>>>> range and again update cached32_node to itself. From now on, walking >>>>>> over 32-bit range is more expensive. >>>>>> >>>>>> This patch adds fix to update cached node to leaf node when there are no >>>>>> iova free range left, which avoids unnecessary long iterations. >>>>> >>>>> >>>>> The only trouble with this is that "allocation failed" doesn't uniquely mean >>>>> "space full". Say that after some time the 32-bit space ends up empty except >>>>> for one page at 0x1000 and one at 0x80000000, then somebody tries to >>>>> allocate 2GB. If we move the cached node down to the leftmost entry when >>>>> that fails, all subsequent allocation attempts are now going to fail despite >>>>> the space being 99.9999% free! >>>>> >>>>> I can see a couple of ways to solve that general problem of free space above >>>>> the cached node getting lost, but neither of them helps with the case where >>>>> there is genuinely insufficient space (and if anything would make it even >>>>> slower). In terms of the optimisation you want here, i.e. fail fast when an >>>>> allocation cannot possibly succeed, the only reliable idea which comes to >>>>> mind is free-PFN accounting. I might give that a go myself to see how ugly >>>>> it looks. did you get any chance to look in to this issue? i am waiting for your suggestion/patch for this issue! >>> >>> For this testing, dual port intel 40G card(XL710) used and both ports >>> were connected in loop-back. Ran iperf server and clients on both >>> ports(used NAT to route packets out on intended ports).There were 10 >>> iperf clients invoked every 60 seconds in loop for hours for each >>> port. Initially the performance on both ports is seen close to line >>> rate, however after test ran about 4 to 6 hours, the performance >>> started dropping to very low (to few hundred Mbps) on both >>> connections. 
>>> >>> IMO, this is common bug and should happen on any other platforms too >>> and needs to be fixed at the earliest. >>> Please let me know if you have better way to fix this, i am happy to >>> test your patch! >> >> any update on this issue? >>> >>>> >>>> i see 2 problems in current implementation, >>>> 1. We don't replenish the 32 bits range, until first attempt of second >>>> allocation(64 bit) fails. >>>> 2. Having per cpu cache might not yield good hit on platforms with >>>> more number of CPUs. >>>> >>>> however irrespective of current issues, It makes sense to update >>>> cached node as done in this patch , when there is failure to get iova >>>> range using current cached pointer which is forcing for the >>>> unnecessary time consuming do-while iterations until any replenish >>>> happens! >>>> >>>> thanks >>>> Ganapat >>>> >>>>> >>>>> Robin. >>>>> >>>>> >>>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> >>>>>> --- >>>>>> drivers/iommu/iova.c | 6 ++++++ >>>>>> 1 file changed, 6 insertions(+) >>>>>> >>>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c >>>>>> index 83fe262..e6ee2ea 100644 >>>>>> --- a/drivers/iommu/iova.c >>>>>> +++ b/drivers/iommu/iova.c >>>>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct >>>>>> iova_domain *iovad, >>>>>> } while (curr && new_pfn <= curr_iova->pfn_hi); >>>>>> if (limit_pfn < size || new_pfn < iovad->start_pfn) { >>>>>> + /* No more cached node points to free hole, update to leaf >>>>>> node. >>>>>> + */ >>>>>> + struct iova *prev_iova; >>>>>> + >>>>>> + prev_iova = rb_entry(prev, struct iova, node); >>>>>> + __cached_rbnode_insert_update(iovad, prev_iova); >>>>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); >>>>>> return -ENOMEM; >>>>>> } >>>>>> >>>>> >>> >>> thanks >>> Ganapat thanks Ganapat ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-07-12 7:45 ` Ganapatrao Kulkarni @ 2018-07-25 14:20 ` Robin Murphy 2018-07-27 12:56 ` Ganapatrao Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Robin Murphy @ 2018-07-25 14:20 UTC (permalink / raw) To: Ganapatrao Kulkarni Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber On 12/07/18 08:45, Ganapatrao Kulkarni wrote: > Hi Robin, > > > On Mon, Jun 4, 2018 at 9:36 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >> ping?? >> >> On Mon, May 21, 2018 at 6:45 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >>> On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote: >>>> Hi Robin, >>>> >>>> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni >>>> <gklkml16@gmail.com> wrote: >>>>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> wrote: >>>>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote: >>>>>>> >>>>>>> The performance drop is observed with long hours iperf testing using 40G >>>>>>> cards. This is mainly due to long iterations in finding the free iova >>>>>>> range in 32bit address space. >>>>>>> >>>>>>> In current implementation for 64bit PCI devices, there is always first >>>>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address >>>>>>> range. Once we run out 32bit range, there is allocation from higher range, >>>>>>> however due to cached32_node optimization it does not suppose to be >>>>>>> painful. cached32_node always points to recently allocated 32-bit node. >>>>>>> When address range is full, it will be pointing to last allocated node >>>>>>> (leaf node), so walking rbtree to find the available range is not >>>>>>> expensive affair. However this optimization does not behave well when >>>>>>> one of the middle node is freed. In that case cached32_node is updated >>>>>>> to point to next iova range. 
The next iova allocation will consume free >>>>>>> range and again update cached32_node to itself. From now on, walking >>>>>>> over 32-bit range is more expensive. >>>>>>> >>>>>>> This patch adds fix to update cached node to leaf node when there are no >>>>>>> iova free range left, which avoids unnecessary long iterations. >>>>>> >>>>>> >>>>>> The only trouble with this is that "allocation failed" doesn't uniquely mean >>>>>> "space full". Say that after some time the 32-bit space ends up empty except >>>>>> for one page at 0x1000 and one at 0x80000000, then somebody tries to >>>>>> allocate 2GB. If we move the cached node down to the leftmost entry when >>>>>> that fails, all subsequent allocation attempts are now going to fail despite >>>>>> the space being 99.9999% free! >>>>>> >>>>>> I can see a couple of ways to solve that general problem of free space above >>>>>> the cached node getting lost, but neither of them helps with the case where >>>>>> there is genuinely insufficient space (and if anything would make it even >>>>>> slower). In terms of the optimisation you want here, i.e. fail fast when an >>>>>> allocation cannot possibly succeed, the only reliable idea which comes to >>>>>> mind is free-PFN accounting. I might give that a go myself to see how ugly >>>>>> it looks. > > did you get any chance to look in to this issue? > i am waiting for your suggestion/patch for this issue! I got as far as [1], but I wasn't sure how much I liked it, since it still seems a little invasive for such a specific case (plus I can't remember if it's actually been debugged or not). I think in the end I started wondering whether it's even worth bothering with the 32-bit optimisation for PCIe devices - 4 extra bytes worth of TLP is surely a lot less significant than every transaction taking up to 50% more bus cycles was for legacy PCI. Robin. 
[1] http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a8e0e4af10ebebb3669750e05bf0028e5bd6afe8 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-07-25 14:20 ` Robin Murphy @ 2018-07-27 12:56 ` Ganapatrao Kulkarni 2018-07-27 16:18 ` Robin Murphy 0 siblings, 1 reply; 18+ messages in thread From: Ganapatrao Kulkarni @ 2018-07-27 12:56 UTC (permalink / raw) To: Robin Murphy Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber Hi Robin, On Wed, Jul 25, 2018 at 7:50 PM, Robin Murphy <robin.murphy@arm.com> wrote: > On 12/07/18 08:45, Ganapatrao Kulkarni wrote: >> >> Hi Robin, >> >> >> On Mon, Jun 4, 2018 at 9:36 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> >> wrote: >>> >>> ping?? >>> >>> On Mon, May 21, 2018 at 6:45 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> >>> wrote: >>>> >>>> On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni >>>> <gklkml16@gmail.com> wrote: >>>>> >>>>> Hi Robin, >>>>> >>>>> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni >>>>> <gklkml16@gmail.com> wrote: >>>>>> >>>>>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@arm.com> >>>>>> wrote: >>>>>>> >>>>>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote: >>>>>>>> >>>>>>>> >>>>>>>> The performance drop is observed with long hours iperf testing using >>>>>>>> 40G >>>>>>>> cards. This is mainly due to long iterations in finding the free >>>>>>>> iova >>>>>>>> range in 32bit address space. >>>>>>>> >>>>>>>> In current implementation for 64bit PCI devices, there is always >>>>>>>> first >>>>>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address >>>>>>>> range. Once we run out 32bit range, there is allocation from higher >>>>>>>> range, >>>>>>>> however due to cached32_node optimization it does not suppose to be >>>>>>>> painful. cached32_node always points to recently allocated 32-bit >>>>>>>> node. 
>>>>>>>> When address range is full, it will be pointing to last allocated >>>>>>>> node >>>>>>>> (leaf node), so walking rbtree to find the available range is not >>>>>>>> expensive affair. However this optimization does not behave well >>>>>>>> when >>>>>>>> one of the middle node is freed. In that case cached32_node is >>>>>>>> updated >>>>>>>> to point to next iova range. The next iova allocation will consume >>>>>>>> free >>>>>>>> range and again update cached32_node to itself. From now on, walking >>>>>>>> over 32-bit range is more expensive. >>>>>>>> >>>>>>>> This patch adds fix to update cached node to leaf node when there >>>>>>>> are no >>>>>>>> iova free range left, which avoids unnecessary long iterations. >>>>>>> >>>>>>> >>>>>>> >>>>>>> The only trouble with this is that "allocation failed" doesn't >>>>>>> uniquely mean >>>>>>> "space full". Say that after some time the 32-bit space ends up empty >>>>>>> except >>>>>>> for one page at 0x1000 and one at 0x80000000, then somebody tries to >>>>>>> allocate 2GB. If we move the cached node down to the leftmost entry >>>>>>> when >>>>>>> that fails, all subsequent allocation attempts are now going to fail >>>>>>> despite >>>>>>> the space being 99.9999% free! >>>>>>> >>>>>>> I can see a couple of ways to solve that general problem of free >>>>>>> space above >>>>>>> the cached node getting lost, but neither of them helps with the case >>>>>>> where >>>>>>> there is genuinely insufficient space (and if anything would make it >>>>>>> even >>>>>>> slower). In terms of the optimisation you want here, i.e. fail fast >>>>>>> when an >>>>>>> allocation cannot possibly succeed, the only reliable idea which >>>>>>> comes to >>>>>>> mind is free-PFN accounting. I might give that a go myself to see how >>>>>>> ugly >>>>>>> it looks. >> >> >> did you get any chance to look in to this issue? >> i am waiting for your suggestion/patch for this issue! 
> > > I got as far as [1], but I wasn't sure how much I liked it, since it still > seems a little invasive for such a specific case (plus I can't remember if > it's actually been debugged or not). I think in the end I started wondering > whether it's even worth bothering with the 32-bit optimisation for PCIe > devices - 4 extra bytes worth of TLP is surely a lot less significant than > every transaction taking up to 50% more bus cycles was for legacy PCI. how about tracking previous attempt to get 32bit range iova and avoid further attempts, if it was failed. Later Resume attempts once replenish happens. Created patch for the same [2] [2] https://github.com/gpkulkarni/linux/commit/e2343a3e1f55cdeb5694103dd354bcb881dc65c3 note, the testing of this patch is in progress. > > Robin. > > [1] > http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a8e0e4af10ebebb3669750e05bf0028e5bd6afe8 thanks Ganapat ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-07-27 12:56 ` Ganapatrao Kulkarni @ 2018-07-27 16:18 ` Robin Murphy 2018-07-30 7:10 ` Ganapatrao Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Robin Murphy @ 2018-07-27 16:18 UTC (permalink / raw) To: Ganapatrao Kulkarni Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber On 27/07/18 13:56, Ganapatrao Kulkarni wrote: [...] >>> did you get any chance to look in to this issue? >>> i am waiting for your suggestion/patch for this issue! >> >> >> I got as far as [1], but I wasn't sure how much I liked it, since it still >> seems a little invasive for such a specific case (plus I can't remember if >> it's actually been debugged or not). I think in the end I started wondering >> whether it's even worth bothering with the 32-bit optimisation for PCIe >> devices - 4 extra bytes worth of TLP is surely a lot less significant than >> every transaction taking up to 50% more bus cycles was for legacy PCI. > > how about tracking previous attempt to get 32bit range iova and avoid > further attempts, if it was failed. Later Resume attempts once > replenish happens. > Created patch for the same [2] Ooh, that's a much neater implementation of essentially the same concept - now why couldn't I think of that? :) Looks like it should be possible to make it entirely self-contained too, since alloc_iova() is in a position to both test and update the flag based on the limit_pfn passed in. Robin. > > [2] https://github.com/gpkulkarni/linux/commit/e2343a3e1f55cdeb5694103dd354bcb881dc65c3 > note, the testing of this patch is in progress. > >> >> Robin. >> >> [1] >> http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a8e0e4af10ebebb3669750e05bf0028e5bd6afe8 > > thanks > Ganapat > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA 2018-07-27 16:18 ` Robin Murphy @ 2018-07-30 7:10 ` Ganapatrao Kulkarni 2018-07-31 2:40 ` Ganapatrao Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Ganapatrao Kulkarni @ 2018-07-30 7:10 UTC (permalink / raw) To: Robin Murphy Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber On Fri, Jul 27, 2018 at 9:48 PM, Robin Murphy <robin.murphy@arm.com> wrote: > On 27/07/18 13:56, Ganapatrao Kulkarni wrote: > [...] >>>> >>>> did you get any chance to look in to this issue? >>>> i am waiting for your suggestion/patch for this issue! >>> >>> >>> >>> I got as far as [1], but I wasn't sure how much I liked it, since it >>> still >>> seems a little invasive for such a specific case (plus I can't remember >>> if >>> it's actually been debugged or not). I think in the end I started >>> wondering >>> whether it's even worth bothering with the 32-bit optimisation for PCIe >>> devices - 4 extra bytes worth of TLP is surely a lot less significant >>> than >>> every transaction taking up to 50% more bus cycles was for legacy PCI. >> >> >> how about tracking previous attempt to get 32bit range iova and avoid >> further attempts, if it was failed. Later Resume attempts once >> replenish happens. >> Created patch for the same [2] > > > Ooh, that's a much neater implementation of essentially the same concept - > now why couldn't I think of that? :) > > Looks like it should be possible to make it entirely self-contained too, > since alloc_iova() is in a position to both test and update the flag based > on the limit_pfn passed in. is below patch much better? 
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 83fe262..abb15d6 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->granule = granule; iovad->start_pfn = start_pfn; iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); + iovad->free_32bit_pfns = true; iovad->flush_cb = NULL; iovad->fq = NULL; iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) cached_iova = rb_entry(iovad->cached32_node, struct iova, node); if (free->pfn_hi < iovad->dma_32bit_pfn && - free->pfn_lo >= cached_iova->pfn_lo) + free->pfn_lo >= cached_iova->pfn_lo) { iovad->cached32_node = rb_next(&free->node); + iovad->free_32bit_pfns = true; + } cached_iova = rb_entry(iovad->cached_node, struct iova, node); if (free->pfn_lo >= cached_iova->pfn_lo) @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, struct iova *new_iova; int ret; + if (limit_pfn < iovad->dma_32bit_pfn && + !iovad->free_32bit_pfns) + return NULL; + new_iova = alloc_iova_mem(); if (!new_iova) return NULL; @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, if (ret) { free_iova_mem(new_iova); + if (limit_pfn < iovad->dma_32bit_pfn) + iovad->free_32bit_pfns = false; return NULL; } diff --git a/include/linux/iova.h b/include/linux/iova.h index 928442d..3810ba9 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -96,6 +96,7 @@ struct iova_domain { flush-queues */ atomic_t fq_timer_on; /* 1 when timer is active, 0 when not */ + bool free_32bit_pfns; }; static inline unsigned long iova_size(struct iova *iova) -- 2.9.4 > > Robin. > > >> >> [2] >> https://github.com/gpkulkarni/linux/commit/e2343a3e1f55cdeb5694103dd354bcb881dc65c3 >> note, the testing of this patch is in progress. >> >>> >>> Robin. 
>>>
>>> [1]
>>> http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a8e0e4af10ebebb3669750e05bf0028e5bd6afe8
>>
>> thanks
>> Ganapat
>

thanks
Ganapat

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA
@ 2018-07-31  2:40 ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 18+ messages in thread
From: Ganapatrao Kulkarni @ 2018-07-31  2:40 UTC (permalink / raw)
To: Robin Murphy
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu, LKML, tomasz.nowicki,
    jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber

Hi Robin,

On Mon, Jul 30, 2018 at 12:40 PM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
> On Fri, Jul 27, 2018 at 9:48 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 27/07/18 13:56, Ganapatrao Kulkarni wrote:
>> [...]
>>>>> did you get any chance to look in to this issue?
>>>>> i am waiting for your suggestion/patch for this issue!
>>>>
>>>> I got as far as [1], but I wasn't sure how much I liked it, since it still
>>>> seems a little invasive for such a specific case (plus I can't remember if
>>>> it's actually been debugged or not). I think in the end I started wondering
>>>> whether it's even worth bothering with the 32-bit optimisation for PCIe
>>>> devices - 4 extra bytes worth of TLP is surely a lot less significant than
>>>> every transaction taking up to 50% more bus cycles was for legacy PCI.
>>>
>>> how about tracking the previous attempt to get a 32-bit range IOVA and
>>> avoiding further attempts if it failed, then resuming attempts once a
>>> 32-bit range is freed again? Created a patch for the same [2]
>>
>> Ooh, that's a much neater implementation of essentially the same concept -
>> now why couldn't I think of that? :)
>>
>> Looks like it should be possible to make it entirely self-contained too,
>> since alloc_iova() is in a position to both test and update the flag based
>> on the limit_pfn passed in.
>
> Is the patch below any better?

Testing with this diff looks OK; shall I send a formal patch with your
Acked-by?
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 83fe262..abb15d6 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>  	iovad->granule = granule;
>  	iovad->start_pfn = start_pfn;
>  	iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
> +	iovad->free_32bit_pfns = true;
>  	iovad->flush_cb = NULL;
>  	iovad->fq = NULL;
>  	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>
>  	cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>  	if (free->pfn_hi < iovad->dma_32bit_pfn &&
> -	    free->pfn_lo >= cached_iova->pfn_lo)
> +	    free->pfn_lo >= cached_iova->pfn_lo) {
>  		iovad->cached32_node = rb_next(&free->node);
> +		iovad->free_32bit_pfns = true;
> +	}
>
>  	cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>  	if (free->pfn_lo >= cached_iova->pfn_lo)
> @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>  	struct iova *new_iova;
>  	int ret;
>
> +	if (limit_pfn < iovad->dma_32bit_pfn &&
> +	    !iovad->free_32bit_pfns)
> +		return NULL;
> +
>  	new_iova = alloc_iova_mem();
>  	if (!new_iova)
>  		return NULL;
> @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>
>  	if (ret) {
>  		free_iova_mem(new_iova);
> +		if (limit_pfn < iovad->dma_32bit_pfn)
> +			iovad->free_32bit_pfns = false;
>  		return NULL;
>  	}
>
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index 928442d..3810ba9 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -96,6 +96,7 @@ struct iova_domain {
>  						   flush-queues */
>  	atomic_t	fq_timer_on;	/* 1 when timer is active, 0
>  					   when not */
> +	bool		free_32bit_pfns;
>  };
>
>  static inline unsigned long iova_size(struct iova *iova)
> --
> 2.9.4
>>
>> Robin.
>>
>>> [2]
>>> https://github.com/gpkulkarni/linux/commit/e2343a3e1f55cdeb5694103dd354bcb881dc65c3
>>> note, the testing of this patch is in progress.
>>>
>>>> Robin.
>>>>
>>>> [1]
>>>> http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a8e0e4af10ebebb3669750e05bf0028e5bd6afe8
>>>
>>> thanks
>>> Ganapat
>
> thanks
> Ganapat

thanks
Ganapat

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2018-07-31  2:41 UTC | newest]

Thread overview: 18+ messages:
2018-04-19 17:12 [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA Ganapatrao Kulkarni
2018-04-19 17:12 ` Ganapatrao Kulkarni
2018-04-23 16:37 ` Robin Murphy
2018-04-23 16:37 ` Robin Murphy
2018-04-23 17:41 ` Ganapatrao Kulkarni
2018-04-23 17:41 ` Ganapatrao Kulkarni
2018-04-26  9:45 ` Ganapatrao Kulkarni
2018-04-26  9:45 ` Ganapatrao Kulkarni
2018-05-21  1:15 ` Ganapatrao Kulkarni
2018-05-21  1:15 ` Ganapatrao Kulkarni
2018-06-04  4:06 ` Ganapatrao Kulkarni
2018-07-12  7:45 ` Ganapatrao Kulkarni
2018-07-25 14:20 ` Robin Murphy
2018-07-27 12:56 ` Ganapatrao Kulkarni
2018-07-27 16:18 ` Robin Murphy
2018-07-30  7:10 ` Ganapatrao Kulkarni
2018-07-31  2:40 ` Ganapatrao Kulkarni
2018-07-31  2:40 ` Ganapatrao Kulkarni