* [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
@ 2021-01-29  9:21 Zhen Lei
  2021-01-29  9:48 ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 18+ messages in thread
From: Zhen Lei @ 2021-01-29  9:21 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta

This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.

We find that this patch has a great impact on performance. According to
our test, IOPS drop from 1655.6K to 893.5K, roughly half.

Hardware: 1 SAS expander with 12 SAS SSD
Command:  Only the main parameters are listed.
          fio bs=4k rw=read iodepth=128 cpus_allowed=0-127

Fixes: 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails")
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 drivers/iommu/iova.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d20b8b333d30d17..f840c7207efbced 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -185,9 +185,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	struct rb_node *curr, *prev;
 	struct iova *curr_iova;
 	unsigned long flags;
-	unsigned long new_pfn, retry_pfn;
+	unsigned long new_pfn;
 	unsigned long align_mask = ~0UL;
-	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
 
 	if (size_aligned)
 		align_mask <<= fls_long(size - 1);
@@ -200,25 +199,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	curr = __get_cached_rbnode(iovad, limit_pfn);
 	curr_iova = rb_entry(curr, struct iova, node);
-	retry_pfn = curr_iova->pfn_hi + 1;
-
-retry:
 	do {
-		high_pfn = min(high_pfn, curr_iova->pfn_lo);
-		new_pfn = (high_pfn - size) & align_mask;
+		limit_pfn = min(limit_pfn, curr_iova->pfn_lo);
+		new_pfn = (limit_pfn - size) & align_mask;
 		prev = curr;
 		curr = rb_prev(curr);
 		curr_iova = rb_entry(curr, struct iova, node);
-	} while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);
-
-	if (high_pfn < size || new_pfn < low_pfn) {
-		if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
-			high_pfn = limit_pfn;
-			low_pfn = retry_pfn;
-			curr = &iovad->anchor.node;
-			curr_iova = rb_entry(curr, struct iova, node);
-			goto retry;
-		}
+	} while (curr && new_pfn <= curr_iova->pfn_hi);
+
+	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
 		iovad->max32_alloc_size = size;
 		goto iova32_full;
 	}
-- 
1.8.3



* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-01-29  9:21 [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails" Zhen Lei
@ 2021-01-29  9:48 ` Leizhen (ThunderTown)
  2021-01-29 12:03   ` Robin Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Leizhen (ThunderTown) @ 2021-01-29  9:48 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta


We are still working on a proper solution to the problem. However, because the v5.11 cycle is coming to an end, we are sending this revert first.


On 2021/1/29 17:21, Zhen Lei wrote:
> This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.
> 
> We find that this patch has a great impact on performance. According to
> our test: the iops decreases from 1655.6K to 893.5K, about half.
> 
> Hardware: 1 SAS expander with 12 SAS SSD
> Command:  Only the main parameters are listed.
>           fio bs=4k rw=read iodepth=128 cpus_allowed=0-127
> 
> Fixes: 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails")
> Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/iova.c | 23 ++++++-----------------
>  1 file changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index d20b8b333d30d17..f840c7207efbced 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -185,9 +185,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>  	struct rb_node *curr, *prev;
>  	struct iova *curr_iova;
>  	unsigned long flags;
> -	unsigned long new_pfn, retry_pfn;
> +	unsigned long new_pfn;
>  	unsigned long align_mask = ~0UL;
> -	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>  
>  	if (size_aligned)
>  		align_mask <<= fls_long(size - 1);
> @@ -200,25 +199,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>  
>  	curr = __get_cached_rbnode(iovad, limit_pfn);
>  	curr_iova = rb_entry(curr, struct iova, node);
> -	retry_pfn = curr_iova->pfn_hi + 1;
> -
> -retry:
>  	do {
> -		high_pfn = min(high_pfn, curr_iova->pfn_lo);
> -		new_pfn = (high_pfn - size) & align_mask;
> +		limit_pfn = min(limit_pfn, curr_iova->pfn_lo);
> +		new_pfn = (limit_pfn - size) & align_mask;
>  		prev = curr;
>  		curr = rb_prev(curr);
>  		curr_iova = rb_entry(curr, struct iova, node);
> -	} while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);
> -
> -	if (high_pfn < size || new_pfn < low_pfn) {
> -		if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
> -			high_pfn = limit_pfn;
> -			low_pfn = retry_pfn;
> -			curr = &iovad->anchor.node;
> -			curr_iova = rb_entry(curr, struct iova, node);
> -			goto retry;
> -		}
> +	} while (curr && new_pfn <= curr_iova->pfn_hi);
> +
> +	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>  		iovad->max32_alloc_size = size;
>  		goto iova32_full;
>  	}
> 


* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-01-29  9:48 ` Leizhen (ThunderTown)
@ 2021-01-29 12:03   ` Robin Murphy
  2021-01-29 12:43     ` chenxiang (M)
  2021-02-25 13:54       ` John Garry
  0 siblings, 2 replies; 18+ messages in thread
From: Robin Murphy @ 2021-01-29 12:03 UTC (permalink / raw)
  To: Leizhen (ThunderTown), Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta

On 2021-01-29 09:48, Leizhen (ThunderTown) wrote:
> 
> Currently, we are thinking about the solution to the problem. However, because the end time of v5.11 is approaching, this patch is sent first.

However, that commit was made for a reason - how do we justify that one 
thing being slow is more important than another thing being completely 
broken? It's not practical to just keep doing the patch hokey-cokey 
based on whoever shouts loudest :(

> On 2021/1/29 17:21, Zhen Lei wrote:
>> This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.
>>
>> We find that this patch has a great impact on performance. According to
>> our test: the iops decreases from 1655.6K to 893.5K, about half.
>>
>> Hardware: 1 SAS expander with 12 SAS SSD
>> Command:  Only the main parameters are listed.
>>            fio bs=4k rw=read iodepth=128 cpus_allowed=0-127

FWIW, I'm 99% sure that what you really want is [1], but then you get to 
battle against an unknown quantity of dodgy firmware instead.

Robin.

[1] 
https://lore.kernel.org/linux-iommu/d412c292d222eb36469effd338e985f9d9e24cd6.1594207679.git.robin.murphy@arm.com/

>> Fixes: 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails")
>> Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>   drivers/iommu/iova.c | 23 ++++++-----------------
>>   1 file changed, 6 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index d20b8b333d30d17..f840c7207efbced 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -185,9 +185,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>   	struct rb_node *curr, *prev;
>>   	struct iova *curr_iova;
>>   	unsigned long flags;
>> -	unsigned long new_pfn, retry_pfn;
>> +	unsigned long new_pfn;
>>   	unsigned long align_mask = ~0UL;
>> -	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>>   
>>   	if (size_aligned)
>>   		align_mask <<= fls_long(size - 1);
>> @@ -200,25 +199,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>   
>>   	curr = __get_cached_rbnode(iovad, limit_pfn);
>>   	curr_iova = rb_entry(curr, struct iova, node);
>> -	retry_pfn = curr_iova->pfn_hi + 1;
>> -
>> -retry:
>>   	do {
>> -		high_pfn = min(high_pfn, curr_iova->pfn_lo);
>> -		new_pfn = (high_pfn - size) & align_mask;
>> +		limit_pfn = min(limit_pfn, curr_iova->pfn_lo);
>> +		new_pfn = (limit_pfn - size) & align_mask;
>>   		prev = curr;
>>   		curr = rb_prev(curr);
>>   		curr_iova = rb_entry(curr, struct iova, node);
>> -	} while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);
>> -
>> -	if (high_pfn < size || new_pfn < low_pfn) {
>> -		if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
>> -			high_pfn = limit_pfn;
>> -			low_pfn = retry_pfn;
>> -			curr = &iovad->anchor.node;
>> -			curr_iova = rb_entry(curr, struct iova, node);
>> -			goto retry;
>> -		}
>> +	} while (curr && new_pfn <= curr_iova->pfn_hi);
>> +
>> +	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>   		iovad->max32_alloc_size = size;
>>   		goto iova32_full;
>>   	}
>>
> 

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-01-29 12:03   ` Robin Murphy
@ 2021-01-29 12:43     ` chenxiang (M)
  2021-02-25 13:54       ` John Garry
  1 sibling, 0 replies; 18+ messages in thread
From: chenxiang (M) @ 2021-01-29 12:43 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta

Hi Robin,


On 2021/1/29 20:03, Robin Murphy wrote:
> On 2021-01-29 09:48, Leizhen (ThunderTown) wrote:
>>
>> Currently, we are thinking about the solution to the problem. 
>> However, because the end time of v5.11 is approaching, this patch is 
>> sent first.
>
> However, that commit was made for a reason - how do we justify that 
> one thing being slow is more important than another thing being 
> completely broken? It's not practical to just keep doing the patch 
> hokey-cokey based on whoever shouts loudest :(
>
>> On 2021/1/29 17:21, Zhen Lei wrote:
>>> This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.
>>>
>>> We find that this patch has a great impact on performance. According to
>>> our test: the iops decreases from 1655.6K to 893.5K, about half.
>>>
>>> Hardware: 1 SAS expander with 12 SAS SSD
>>> Command:  Only the main parameters are listed.
>>>            fio bs=4k rw=read iodepth=128 cpus_allowed=0-127
>
> FWIW, I'm 99% sure that what you really want is [1], but then you get 
> to battle against an unknown quantity of dodgy firmware instead.
>
> Robin.
>
> [1] 
> https://lore.kernel.org/linux-iommu/d412c292d222eb36469effd338e985f9d9e24cd6.1594207679.git.robin.murphy@arm.com/

Thank you for pointing that out. I have tested it, and it solves the 
performance drop issue mentioned above.
I noticed that you sent it in July 2020; do you have any plans to merge
it soon?


>
>>> Fixes: 4e89dce72521 ("iommu/iova: Retry from last rb tree node if 
>>> iova search fails")
>>> Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
>>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> ---
>>>   drivers/iommu/iova.c | 23 ++++++-----------------
>>>   1 file changed, 6 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index d20b8b333d30d17..f840c7207efbced 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -185,9 +185,8 @@ static int __alloc_and_insert_iova_range(struct 
>>> iova_domain *iovad,
>>>       struct rb_node *curr, *prev;
>>>       struct iova *curr_iova;
>>>       unsigned long flags;
>>> -    unsigned long new_pfn, retry_pfn;
>>> +    unsigned long new_pfn;
>>>       unsigned long align_mask = ~0UL;
>>> -    unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>>>         if (size_aligned)
>>>           align_mask <<= fls_long(size - 1);
>>> @@ -200,25 +199,15 @@ static int 
>>> __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>         curr = __get_cached_rbnode(iovad, limit_pfn);
>>>       curr_iova = rb_entry(curr, struct iova, node);
>>> -    retry_pfn = curr_iova->pfn_hi + 1;
>>> -
>>> -retry:
>>>       do {
>>> -        high_pfn = min(high_pfn, curr_iova->pfn_lo);
>>> -        new_pfn = (high_pfn - size) & align_mask;
>>> +        limit_pfn = min(limit_pfn, curr_iova->pfn_lo);
>>> +        new_pfn = (limit_pfn - size) & align_mask;
>>>           prev = curr;
>>>           curr = rb_prev(curr);
>>>           curr_iova = rb_entry(curr, struct iova, node);
>>> -    } while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= 
>>> low_pfn);
>>> -
>>> -    if (high_pfn < size || new_pfn < low_pfn) {
>>> -        if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
>>> -            high_pfn = limit_pfn;
>>> -            low_pfn = retry_pfn;
>>> -            curr = &iovad->anchor.node;
>>> -            curr_iova = rb_entry(curr, struct iova, node);
>>> -            goto retry;
>>> -        }
>>> +    } while (curr && new_pfn <= curr_iova->pfn_hi);
>>> +
>>> +    if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>>           iovad->max32_alloc_size = size;
>>>           goto iova32_full;
>>>       }
>>>
>>
>
> .
>



* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-01-29 12:03   ` Robin Murphy
@ 2021-02-25 13:54       ` John Garry
  2021-02-25 13:54       ` John Garry
  1 sibling, 0 replies; 18+ messages in thread
From: John Garry @ 2021-02-25 13:54 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 29/01/2021 12:03, Robin Murphy wrote:
> On 2021-01-29 09:48, Leizhen (ThunderTown) wrote:
>>
>> Currently, we are thinking about the solution to the problem. However, 
>> because the end time of v5.11 is approaching, this patch is sent first.
> 
> However, that commit was made for a reason - how do we justify that one 
> thing being slow is more important than another thing being completely 
> broken? It's not practical to just keep doing the patch hokey-cokey 
> based on whoever shouts loudest :(
> 
>> On 2021/1/29 17:21, Zhen Lei wrote:
>>> This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.
>>>
>>> We find that this patch has a great impact on performance. According to
>>> our test: the iops decreases from 1655.6K to 893.5K, about half.
>>>
>>> Hardware: 1 SAS expander with 12 SAS SSD
>>> Command:  Only the main parameters are listed.
>>>            fio bs=4k rw=read iodepth=128 cpus_allowed=0-127
> 
> FWIW, I'm 99% sure that what you really want is [1], but then you get to 
> battle against an unknown quantity of dodgy firmware instead.
>

Something which has not been said before is that this only happens for 
strict mode.

Anyway, we see ~50% throughput regression, which is intolerable. As seen
in [0], I put this down to the fact that so many IOVA requests exceed the
rcache size limit, meaning many RB tree accesses for non-cacheable IOVAs,
which are now slower.
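
For reference, the size gate I'm talking about is the rcache lookup in
drivers/iommu/iova.c, which looks roughly like this (paraphrased sketch,
not necessarily the exact code in any given tree):

static unsigned long iova_rcache_get(struct iova_domain *iovad,
				     unsigned long size,
				     unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);

	/* Allocations larger than 2^(IOVA_RANGE_CACHE_MAX_SIZE - 1) pages
	 * bypass the per-CPU caches and always walk the rbtree. */
	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
		return 0;

	return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
}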

On another point, as for the long-term IOVA aging issue, it seems that
there is no conclusion yet. However, I did mention the issue of IOVA
sizes exceeding the rcache size there, so maybe we can find a common
solution. Similar to a fixed rcache depot size, it seems that having a
fixed rcache max size range value (at 6) doesn't scale either.

As for 4e89dce72521, even if it's proper to retry for a failed alloc, it
is not always necessary. I mean, if we're limiting ourselves to the 32b
subspace for this SAC trick and the alloc fails, then we can try the
space above 32b first (if usable), and only fall back to a retry if that
also fails. I don't see a need to retry the 32b subspace if we're not
limited to it. How about it? We tried that idea and it looks to just
about restore performance.
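
A rough, untested sketch of the idea, keeping 4e89dce72521 but gating
its retry behind a hypothetical retry_allowed argument (only the failure
path of __alloc_and_insert_iova_range() is shown; callers that cap
limit_pfn to 32 bits purely as an optimisation would pass false):

	if (high_pfn < size || new_pfn < low_pfn) {
		if (retry_allowed &&
		    low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
			/* Walk the whole tree again only when the caller
			 * genuinely needs an IOVA below limit_pfn. */
			high_pfn = limit_pfn;
			low_pfn = retry_pfn;
			curr = &iovad->anchor.node;
			curr_iova = rb_entry(curr, struct iova, node);
			goto retry;
		}
		iovad->max32_alloc_size = size;
		goto iova32_full;
	}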

Thanks,
John

[0] 
https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-02-25 13:54       ` John Garry
@ 2021-03-01 13:20         ` Robin Murphy
  -1 siblings, 0 replies; 18+ messages in thread
From: Robin Murphy @ 2021-03-01 13:20 UTC (permalink / raw)
  To: John Garry, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 2021-02-25 13:54, John Garry wrote:
> On 29/01/2021 12:03, Robin Murphy wrote:
>> On 2021-01-29 09:48, Leizhen (ThunderTown) wrote:
>>>
>>> Currently, we are thinking about the solution to the problem. 
>>> However, because the end time of v5.11 is approaching, this patch is 
>>> sent first.
>>
>> However, that commit was made for a reason - how do we justify that 
>> one thing being slow is more important than another thing being 
>> completely broken? It's not practical to just keep doing the patch 
>> hokey-cokey based on whoever shouts loudest :(
>>
>>> On 2021/1/29 17:21, Zhen Lei wrote:
>>>> This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.
>>>>
>>>> We find that this patch has a great impact on performance. According to
>>>> our test: the iops decreases from 1655.6K to 893.5K, about half.
>>>>
>>>> Hardware: 1 SAS expander with 12 SAS SSD
>>>> Command:  Only the main parameters are listed.
>>>>            fio bs=4k rw=read iodepth=128 cpus_allowed=0-127
>>
>> FWIW, I'm 99% sure that what you really want is [1], but then you get 
>> to battle against an unknown quantity of dodgy firmware instead.
>>
> 
> Something which has not been said before is that this only happens for 
> strict mode.

I think that makes sense - once you *have* actually failed to allocate 
from the 32-bit space, max32_alloc_size will make subsequent attempts 
fail immediately. In non-strict mode you're most likely freeing 32-bit 
IOVAs back to the tree - and thus reset max32_alloc_size - much less 
often, and you'll make more total space available each time, both of 
which will amortise the cost of getting back into that failed state 
again. Conversely, the worst case in strict mode is to have multiple 
threads getting into this pathological cycle:

1: allocate, get last available IOVA
2: allocate, fail and set max32_alloc_size
3: free one IOVA, reset max32_alloc_size, goto 1

Now, given the broken behaviour where the cached PFN can get stuck near 
the bottom of the address space, step 2 might well have been faster and 
more premature than it should have, but I hope you can appreciate that 
relying on an allocator being broken at its fundamental purpose of 
allocating is not a good or sustainable thing to do.

While max32_alloc_size indirectly tracks the largest *contiguous* 
available space, one of the ideas from which it grew was to simply keep 
count of the total number of free PFNs. If you're really spending 
significant time determining that the tree is full, as opposed to just 
taking longer to eventually succeed, then it might be relatively 
innocuous to tack on that semi-redundant extra accounting as a 
self-contained quick fix for that worst case.
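
As a very rough sketch of that idea (untested; the free_pfns field is
hypothetical, not an existing member of struct iova_domain, and all
updates would need to happen under iova_rbtree_lock):

	/* New member in struct iova_domain: */
	unsigned long	free_pfns;	/* running total of free PFNs */

	/* In __alloc_and_insert_iova_range(), before walking the tree: */
	if (size > iovad->free_pfns)
		goto iova32_full;	/* provably full, skip the search */

	/* After a successful insertion: */
	iovad->free_pfns -= size;

	/* And in the corresponding free path: */
	iovad->free_pfns += iova_size(iova);

Note this only short-circuits the "space is completely exhausted" case;
it says nothing about whether a large enough contiguous hole exists
below limit_pfn.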

> Anyway, we see ~50% throughput regression, which is intolerable. As seen 
> in [0], I put this down to the fact that we have so many IOVA requests 
> which exceed the rcache size limit, which means many RB tree accesses 
> for non-cacheble IOVAs, which are now slower.
> 
> On another point, as for longterm IOVA aging issue, it seems that there 
> is no conclusion there. However I did mention the issue of IOVA sizes 
> exceeding rcache size for that issue, so maybe we can find a common 
> solution. Similar to a fixed rcache depot size, it seems that having a 
> fixed rcache max size range value (at 6) doesn't scale either.

Well, I'd say that's more of a workload tuning thing than a scalability 
one - a massive system with hundreds of CPUs that spends all day 
flinging 1500-byte network packets around as fast as it can might be 
happy with an even smaller value and using the saved memory for 
something else. IIRC the value of 6 is a fairly arbitrary choice for a 
tradeoff between expected utility and memory consumption, so making it a 
Kconfig or command-line tuneable does seem like a sensible thing to explore.

> As for 4e89dce72521, so even if it's proper to retry for a failed alloc, 
> it is not always necessary. I mean, if we're limiting ourselves to 32b 
> subspace for this SAC trick and we fail the alloc, then we can try the 
> space above 32b first (if usable). If that fails, then retry there. I 
> don't see a need to retry the 32b subspace if we're not limited to it. 
> How about it? We tried that idea and it looks to just about restore 
> performance.

The thing is, if you do have an actual PCI device where DAC might mean a 
33% throughput loss and you're mapping a long-lived buffer, or you're on 
one of these systems where firmware fails to document address limits and 
using the full IOMMU address width quietly breaks things, then you 
almost certainly *do* want the allocator to actually do a proper job of 
trying to satisfy the given request.

Furthermore, what you propose is still fragile for your own use-case 
anyway. If someone makes internal changes to the allocator - converts it 
to a different tree structure, implements split locking for concurrency, 
that sort of thing - and it fundamentally loses the dodgy cached32_node 
behaviour which makes the initial failure unintentionally fast for your 
workload's allocation pattern, that extra complexity will suddenly just 
be dead weight and you'll probably be complaining of a performance 
regression again.

We're talking about an allocation that you know you don't need to make, 
and that you even expect to fail, so I still maintain that it's absurd 
to focus on optimising for failure; focus on *not even doing it at all*. 
It just needs an approach that's not going to mess up the unknown but 
apparently nonzero number of systems inadvertently relying on 32-bit 
IOVAs for correctness.

Robin.

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-03-01 13:20         ` Robin Murphy
@ 2021-03-01 15:48           ` John Garry
  -1 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2021-03-01 15:48 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 01/03/2021 13:20, Robin Murphy wrote:
>>> FWIW, I'm 99% sure that what you really want is [1], but then you get
>>> to battle against an unknown quantity of dodgy firmware instead.
>>>
>> Something which has not been said before is that this only happens for
>> strict mode.
> I think that makes sense - once you *have* actually failed to allocate
> from the 32-bit space, max32_alloc_size will make subsequent attempts
> fail immediately. In non-strict mode you're most likely freeing 32-bit
> IOVAs back to the tree - and thus reset max32_alloc_size - much less
> often, and you'll make more total space available each time, both of
> which will amortise the cost of getting back into that failed state
> again. Conversely, the worst case in strict mode is to have multiple
> threads getting into this pathological cycle:
> 
> 1: allocate, get last available IOVA
> 2: allocate, fail and set max32_alloc_size
> 3: free one IOVA, reset max32_alloc_size, goto 1
> 
> Now, given the broken behaviour where the cached PFN can get stuck near
> the bottom of the address space, step 2 might well have been faster and
> more premature than it should have, but I hope you can appreciate that
> relying on an allocator being broken at its fundamental purpose of
> allocating is not a good or sustainable thing to do.

I figure that you're talking about 4e89dce72521 now. I would have liked 
to know which real-life problem it solved in practice.

> 
> While max32_alloc_size indirectly tracks the largest *contiguous*
> available space, one of the ideas from which it grew was to simply keep
> count of the total number of free PFNs. If you're really spending
> significant time determining that the tree is full, as opposed to just
> taking longer to eventually succeed, then it might be relatively
> innocuous to tack on that semi-redundant extra accounting as a
> self-contained quick fix for that worst case.
> 
>> Anyway, we see ~50% throughput regression, which is intolerable. As seen
>> in [0], I put this down to the fact that we have so many IOVA requests
>> which exceed the rcache size limit, which means many RB tree accesses
>> for non-cacheble IOVAs, which are now slower.

I will attempt to prove this by increasing RCACHE RANGE, such that all 
IOVA sizes may be cached.

>>
>> On another point, as for longterm IOVA aging issue, it seems that there
>> is no conclusion there. However I did mention the issue of IOVA sizes
>> exceeding rcache size for that issue, so maybe we can find a common
>> solution. Similar to a fixed rcache depot size, it seems that having a
>> fixed rcache max size range value (at 6) doesn't scale either.
> Well, I'd say that's more of a workload tuning thing than a scalability
> one -

ok

> a massive system with hundreds of CPUs that spends all day
> flinging 1500-byte network packets around as fast as it can might be
> happy with an even smaller value and using the saved memory for
> something else. IIRC the value of 6 is a fairly arbitrary choice for a
> tradeoff between expected utility and memory consumption, so making it a
> Kconfig or command-line tuneable does seem like a sensible thing to explore.

Even if it were configurable, wouldn't it make sense to have it
configurable per IOVA domain?

Furthermore, as mentioned above, I still want to solve this IOVA aging
issue, and this fixed RCACHE RANGE size seems to be at the center of
that problem.

> 
>> As for 4e89dce72521, so even if it's proper to retry for a failed alloc,
>> it is not always necessary. I mean, if we're limiting ourselves to 32b
>> subspace for this SAC trick and we fail the alloc, then we can try the
>> space above 32b first (if usable). If that fails, then retry there. I
>> don't see a need to retry the 32b subspace if we're not limited to it.
>> How about it? We tried that idea and it looks to just about restore
>> performance.
> The thing is, if you do have an actual PCI device where DAC might mean a
> 33% throughput loss and you're mapping a long-lived buffer, or you're on
> one of these systems where firmware fails to document address limits and
> using the full IOMMU address width quietly breaks things, then you
> almost certainly *do* want the allocator to actually do a proper job of
> trying to satisfy the given request.

If those conditions were true, then it seems quite a tenuous position, 
so trying to help that scenario in general terms will have limited efficacy.

> 
> Furthermore, what you propose is still fragile for your own use-case
> anyway. If someone makes internal changes to the allocator - converts it
> to a different tree structure, implements split locking for concurrency,
> that sort of thing - and it fundamentally loses the dodgy cached32_node
> behaviour which makes the initial failure unintentionally fast for your
> workload's allocation pattern, that extra complexity will suddenly just
> be dead weight and you'll probably be complaining of a performance
> regression again.
> 
> We're talking about an allocation that you know you don't need to make,
> and that you even expect to fail, so I still maintain that it's absurd
> to focus on optimising for failure; 

Of course, but....

> focus on *not even doing it at all*.
> It just needs an approach that's not going to mess up the unknown but
> apparently nonzero number of systems inadvertently relying on 32-bit
> IOVAs for correctness.

We are seeing a ~50% throughput performance hit, and it's quite 
reasonable to request a short-term fix, rather than accepting that this 
problem is something which we need to solve medium/long-term and we 
don't know how yet.

Going forward, we should try to fix or work around any broken platforms,
rather than hide them all. Indeed, the current approach will just give
rise to more broken platforms - people generally only fix what they can
see is broken. I do wonder how many there really are.

So how about sticking the change to avoid the SAC trick for PCIe devices
behind a kconfig option, and handling issues on a case-by-case basis as
they arise? I think that this is what Joerg suggested earlier.
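
Something like the below in iommu_dma_alloc_iova(), say (untested
sketch; CONFIG_IOMMU_DMA_PCI_SAC is a made-up symbol for illustration
and would need a proper Kconfig entry):

	/* Try to get PCI devices a SAC address, but only when asked to */
	if (IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC) &&
	    dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);

	/* Otherwise, or on failure, use the full usable range directly */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
				       true);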

In addition to that, we can revisit the IOVA aging issue and the related
topic of the fixed RCACHE RANGE. Hopefully we can solve our short-term
performance issue there.

Thanks,
John

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-03-01 15:48           ` John Garry
@ 2021-03-02 12:30             ` John Garry
  -1 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2021-03-02 12:30 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm, chenxiang

On 01/03/2021 15:48, John Garry wrote:
>>
>> While max32_alloc_size indirectly tracks the largest *contiguous*
>> available space, one of the ideas from which it grew was to simply keep
>> count of the total number of free PFNs. If you're really spending
>> significant time determining that the tree is full, as opposed to just
>> taking longer to eventually succeed, then it might be relatively
>> innocuous to tack on that semi-redundant extra accounting as a
>> self-contained quick fix for that worst case.
>>
>>> Anyway, we see ~50% throughput regression, which is intolerable. As seen
>>> in [0], I put this down to the fact that we have so many IOVA requests
>>> which exceed the rcache size limit, which means many RB tree accesses
>>> for non-cacheble IOVAs, which are now slower.
> 
> I will attempt to prove this by increasing RCACHE RANGE, such that all 
> IOVA sizes may be cached.

About this one, as expected, we restore performance by increasing the 
RCACHE RANGE.

Some figures:
Baseline v5.12-rc1

strict mode:
600K IOPs

Revert "iommu/iova: Retry from last rb tree node if iova search fails":
1215K

Increase IOVA RCACHE range 6 -> 10 (all IOVA size requests now
cacheable for this experiment):
1400K

Reduce LLDD max SGE count 124 -> 16:
1288K

non-strict mode
1650K

So ideally we can work towards something for which IOVAs of all sizes
could be cached.
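
For reference, the 6 -> 10 experiment above amounts to something like
this (against wherever IOVA_RANGE_CACHE_MAX_SIZE lives in a given tree;
shown only to illustrate the test, not proposed as-is):

-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define IOVA_RANGE_CACHE_MAX_SIZE 10	/* cache IOVA ranges up to 2^9 pages */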

Cheers,
John

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-03-01 15:48           ` John Garry
@ 2021-03-08 15:15             ` Robin Murphy
  -1 siblings, 0 replies; 18+ messages in thread
From: Robin Murphy @ 2021-03-08 15:15 UTC (permalink / raw)
  To: John Garry, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 2021-03-01 15:48, John Garry wrote:
> On 01/03/2021 13:20, Robin Murphy wrote:
>>>> FWIW, I'm 99% sure that what you really want is [1], but then you get
>>>> to battle against an unknown quantity of dodgy firmware instead.
>>>>
>>> Something which has not been said before is that this only happens for
>>> strict mode.
>> I think that makes sense - once you *have* actually failed to allocate
>> from the 32-bit space, max32_alloc_size will make subsequent attempts
>> fail immediately. In non-strict mode you're most likely freeing 32-bit
>> IOVAs back to the tree - and thus reset max32_alloc_size - much less
>> often, and you'll make more total space available each time, both of
>> which will amortise the cost of getting back into that failed state
>> again. Conversely, the worst case in strict mode is to have multiple
>> threads getting into this pathological cycle:
>>
>> 1: allocate, get last available IOVA
>> 2: allocate, fail and set max32_alloc_size
>> 3: free one IOVA, reset max32_alloc_size, goto 1
>>
>> Now, given the broken behaviour where the cached PFN can get stuck near
>> the bottom of the address space, step 2 might well have been faster and
>> more premature than it should have, but I hope you can appreciate that
>> relying on an allocator being broken at its fundamental purpose of
>> allocating is not a good or sustainable thing to do.
> 
> I figure that you're talking about 4e89dce72521 now. I would have liked 
> to know which real-life problem it solved in practice.

From what I remember, the problem reported was basically the one
illustrated in that commit and the one I alluded to above - namely that 
certain allocation patterns with a broad mix of sizes and relative 
lifetimes end up pushing the cached PFN down to the bottom of the 
address space such that allocations start failing despite there still 
being sufficient free space overall, which was breaking some media 
workload. What was originally proposed was an overcomplicated palaver 
with DMA attributes and a whole extra allocation algorithm rather than 
just fixing the clearly unintended and broken behaviour.

>> While max32_alloc_size indirectly tracks the largest *contiguous*
>> available space, one of the ideas from which it grew was to simply keep
>> count of the total number of free PFNs. If you're really spending
>> significant time determining that the tree is full, as opposed to just
>> taking longer to eventually succeed, then it might be relatively
>> innocuous to tack on that semi-redundant extra accounting as a
>> self-contained quick fix for that worst case.
>>
>>> Anyway, we see ~50% throughput regression, which is intolerable. As seen
>>> in [0], I put this down to the fact that we have so many IOVA requests
>>> which exceed the rcache size limit, which means many RB tree accesses
>>> for non-cacheble IOVAs, which are now slower.
> 
> I will attempt to prove this by increasing RCACHE RANGE, such that all 
> IOVA sizes may be cached.
> 
>>>
>>> On another point, as for longterm IOVA aging issue, it seems that there
>>> is no conclusion there. However I did mention the issue of IOVA sizes
>>> exceeding rcache size for that issue, so maybe we can find a common
>>> solution. Similar to a fixed rcache depot size, it seems that having a
>>> fixed rcache max size range value (at 6) doesn't scale either.
>> Well, I'd say that's more of a workload tuning thing than a scalability
>> one -
> 
> ok
> 
>> a massive system with hundreds of CPUs that spends all day
>> flinging 1500-byte network packets around as fast as it can might be
>> happy with an even smaller value and using the saved memory for
>> something else. IIRC the value of 6 is a fairly arbitrary choice for a
>> tradeoff between expected utility and memory consumption, so making it a
>> Kconfig or command-line tuneable does seem like a sensible thing to 
>> explore.
> 
> Even if it is were configurable, wouldn't it make sense to have it 
> configurable per IOVA domain?

Perhaps, but I don't see that being at all easy to implement. We can't 
arbitrarily *increase* the scope of caching once a domain is active due 
to the size-rounding-up requirement, which would be prohibitive to 
larger allocations if applied universally.
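
For reference, the rounding in question is roughly this, from
iommu_dma_alloc_iova() in drivers/iommu/dma-iommu.c (paraphrased
sketch, not verbatim):

	iova_len = size >> iova_shift(iovad);

	/*
	 * Freeing non-power-of-two-sized allocations back into the IOVA
	 * caches is expensive, so anything small enough to be cacheable
	 * is rounded up to a power of two at allocation time - which is
	 * why the cacheable range can't simply grow underneath IOVAs
	 * that were already allocated without that rounding.
	 */
	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		iova_len = roundup_pow_of_two(iova_len);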

> Furthermore, as mentioned above, I still want to solve this IOVA aging 
> issue, and this fixed RCACHE RANGE size seems to be the at the center of 
> that problem.
> 
>>
>>> As for 4e89dce72521, so even if it's proper to retry for a failed alloc,
>>> it is not always necessary. I mean, if we're limiting ourselves to 32b
>>> subspace for this SAC trick and we fail the alloc, then we can try the
>>> space above 32b first (if usable). If that fails, then retry there. I
>>> don't see a need to retry the 32b subspace if we're not limited to it.
>>> How about it? We tried that idea and it looks to just about restore
>>> performance.
>> The thing is, if you do have an actual PCI device where DAC might mean a
>> 33% throughput loss and you're mapping a long-lived buffer, or you're on
>> one of these systems where firmware fails to document address limits and
>> using the full IOMMU address width quietly breaks things, then you
>> almost certainly *do* want the allocator to actually do a proper job of
>> trying to satisfy the given request.
> 
> If those conditions were true, then it seems quite a tenuous position, 
> so trying to help that scenario in general terms will have limited 
> efficacy.

Still, I'd be curious to see if making the restart a bit cleverer offers 
a noticeable improvement. IIRC I suggested it at the time, but in the 
end the push was just to get *something* merged.

Robin.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-03-08 15:15             ` Robin Murphy
@ 2021-03-08 16:22               ` John Garry
  -1 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2021-03-08 16:22 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 08/03/2021 15:15, Robin Murphy wrote:
>> I figure that you're talking about 4e89dce72521 now. I would have 
>> liked to know which real-life problem it solved in practice.
> 
>  From what I remember, the problem reported was basically the one 
> illustrated in that commit and the one I alluded to above - namely that 
> certain allocation patterns with a broad mix of sizes and relative 
> lifetimes end up pushing the cached PFN down to the bottom of the 
> address space such that allocations start failing despite there still 
> being sufficient free space overall, which was breaking some media 
> workload. What was originally proposed was an overcomplicated palaver 
> with DMA attributes and a whole extra allocation algorithm rather than 
> just fixing the clearly unintended and broken behaviour.

ok, fine. I just wondered if this was a theoretical problem only.

> 
>>> While max32_alloc_size indirectly tracks the largest *contiguous* 
>>> available space, one of the ideas from which it grew was to simply keep
>>> count of the total number of free PFNs. If you're really spending
>>> significant time determining that the tree is full, as opposed to just
>>> taking longer to eventually succeed, then it might be relatively
>>> innocuous to tack on that semi-redundant extra accounting as a
>>> self-contained quick fix for that worst case.
>>>

...
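
To make that "extra accounting" concrete, a self-contained sketch - the
field name and exactly where it gets updated are invented here purely for
illustration:

        /* in struct iova_domain: free PFNs remaining below dma_32bit_pfn */
        unsigned long   free_32bit_pfns;

        /* __alloc_and_insert_iova_range(), after taking iova_rbtree_lock
         * but before walking the tree: give up early if the space plainly
         * cannot hold the request, regardless of fragmentation */
        if (limit_pfn <= iovad->dma_32bit_pfn &&
            size > iovad->free_32bit_pfns) {
                spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
                return -ENOMEM;
        }

        /* ...decremented when a 32-bit IOVA is inserted, incremented again
         * on free, alongside the existing max32_alloc_size bookkeeping */
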

>>
>> Even if it were configurable, wouldn't it make sense to have it 
>> configurable per IOVA domain?
> 
> Perhaps, but I don't see that being at all easy to implement. We can't 
> arbitrarily *increase* the scope of caching once a domain is active due 
> to the size-rounding-up requirement, which would be prohibitive to 
> larger allocations if applied universally.
> 

Agreed.

But having that (all IOVA sizes being cacheable) available could be 
really great, though, for some situations.

>> Furthermore, as mentioned above, I still want to solve this IOVA aging 
>> issue, and this fixed RCACHE RANGE size seems to be at the center 
>> of that problem.
>>
>>>
>>>> As for 4e89dce72521, so even if it's proper to retry for a failed 
>>>> alloc,
>>>> it is not always necessary. I mean, if we're limiting ourselves to 32b
>>>> subspace for this SAC trick and we fail the alloc, then we can try the
>>>> space above 32b first (if usable). If that fails, then retry there. I
>>>> don't see a need to retry the 32b subspace if we're not limited to it.
>>>> How about it? We tried that idea and it looks to just about restore
>>>> performance.
>>> The thing is, if you do have an actual PCI device where DAC might mean a
>>> 33% throughput loss and you're mapping a long-lived buffer, or you're on
>>> one of these systems where firmware fails to document address limits and
>>> using the full IOMMU address width quietly breaks things, then you
>>> almost certainly *do* want the allocator to actually do a proper job of
>>> trying to satisfy the given request.
>>
>> If those conditions were true, then it seems quite a tenuous position, 
>> so trying to help that scenario in general terms will have limited 
>> efficacy.
> 
> Still, I'd be curious to see if making the restart a bit cleverer offers 
> a noticeable improvement. IIRC I suggested it at the time, but in the 
> end the push was just to get *something* merged.

Sorry to say, I just tested that ("iommu/iova: Improve restart logic") 
and there is no obvious improvement.

I'll have a further look at what might be going on.

Thanks very much,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
  2021-03-08 16:22               ` John Garry
@ 2021-03-10 17:50                 ` John Garry
  -1 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2021-03-10 17:50 UTC (permalink / raw)
  To: Robin Murphy, Leizhen (ThunderTown),
	Will Deacon, Joerg Roedel, iommu, linux-kernel
  Cc: Vijayanand Jitta, Linuxarm

On 08/03/2021 16:22, John Garry wrote:
> 
>>
>>>> While max32_alloc_size indirectly tracks the largest *contiguous* 
>>>> available space, one of the ideas from which it grew was to simply keep
>>>> count of the total number of free PFNs. If you're really spending
>>>> significant time determining that the tree is full, as opposed to just
>>>> taking longer to eventually succeed, then it might be relatively
>>>> innocuous to tack on that semi-redundant extra accounting as a
>>>> self-contained quick fix for that worst case.
>>>>
> 
> ...

So since the retry means that we search through the complete pfn range 
most of the time (due to the poor success rate quoted), we should be able 
to do a better job of maintaining an accurate max alloc size by 
calculating it during the failed search, rather than relying on the size 
of the failed alloc or resetting it so frequently. Hopefully that would 
mean that we're smarter about quickly failing the allocation.
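
Very roughly, based on the allocation loop shown in the revert at the top
of this thread - max_gap is a local invented just for illustration, and
alignment is ignored for simplicity:

        unsigned long max_gap = 0;
        ...
        do {
                high_pfn = min(high_pfn, curr_iova->pfn_lo);
                new_pfn = (high_pfn - size) & align_mask;
                prev = curr;
                curr = rb_prev(curr);
                curr_iova = rb_entry(curr, struct iova, node);
                /* remember the largest hole the failed walk passes over */
                if (curr && high_pfn > curr_iova->pfn_hi)
                        max_gap = max(max_gap,
                                      high_pfn - (curr_iova->pfn_hi + 1));
        } while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);

        if (high_pfn < size || new_pfn < low_pfn) {
                ...
                /* only requests larger than the biggest hole seen fail fast */
                iovad->max32_alloc_size = max_gap + 1;
                goto iova32_full;
        }

That would keep max32_alloc_size meaningful even when the size that
happened to fail is small.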

I'll further look at that.

Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-03-10 17:53 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-29  9:21 [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails" Zhen Lei
2021-01-29  9:48 ` Leizhen (ThunderTown)
2021-01-29 12:03   ` Robin Murphy
2021-01-29 12:43     ` chenxiang (M)
2021-02-25 13:54     ` John Garry
2021-03-01 13:20       ` Robin Murphy
2021-03-01 15:48         ` John Garry
2021-03-02 12:30           ` John Garry
2021-03-08 15:15           ` Robin Murphy
2021-03-08 16:22             ` John Garry
2021-03-10 17:50               ` John Garry
