alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9

* alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
@ 2021-04-21 18:36 Florian Fainelli
  2021-04-22  7:49 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Fainelli @ 2021-04-21 18:36 UTC
  To: mhocko, Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner
  Cc: l.stach, LKML, linux-mmc, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim

Hi all,

I have been trying for the past few days to identify the source of a
performance regression that we are seeing with the 5.4 kernel but not
with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a bit
challenging at the moment but will happen eventually.

What we are seeing is a ~3x increase in the time needed for
alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
is idle at the time and there are no other contenders for memory other
than the user-space programs already started (DHCP client, shell, etc.).

I have tried playing with the compact_control structure settings but
have not found anything that would bring us back to the performance of
4.9. More often than not, we see test_pages_isolated() returning a
non-zero error code, which would explain the slowdown, since we have
some logic that re-tries the allocation if alloc_contig_range() returns
-EBUSY. If I remove the retry logic however, we don't get -EBUSY and we
get the results below:

4.9 shows this:

[  457.537634] allocating: size: 1024MB avg: 59172 (us), max: 137306 (us), min: 44859 (us), total: 591723 (us), pages: 512, per-page: 115 (us)
[  457.550222] freeing: size: 1024MB avg: 67397 (us), max: 151408 (us), min: 52630 (us), total: 673974 (us), pages: 512, per-page: 131 (us)

5.4 shows this:

[  222.388758] allocating: size: 1024MB avg: 156739 (us), max: 157254 (us), min: 155915 (us), total: 1567394 (us), pages: 512, per-page: 306 (us)
[  222.401601] freeing: size: 1024MB avg: 209899 (us), max: 210085 (us), min: 209749 (us), total: 2098999 (us), pages: 512, per-page: 409 (us)
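
The derived fields in these log lines can be cross-checked with a little arithmetic. The sketch below is not the benchmark itself, which is not shown in this thread. It assumes that "per-page" is the average divided by the 512 blocks (truncated), and that "total" is the sum over all runs; both assumptions are consistent with the numbers above. The function names are made up for illustration.

```python
# Averages and totals copied from the log lines above (all in microseconds).
RESULTS = {
    "4.9": {"alloc": {"avg": 59172, "total": 591723},
            "free": {"avg": 67397, "total": 673974}},
    "5.4": {"alloc": {"avg": 156739, "total": 1567394},
            "free": {"avg": 209899, "total": 2098999}},
}
BLOCKS = 512  # the "pages: 512" field: 1024MB / 2MB


def per_block_us(avg_us, blocks=BLOCKS):
    """Assumed derivation of the 'per-page' field: truncated avg / blocks."""
    return avg_us // blocks


def iterations(total_us, avg_us):
    """Assumed number of benchmark runs: total / avg, rounded."""
    return round(total_us / avg_us)


def slowdown(op):
    """Ratio of the 5.4 average to the 4.9 average for 'alloc' or 'free'."""
    return RESULTS["5.4"][op]["avg"] / RESULTS["4.9"][op]["avg"]


if __name__ == "__main__":
    for kernel, ops in RESULTS.items():
        for op, r in ops.items():
            print(kernel, op, per_block_us(r["avg"]), "us per block,",
                  iterations(r["total"], r["avg"]), "runs")
    print(f"alloc: {slowdown('alloc'):.2f}x slower, "
          f"free: {slowdown('free'):.2f}x slower")
```

Under these assumptions the averages work out to roughly 2.6x slower for allocating and 3.1x slower for freeing, which is where the "~3x" figure comes from.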

This regression is not seen when MIGRATE_CMA is specified instead of
MIGRATE_MOVABLE.

A few characteristics that you should probably be aware of:

- There is 4GB of memory populated, mapped into the CPU's address space
starting at 0x4000_0000 (1GB); PAGE_SIZE is 4KB

- There is a ZONE_DMA32 that starts at 0x4000_0000 and ends at
0xE480_0000; from there on we have a ZONE_MOVABLE which comprises
0xE480_0000 - 0xFDC0_0000 and another range spanning 0x1_0000_0000 -
0x1_4000_0000
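
To make the layout easier to picture, the zone sizes can be computed from the ranges listed above. This is only a sketch: the range ends are treated as exclusive, which is an assumption, and the helper names are invented for illustration.

```python
MIB = 1 << 20

# Physical ranges as (start, end), end exclusive, copied from the list above.
ZONE_DMA32 = [(0x4000_0000, 0xE480_0000)]
ZONE_MOVABLE = [(0xE480_0000, 0xFDC0_0000), (0x1_0000_0000, 0x1_4000_0000)]


def size_mib(ranges):
    """Total size of a list of (start, end) ranges, in MiB."""
    return sum(end - start for start, end in ranges) // MIB


def blocks_2mib(ranges):
    """Number of whole 2MiB blocks that fit, counted per range."""
    return sum((end - start) // (2 * MIB) for start, end in ranges)


if __name__ == "__main__":
    print("ZONE_DMA32:  ", size_mib(ZONE_DMA32), "MiB")
    print("ZONE_MOVABLE:", size_mib(ZONE_MOVABLE), "MiB,",
          blocks_2mib(ZONE_MOVABLE), "blocks of 2MiB")
```

With these ranges ZONE_MOVABLE comes to 1428MiB (404MiB plus 1024MiB), and the two zones together cover 4060MiB, 36MiB short of 4GB, which matches the gap between 0xFDC0_0000 and 0x1_0000_0000.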

Attached is the kernel configuration.

Thanks!
-- 
Florian

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 32742 bytes --]


* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-21 18:36 alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9 Florian Fainelli
@ 2021-04-22  7:49 ` Michal Hocko
  2021-04-22  8:56   ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2021-04-22  7:49 UTC
  To: Florian Fainelli
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, linux-mmc, Jaewon Kim, Michal Nazarewicz,
	Joonsoo Kim, David Hildenbrand, Oscar Salvador

Cc David and Oscar who are familiar with this code as well.

-- 
Michal Hocko
SUSE Labs


* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-22  7:49 ` Michal Hocko
@ 2021-04-22  8:56   ` David Hildenbrand
  2021-04-22 17:50     ` Florian Fainelli
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-04-22  8:56 UTC
  To: Michal Hocko, Florian Fainelli
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, linux-mmc, Jaewon Kim, Michal Nazarewicz,
	Joonsoo Kim, Oscar Salvador, linux-mm

On 22.04.21 09:49, Michal Hocko wrote:
> Cc David and Oscar who are familiar with this code as well.
> 
> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>> Hi all,
>>
>> I have been trying for the past few days to identify the source of a
>> performance regression that we are seeing with the 5.4 kernel but not
>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a bit
>> challenging at the moment but will happen eventually.
>>
>> What we are seeing is a ~3x increase in the time needed for
>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
>> is idle at the time and there are no other contenders for memory other
>> than the user-space programs already started (DHCP client, shell, etc.).

Hi,

If you can easily reproduce it, it might be worth just trying to bisect;
that could be faster than manually poking around in the code.

Also, it would be worth having a look at the state of upstream Linux. 
Upstream Linux developers tend to not care about minor performance 
regressions on oldish kernels.

There has been work on improving exactly the situation you are 
describing -- a "fail fast" / "no retry" mode for alloc_contig_range(). 
Maybe it tackles exactly this issue.

https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org

Minchan is already on cc.

(next time, please cc linux-mm on core-mm questions; maybe you tried, 
but ended up with linux-mmc :) )

-- 
Thanks,

David / dhildenb



* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-22  8:56   ` David Hildenbrand
@ 2021-04-22 17:50     ` Florian Fainelli
  2021-04-22 18:35       ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Fainelli @ 2021-04-22 17:50 UTC
  To: David Hildenbrand, Michal Hocko
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim,
	Oscar Salvador, linux-mm



On 4/22/2021 1:56 AM, David Hildenbrand wrote:
> On 22.04.21 09:49, Michal Hocko wrote:
>> Cc David and Oscar who are familiar with this code as well.
>>
>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>> Hi all,
>>>
>>> I have been trying for the past few days to identify the source of a
>>> performance regression that we are seeing with the 5.4 kernel but not
>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a bit
>>> challenging at the moment but will happen eventually.
>>>
>>> What we are seeing is a ~3x increase in the time needed for
>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
>>> is idle at the time and there are no other contenders for memory other
>>> than the user-space programs already started (DHCP client, shell, etc.).
> 
> Hi,
> 
> If you can easily reproduce it, it might be worth just trying to bisect;
> that could be faster than manually poking around in the code.
> 
> Also, it would be worth having a look at the state of upstream Linux.
> Upstream Linux developers tend to not care about minor performance
> regressions on oldish kernels.

This is a big pain point here and I cannot agree more, but until we
bridge that gap, this is not exactly easy to do for me unfortunately and
neither is bisection :/

> 
> There has been work on improving exactly the situation you are
> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
> Maybe it tackles exactly this issue.
> 
> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
> 
> Minchan is already on cc.

This patch does not appear to help. In fact, I had already applied this
older patch locally:

https://lkml.org/lkml/2014/5/28/113

which would effectively do this unconditionally. Let me see if I can
showcase this problem on an x86 virtual machine operating in similar
conditions to ours.

> 
> (next time, please cc linux-mm on core-mm questions; maybe you tried,
> but ended up with linux-mmc :) )

Yes that was the intent, thanks for correcting that.

-- 
Florian


* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-22 17:50     ` Florian Fainelli
@ 2021-04-22 18:35       ` David Hildenbrand
  2021-04-22 19:31         ` Florian Fainelli
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-04-22 18:35 UTC
  To: Florian Fainelli, Michal Hocko
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim,
	Oscar Salvador, linux-mm

On 22.04.21 19:50, Florian Fainelli wrote:
> 
> 
> On 4/22/2021 1:56 AM, David Hildenbrand wrote:
>> On 22.04.21 09:49, Michal Hocko wrote:
>>> Cc David and Oscar who are familiar with this code as well.
>>>
>>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>>> Hi all,
>>>>
>>>> I have been trying for the past few days to identify the source of a
>>>> performance regression that we are seeing with the 5.4 kernel but not
>>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a bit
>>>> challenging at the moment but will happen eventually.
>>>>
>>>> What we are seeing is a ~3x increase in the time needed for
>>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
>>>> is idle at the time and there are no other contenders for memory other
>>>> than the user-space programs already started (DHCP client, shell, etc.).
>>
>> Hi,
>>
>> If you can easily reproduce it, it might be worth just trying to bisect;
>> that could be faster than manually poking around in the code.
>>
>> Also, it would be worth having a look at the state of upstream Linux.
>> Upstream Linux developers tend to not care about minor performance
>> regressions on oldish kernels.
> 
> This is a big pain point here and I cannot agree more, but until we
> bridge that gap, this is not exactly easy to do for me unfortunately and
> neither is bisection :/
> 
>>
>> There has been work on improving exactly the situation you are
>> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
>> Maybe it tackles exactly this issue.
>>
>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
>>
>> Minchan is already on cc.
> 
> This patch does not appear to help. In fact, I had already applied this
> older patch locally:
> 
> https://lkml.org/lkml/2014/5/28/113
> 
> which would effectively do this unconditionally. Let me see if I can
> showcase this problem on an x86 virtual machine operating in similar
> conditions to ours.

How exactly are you allocating these 2MiB blocks?

Via CMA->alloc_contig_range() or via alloc_contig_range() directly? I 
assume via CMA.

For

https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org

to do its work you'll have to pass __GFP_NORETRY to
alloc_contig_range(). This requires adapting CMA, which is where we call
alloc_contig_range() from.

-- 
Thanks,

David / dhildenb



* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-22 18:35       ` David Hildenbrand
@ 2021-04-22 19:31         ` Florian Fainelli
  2021-05-16 16:13           ` Florian Fainelli
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Fainelli @ 2021-04-22 19:31 UTC
  To: David Hildenbrand, Michal Hocko
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim,
	Oscar Salvador, linux-mm



On 4/22/2021 11:35 AM, David Hildenbrand wrote:
> On 22.04.21 19:50, Florian Fainelli wrote:
>>
>>
>> On 4/22/2021 1:56 AM, David Hildenbrand wrote:
>>> On 22.04.21 09:49, Michal Hocko wrote:
>>>> Cc David and Oscar who are familiar with this code as well.
>>>>
>>>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>>>> Hi all,
>>>>>
>>>>> I have been trying for the past few days to identify the source of a
>>>>> performance regression that we are seeing with the 5.4 kernel but not
>>>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is
>>>>> a bit
>>>>> challenging at the moment but will happen eventually.
>>>>>
>>>>> What we are seeing is a ~3x increase in the time needed for
>>>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The
>>>>> system
>>>>> is idle at the time and there are no other contenders for memory other
>>>>> than the user-space programs already started (DHCP client, shell,
>>>>> etc.).
>>>
>>> Hi,
>>>
>>> If you can easily reproduce it, it might be worth just trying to bisect;
>>> that could be faster than manually poking around in the code.
>>>
>>> Also, it would be worth having a look at the state of upstream Linux.
>>> Upstream Linux developers tend to not care about minor performance
>>> regressions on oldish kernels.
>>
>> This is a big pain point here and I cannot agree more, but until we
>> bridge that gap, this is not exactly easy to do for me unfortunately and
>> neither is bisection :/
>>
>>>
>>> There has been work on improving exactly the situation you are
>>> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
>>> Maybe it tackles exactly this issue.
>>>
>>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
>>>
>>> Minchan is already on cc.
>>
>> This patch does not appear to help. In fact, I had already applied this
>> older patch locally:
>>
>> https://lkml.org/lkml/2014/5/28/113
>>
>> which would effectively do this unconditionally. Let me see if I can
>> showcase this problem on an x86 virtual machine operating in similar
>> conditions to ours.
> 
> How exactly are you allocating these 2MiB blocks?
> 
> Via CMA->alloc_contig_range() or via alloc_contig_range() directly? I
> assume via CMA.

I am allocating this memory directly via alloc_contig_range(start, end,
MIGRATE_MOVABLE, GFP_KERNEL), just looping over 1024MB in 2MB
increments. This is just a synthetic benchmark, though we do have an
allocator that behaves just like that as well.
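
As an illustration of that loop, the sketch below computes the (start, end) pairs such a benchmark would pass, assuming the arguments are page frame numbers with an exclusive end and the 4KB PAGE_SIZE mentioned in the original report. The base address is hypothetical: the thread does not say where the 1024MB region starts, so the second ZONE_MOVABLE range is used purely as an example. The kernel call itself is not reproduced here.

```python
PAGE_SHIFT = 12            # PAGE_SIZE is 4KB, per the original report
BLOCK_BYTES = 2 << 20      # one 2MB allocation
TOTAL_BYTES = 1024 << 20   # 1024MB covered by the loop


def pfn_ranges(base_phys, total_bytes=TOTAL_BYTES, block_bytes=BLOCK_BYTES):
    """Yield (start_pfn, end_pfn) for each block; end_pfn is exclusive."""
    pages_per_block = block_bytes >> PAGE_SHIFT
    first_pfn = base_phys >> PAGE_SHIFT
    for i in range(total_bytes // block_bytes):
        start = first_pfn + i * pages_per_block
        yield start, start + pages_per_block


if __name__ == "__main__":
    # Hypothetical base address, chosen only for illustration.
    ranges = list(pfn_ranges(0x1_0000_0000))
    print(len(ranges), "calls, first:", [hex(p) for p in ranges[0]],
          "last:", [hex(p) for p in ranges[-1]])
```

Each block spans 512 page frames, and the loop makes 512 calls in total, which lines up with the "pages: 512" field in the benchmark output.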

> 
> For
> 
> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
> 
> to do its work you'll have to pass __GFP_NORETRY to
> alloc_contig_range(). This requires adapting CMA, which is where we call
> alloc_contig_range() from.

Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
__GFP_NORETRY. I ran more iterations (1000) and the results are not
very conclusive: with __GFP_NORETRY the time per allocation was not
significantly better; in fact it was slightly worse, by 100us, than
without.

My x86 VM with 1GB of DRAM, including 512MB in ZONE_MOVABLE, shows
identical numbers for both 4.9 and 5.4, so this must be something
specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
that architecture, since movablecore does not appear to have any effect
there, unlike on x86.
-- 
Florian


* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-04-22 19:31         ` Florian Fainelli
@ 2021-05-16 16:13           ` Florian Fainelli
  2021-05-17  7:46             ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Fainelli @ 2021-05-16 16:13 UTC
  To: David Hildenbrand, Michal Hocko
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim,
	Oscar Salvador, linux-mm



On 4/22/2021 12:31 PM, Florian Fainelli wrote:
>> For
>>
>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
>>
>> to do its work you'll have to pass __GFP_NORETRY to
>> alloc_contig_range(). This requires adapting CMA, which is where we call
>> alloc_contig_range() from.
> 
> Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
> __GFP_NORETRY. I ran more iterations (1000) and the results are not
> very conclusive: with __GFP_NORETRY the time per allocation was not
> significantly better; in fact it was slightly worse, by 100us, than
> without.
> 
> My x86 VM with 1GB of DRAM, including 512MB in ZONE_MOVABLE, shows
> identical numbers for both 4.9 and 5.4, so this must be something
> specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
> that architecture, since movablecore does not appear to have any effect
> there, unlike on x86.

We tracked the slowdown down to two major contributors:

- For a reason that we do not fully understand yet, the same cpufreq
governor (conservative) did not slow alloc_contig_range() down on 4.9
as much as it does on 5.4. Running the tests with the performance
cpufreq governor works a tad better, and the results are more
consistent from run to run, with a smaller variation.

- Another large contributor to the slowdown was having
CONFIG_IRQSOFF_TRACER enabled. After c3bc8fd637a9623f5c507bd18f9677effbddf584
("tracing: Centralize preemptirq tracepoints and unify their usage") we
now prepare arguments for tracing even if we end up not using them because
tracing is not enabled at runtime. Getting the caller function's return
address is cheap on arm64 for level == 0, but getting the preceding
caller involves doing a backtrace walk, which is expensive (see
arch/arm64/kernel/return_address.c).
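
The pattern described in the second item can be shown with a toy model. This is not the kernel code and does not claim to mirror its structure: every name below is hypothetical, and the "walk" is just a counter. It only demonstrates that arguments built at the call site are paid for on every call, even when the callee discards them because tracing is off.

```python
stats = {"walks": 0}     # how many times the "expensive" lookup ran
TRACING_ENABLED = False  # stands in for tracing being disabled at runtime


def caller_addr(level):
    """Toy stand-in for looking up a return address `level` frames up."""
    if level > 0:
        stats["walks"] += 1  # models the backtrace walk needed past level 0
    return 0x1000 + level


def trace_irqs_off(ip, parent_ip):
    """Toy tracepoint: does nothing unless tracing is enabled."""
    if TRACING_ENABLED:
        print(f"irqs off at {ip:#x}, called from {parent_ip:#x}")


def irqs_off_eager():
    # Both arguments are computed before the call, so the walk always runs.
    trace_irqs_off(caller_addr(0), caller_addr(1))


def irqs_off_guarded():
    # Checking first skips the walk entirely while tracing is off.
    if TRACING_ENABLED:
        trace_irqs_off(caller_addr(0), caller_addr(1))


if __name__ == "__main__":
    for _ in range(1000):
        irqs_off_eager()
    print("eager walks:", stats["walks"])
    stats["walks"] = 0
    for _ in range(1000):
        irqs_off_guarded()
    print("guarded walks:", stats["walks"])
```

In a path that disables and enables interrupts very often, a per-call cost of this kind adds up even though nothing is ever traced.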

So with these two variables eliminated we are only about 2x slower on
5.4 than we were on 4.9, and this is acceptable for our use case. I would
not say the case is closed but at least we understand it better. We now
have 5.10 brought up to speed so any new investigation will be focused
on that kernel.
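
For anyone repeating the governor comparison, the active cpufreq governor can be read and changed through the standard sysfs files. The thread does not say how the governor was switched for these tests, so the helper below is only one possible way to do it; the function names are invented, and writing the files on a real system requires root.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # standard cpufreq sysfs location
PATTERN = "cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(root=CPU_ROOT):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    for path in sorted(Path(root).glob(PATTERN)):
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def set_governor(name, root=CPU_ROOT):
    """Write `name` to every scaling_governor file found under `root`."""
    for path in Path(root).glob(PATTERN):
        path.write_text(name + "\n")


if __name__ == "__main__":
    # Prints an empty dict on systems without cpufreq support.
    print(read_governors())
```

Recording the governor alongside each benchmark run makes it easier to tell frequency-scaling effects apart from changes in the allocator itself.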

Thanks a lot for your help David!
-- 
Florian


* Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9
  2021-05-16 16:13           ` Florian Fainelli
@ 2021-05-17  7:46             ` David Hildenbrand
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2021-05-17  7:46 UTC (permalink / raw)
  To: Florian Fainelli, Michal Hocko
  Cc: Vlastimil Babka, Mel Gorman, Minchan Kim, Johannes Weiner,
	l.stach, LKML, Jaewon Kim, Michal Nazarewicz, Joonsoo Kim,
	Oscar Salvador, linux-mm

On 16.05.21 18:13, Florian Fainelli wrote:
> 
> 
> On 4/22/2021 12:31 PM, Florian Fainelli wrote:
>>> For
>>>
>>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
>>>
>>> to do its work you'll have to pass __GFP_NORETRY to
>>> alloc_contig_range(). This requires adapting CMA, which is where we call
>>> alloc_contig_range() from.
>>
>> Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
>> __GFP_NORETRY. I ran more iterations (1000) and the results are not
>> very conclusive: with __GFP_NORETRY the time per allocation was not
>> significantly better; in fact it was slightly worse, by 100us, than
>> without.
>>
>> My x86 VM with 1GB of DRAM, including 512MB in ZONE_MOVABLE, shows
>> identical numbers for both 4.9 and 5.4, so this must be something
>> specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
>> that architecture, since movablecore does not appear to have any effect
>> there, unlike on x86.
> 
> We tracked the slowdown down to two major contributors:
> 
> - For a reason that we do not fully understand yet, the same cpufreq
> governor (conservative) did not slow alloc_contig_range() down on 4.9
> as much as it does on 5.4. Running the tests with the performance
> cpufreq governor works a tad better, and the results are more
> consistent from run to run, with a smaller variation.

Interesting! So your CPU is down-clocking while performing (heavy) 
kernel work? Is that expected or are we mis-accounting kernel cpu time 
somehow when it comes to determining the CPU target frequency?

> 
> - Another large contributor to the slowdown was having
> CONFIG_IRQSOFF_TRACER enabled. After c3bc8fd637a9623f5c507bd18f9677effbddf584
> ("tracing: Centralize preemptirq tracepoints and unify their usage") we
> now prepare arguments for tracing even if we end up not using them because
> tracing is not enabled at runtime. Getting the caller function's return
> address is cheap on arm64 for level == 0, but getting the preceding
> caller involves doing a backtrace walk, which is expensive (see
> arch/arm64/kernel/return_address.c).

Again, very interesting finding.

> 
> So with these two variables eliminated we are only about 2x slower on
> 5.4 than we were on 4.9, and this is acceptable for our use case. I would
> not say the case is closed but at least we understand it better. We now
> have 5.10 brought up to speed so any new investigation will be focused
> on that kernel.
> 

Thanks for the insight, please do let me know when you learn more. A 2x
slowdown is still quite a lot.

-- 
Thanks,

David / dhildenb


