linux-kernel.vger.kernel.org archive mirror
* [BUG] Page allocation failures with newest kernels
@ 2016-05-31  3:02 Marcin Wojtas
  2016-05-31 10:17 ` Robin Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-05-31  3:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel
  Cc: Yehuda Yitschak, nadavh, Lior Amsalem, Thomas Petazzoni,
	Gregory Clément, Grzegorz Jaszczyk, Tomasz Nowicki,
	Will Deacon, Catalin Marinas, Arnd Bergmann

Hi,

After rebasing platform support for two different ARMv8 SoCs from a v4.1
baseline to v4.4, it turned out that stressed systems tend to hit page
allocation failures when creating new slabs:

http://pastebin.com/FhRW5DsF

Steps to reproduce:
- use a SATA drive (on-board or over PCIe) with two 50G btrfs partitions
- run a couple of loops of the following script:
#!/bin/bash
# Usage: $1 - drive letter (e.g. "a" for /dev/sda), $2 - number of iterations
mount /dev/sd${1}1 /mnt
mount /dev/sd${1}2 /mnt2
i=0
while [[ $i -lt ${2} ]]
do
	echo -e "i = ${i}\n"
	# six concurrent sequential writers, five of them in the background
	dd if=/dev/zero of=/mnt/3g bs=3M count=1024 &
	dd if=/dev/zero of=/mnt/2g bs=2M count=1024 &
	dd if=/dev/zero of=/mnt/1g bs=1M count=1024 &
	dd if=/dev/zero of=/mnt2/2g bs=2M count=1024 &
	dd if=/dev/zero of=/mnt2/1g bs=1M count=1024 &
	dd if=/dev/zero of=/mnt2/3g bs=3M count=1024
	let "i++"
done

The issue also reproduces on v4.6. Usually the problems occur within the
first iteration and the remaining ones complete without errors; the kernel
also remains stable. I have been told that the page allocation problems were
also observed on a Marvell ARMv7 platform (Armada 38x).

About the debugging itself: after adding the simplest possible tracepoint in
trace/events/kmem.h (a single u64 argument for a counter or any other kind
of number), it turned out that on both v4.1 and v4.4 the following condition
is hit many times during the test.
In __alloc_pages_nodemask(), the kernel enters the 'unlikely' branch below a
huge number of times (~250k times in v4.1 and ~570k in v4.4 per script
loop):
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (unlikely(!page)) {
    [...]
    page = __alloc_pages_slowpath(alloc_mask, order, &ac);
}
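
(For reference, the kind of minimal tracepoint described above could look
roughly like the sketch below; the event name and argument are illustrative,
not the exact ones used.)

/* include/trace/events/kmem.h - minimal one-argument debug event (sketch) */
TRACE_EVENT(mm_debug_counter,

	TP_PROTO(u64 val),

	TP_ARGS(val),

	TP_STRUCT__entry(
		__field(u64, val)
	),

	TP_fast_assign(
		__entry->val = val;
	),

	TP_printk("val=%llu", (unsigned long long)__entry->val)
);

It is then called as trace_mm_debug_counter(counter) from the spot being
instrumented, e.g. just before the 'unlikely' branch quoted above.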

A further difference is seen in __alloc_pages_slowpath(). warn_alloc_failed()
(the routine responsible for printing the page allocation failure message) is
reached via the following condition:
if (!can_direct_reclaim) {
    [...]
    goto nopage;
}
This is hit ~5 times per script loop in v4.1 and ~40 times in v4.4.

Printing the message can, however, be suppressed by the following condition
in warn_alloc_failed():
if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
    debug_guardpage_minorder() > 0)
        return;
Only the first two are relevant here. As the ratelimit interval is derived
directly from CONFIG_HZ, and this parameter differs between the v4.1 and v4.4
configs (100 vs 250; CONFIG_SCHED_HRTICK is also enabled only in v4.4), the
configs were swapped, but the behaviour did not change.
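
(For reference, the HZ dependence comes from the default ratelimit parameters
used for nopage_rs; the v4.4-era definitions are roughly the following,
quoted approximately from memory.)

/* mm/page_alloc.c, inside warn_alloc_failed() */
static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
			      DEFAULT_RATELIMIT_BURST);

/* include/linux/ratelimit.h */
#define DEFAULT_RATELIMIT_INTERVAL	(5 * HZ)	/* 5 s of wall-clock time */
#define DEFAULT_RATELIMIT_BURST		10		/* messages per interval  */

Note that because the interval is expressed in jiffies as a multiple of HZ,
the wall-clock rate limit is the same regardless of CONFIG_HZ, which is
consistent with the config swap not changing the behaviour.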

Also, within a 'faulty' revision there is a difference depending on the root
filesystem used: with buildroot the dumps occur, but with the same test under
Ubuntu it is impossible to see the failure output (and it is not a question
of the dmesg log level :)). Comparing /proc/sys/vm contents didn't show
anything meaningful.

I tried to analyze the changes in the mm/ folder between v4.1 and v4.4 that
might cause such a difference, but wasn't able to find out what is causing
the issue. Has anyone encountered such problems in recent revisions? I would
be very grateful for any hint or comment. Also, if any other data would be
useful, please let me know and I will capture it.

Best regards,
Marcin Wojtas

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-05-31  3:02 [BUG] Page allocation failures with newest kernels Marcin Wojtas
@ 2016-05-31 10:17 ` Robin Murphy
  2016-05-31 10:29   ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Robin Murphy @ 2016-05-31 10:17 UTC (permalink / raw)
  To: Marcin Wojtas, linux-mm, linux-kernel, linux-arm-kernel
  Cc: Lior Amsalem, Thomas Petazzoni, Yehuda Yitschak, Catalin Marinas,
	Arnd Bergmann, Grzegorz Jaszczyk, Will Deacon, nadavh,
	Tomasz Nowicki, Gregory Clément

On 31/05/16 04:02, Marcin Wojtas wrote:
> Hi,
>
> After rebasing platform support of two different ARMv8 SoC's from v4.1
> baseline to v4.4 it occurred that stressed systems tend to have page
> allocation problems, related to creating new slabs:
>
> http://pastebin.com/FhRW5DsF
>
> Steps to reproduce:
> - use SATA drive (on-board or over PCIe) with 2 btrfs 50G partitions
> - run a couple of loops of following script:
> mount /dev/sd${1}1 /mnt
> mount /dev/sd${1}2 /mnt2
> i=0
> while [[ $i -lt ${2} ]]
> do
> echo -e "i = ${i}\n"
> dd if=/dev/zero of=/mnt/3g bs=3M count=1024 &
> dd if=/dev/zero of=/mnt/2g bs=2M count=1024 &
> dd if=/dev/zero of=/mnt/1g bs=1M count=1024 &
> dd if=/dev/zero of=/mnt2/2g bs=2M count=1024 &
> dd if=/dev/zero of=/mnt2/1g bs=1M count=1024 &
> dd if=/dev/zero of=/mnt2/3g bs=3M count=1024
> let "i++"
> done
>
> The issue also reproduced on v4.6. Usually problems occur within first
> iteration and then the rest is done without errors, also kernel remain
> stable. I got an information, that page alloc problem were observed
> also on Marvell ARMv7 platfrom (Armada38x).

I remember there were some issues around 4.2 with the revision of the 
arm64 atomic implementations affecting the cmpxchg_double() in SLUB, but 
those should all be fixed (and the symptoms tended to be considerably 
more fatal). A stronger candidate would be 97303480753e (which landed in 
4.4), which has various knock-on effects on the layout of SLUB internals 
- does fiddling with L1_CACHE_SHIFT make any difference?
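
(For context, commit 97303480753e increases the arm64 cache line definition;
its effect is roughly the sketch below, assuming the usual L1_CACHE_SHIFT
bump from 6 to 7 - a paraphrase, not the literal diff.)

/* arch/arm64/include/asm/cache.h after 97303480753e (sketch) */
#define L1_CACHE_SHIFT		7			/* previously 6 */
#define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)	/* 64 -> 128 bytes */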

Robin.

> About the debug itself - after adding simplest possible trace in
> trace/events/kmem.h (single argument u64 for counter or whatever kind
> of number), it was shown both on v4.1 and v4.4 following condition is
> achieved multiple times during test:
> In __alloc_pages_nodemask(), during the test kernel jumps huge amount
> of times (~250k times in v4.1 and ~570k in v4.4 per one script loop)
> into following 'unlikely' condition:
> page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> if (unlikely(!page)) {
>      [...]
>      page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> }
>
> The further difference is seen in __alloc_pages_slowpath().
> warn_alloc_page() (routine responsible for printing page alloc failure
> message) is reached via following condition:
> if (!can_direct_reclaim) {
>      [...]
>      goto nopage;
> }
> In v4.1 ~5 times and in v4.4 ~40 times per one script loop.
>
> Printing message however can be blocked by following condition in
> warn_alloc_fail():
> if ((gfp_mask & _GFP_NOWARN) || !_ratelimit(&nopage_rs) ||
>      debug_guardpage_minorder() > 0)
>          return;
> Only first two are relevant. As ratelimit is derived directly from
> CONFIG_HZ and this parameter differ between v4.1 and v4.4 (100 vs 250,
> also CONFIG_SCHED_HRTICK is enabled only in v4.4) the configs were
> swapped, but no change in behavior.
>
> Also within 'faulty' revision there is a difference, depending on
> filesystem used - with buildroot the dumps occur, but with same test
> under ubuntu - it's impossible see the failure output (and it's not a
> question of dmesg level:)). Comparing /proc/sys/vm contents didn't
> show anything meaningful.
>
> I tried to analyze changes around mm/ folder between v4.1 and v4.4
> that may cause such difference, but wasn't able to find out what may
> be causing the issue. Have anyone encountered such problems in recent
> revisions? I would be very grateful for any hint or comment. Also if
> any other data can be captured, please let know.
>
> Best regards,
> Marcin Wojtas
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-05-31 10:17 ` Robin Murphy
@ 2016-05-31 10:29   ` Marcin Wojtas
  2016-05-31 13:10     ` Yehuda Yitschak
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-05-31 10:29 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-mm, linux-kernel, linux-arm-kernel, Lior Amsalem,
	Thomas Petazzoni, Yehuda Yitschak, Catalin Marinas,
	Arnd Bergmann, Grzegorz Jaszczyk, Will Deacon, nadavh,
	Tomasz Nowicki, Gregory Clément

Hi Robin,

>
> I remember there were some issues around 4.2 with the revision of the arm64
> atomic implementations affecting the cmpxchg_double() in SLUB, but those
> should all be fixed (and the symptoms tended to be considerably more fatal).
> A stronger candidate would be 97303480753e (which landed in 4.4), which has
> various knock-on effects on the layout of SLUB internals - does fiddling
> with L1_CACHE_SHIFT make any difference?
>

I'll check the commits, thanks. I forgot to add that L1_CACHE_SHIFT was my
first suspect - I had spent a long time debugging a network controller which
stopped working because of this change, as L1_CACHE_BYTES (and hence
NET_SKB_PAD) no longer fit the HW constraints. Anyway, reverting it didn't
help at all with the page alloc issue.
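
(For reference, the dependency mentioned here comes from the v4.4-era
headers, roughly as follows.)

/* include/linux/cache.h */
#define SMP_CACHE_BYTES L1_CACHE_BYTES

/* include/linux/skbuff.h */
#ifndef NET_SKB_PAD
#define NET_SKB_PAD	max(32, L1_CACHE_BYTES)
#endif

So bumping L1_CACHE_SHIFT also grows the headroom reserved in front of every
skb, which is what collided with the controller's HW constraints.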

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [BUG] Page allocation failures with newest kernels
  2016-05-31 10:29   ` Marcin Wojtas
@ 2016-05-31 13:10     ` Yehuda Yitschak
  2016-05-31 13:15       ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Yehuda Yitschak @ 2016-05-31 13:10 UTC (permalink / raw)
  To: Marcin Wojtas, Robin Murphy
  Cc: linux-mm, linux-kernel, linux-arm-kernel, Lior Amsalem,
	Thomas Petazzoni, Catalin Marinas, Arnd Bergmann,
	Grzegorz Jaszczyk, Will Deacon, Nadav Haklai, Tomasz Nowicki,
	Gregory Clément

Hi Robin 

During some of the stress tests we also came across a different warning from
the arm64 page management code. It looks like a race is detected between HW
and SW marking a bit in the PTE.

Not sure it's really related, but I thought it might give a clue about the
issue:
http://pastebin.com/ASv19vZP

Thanks

Yehuda 


> -----Original Message-----
> From: Marcin Wojtas [mailto:mw@semihalf.com]
> Sent: Tuesday, May 31, 2016 13:30
> To: Robin Murphy
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lior Amsalem; Thomas Petazzoni; Yehuda
> Yitschak; Catalin Marinas; Arnd Bergmann; Grzegorz Jaszczyk; Will Deacon;
> Nadav Haklai; Tomasz Nowicki; Gregory Clément
> Subject: Re: [BUG] Page allocation failures with newest kernels
> 
> Hi Robin,
> 
> >
> > I remember there were some issues around 4.2 with the revision of the
> > arm64 atomic implementations affecting the cmpxchg_double() in SLUB,
> > but those should all be fixed (and the symptoms tended to be
> considerably more fatal).
> > A stronger candidate would be 97303480753e (which landed in 4.4),
> > which has various knock-on effects on the layout of SLUB internals -
> > does fiddling with L1_CACHE_SHIFT make any difference?
> >
> 
> I'll check the commits, thanks. I forgot to add L1_CACHE_SHIFT was my first
> suspect - I had spent a long time debugging network controller, which
> stopped working because of this change - L1_CACHE_BYTES (and hence
> NET_SKB_PAD) not fitting HW constraints. Anyway reverting it didn't help at
> all for page alloc issue.
> 
> Best regards,
> Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-05-31 13:10     ` Yehuda Yitschak
@ 2016-05-31 13:15       ` Will Deacon
  2016-06-02  5:48         ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2016-05-31 13:15 UTC (permalink / raw)
  To: Yehuda Yitschak
  Cc: Marcin Wojtas, Robin Murphy, linux-mm, linux-kernel,
	linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

On Tue, May 31, 2016 at 01:10:44PM +0000, Yehuda Yitschak wrote:
> During some of the stress tests we also came across a different warning
> from the arm64  page management code
> It looks like a race is detected between HW and SW marking a bit in the PTE

A72 (which I believe is the CPU in that SoC) is a v8.0 CPU and therefore
doesn't have hardware DBM.

> Not sure it's really related but I thought it might give a clue on the issue
> http://pastebin.com/ASv19vZP

There have been a few patches from Catalin to fix up the hardware DBM
patches, so it might be worth trying to reproduce this failure with a
more recent kernel. I doubt this is related to the allocation failures,
however.

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-05-31 13:15       ` Will Deacon
@ 2016-06-02  5:48         ` Marcin Wojtas
  2016-06-02 13:52           ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-02  5:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yehuda Yitschak, Robin Murphy, linux-mm, linux-kernel,
	linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément, mgorman

Hi Will,

I think I found the right lead. The following one-liner fixes the issue from
v4.2-rc1 up to and including v4.4:

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -294,7 +294,7 @@ static inline bool early_page_uninitialised(unsigned long pfn)

 static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
 {
-       return false;
+       return true;
 }

The regression was introduced by commit 7e18adb4f80b ("mm: meminit:
initialise remaining struct pages in parallel with kswapd"), which in
practice disabled the MIGRATE_RESERVE pageblock reservation entirely on all
platforms not using CONFIG_DEFERRED_STRUCT_PAGE_INIT (x86 is the only user) -
hence the temporary shortage of allocatable memory during my test.

Since v4.4-rc1 the following changes of approach have been introduced:
97a16fc - mm, page_alloc: only enforce watermarks for order-0 allocations
0aaa29a - mm, page_alloc: reserve pageblocks for high-order atomic
allocations on demand
974a786 - mm, page_alloc: remove MIGRATE_RESERVE

From what I understood, order-0 allocations now keep no reserve at all.
I checked all the gathered logs and indeed it was order-0 allocations that
failed and apparently could not reclaim successfully. Since the problem is
very easy to reproduce (at least in my test, as well as when stressing the
device in a NAS setup), is there any chance of avoiding the page alloc
failures? Or any trick to play with fragmentation parameters, etc.?

I would be grateful for any hint.

Best regards,
Marcin

2016-05-31 15:15 GMT+02:00 Will Deacon <will.deacon@arm.com>:
> On Tue, May 31, 2016 at 01:10:44PM +0000, Yehuda Yitschak wrote:
>> During some of the stress tests we also came across a different warning
>> from the arm64  page management code
>> It looks like a race is detected between HW and SW marking a bit in the PTE
>
> A72 (which I believe is the CPU in that SoC) is a v8.0 CPU and therefore
> doesn't have hardware DBM.
>
>> Not sure it's really related but I thought it might give a clue on the issue
>> http://pastebin.com/ASv19vZP
>
> There have been a few patches from Catalin to fix up the hardware DBM
> patches, so it might be worth trying to reproduce this failure with a
> more recent kernel. I doubt this is related to the allocation failures,
> however.
>
> Will

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-02  5:48         ` Marcin Wojtas
@ 2016-06-02 13:52           ` Mel Gorman
  2016-06-02 19:01             ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2016-06-02 13:52 UTC (permalink / raw)
  To: Marcin Wojtas
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

On Thu, Jun 02, 2016 at 07:48:38AM +0200, Marcin Wojtas wrote:
> Hi Will,
> 
> I think I found a right trace. Following one-liner fixes the issue
> beginning from v4.2-rc1 up to v4.4 included:
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -294,7 +294,7 @@ static inline bool
> early_page_uninitialised(unsigned long pfn)
> 
>  static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>  {
> -       return false;
> +       return true;
>  }
> 

How does that make a difference in v4.4, since commit
974a786e63c96a2401a78ddba926f34c128474f1 removed the only caller of
early_page_nid_uninitialised()? It further doesn't make sense if deferred
memory initialisation is not enabled, as the pages will always be
initialised.

> From what I understood, now order-0 allocation keep no reserve at all.

Watermarks should still be preserved. zone_watermark_ok is still there.
What might change is the size of reserves for high-order atomic
allocations only. Fragmentation shouldn't be a factor. I'm missing some
major part of the picture.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-02 13:52           ` Mel Gorman
@ 2016-06-02 19:01             ` Marcin Wojtas
  2016-06-03  9:53               ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-02 19:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

Hi Mel,

2016-06-02 15:52 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
> On Thu, Jun 02, 2016 at 07:48:38AM +0200, Marcin Wojtas wrote:
>> Hi Will,
>>
>> I think I found a right trace. Following one-liner fixes the issue
>> beginning from v4.2-rc1 up to v4.4 included:
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -294,7 +294,7 @@ static inline bool
>> early_page_uninitialised(unsigned long pfn)
>>
>>  static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>  {
>> -       return false;
>> +       return true;
>>  }
>>
>
> How does that make a difference in v4.4 since commit
> 974a786e63c96a2401a78ddba926f34c128474f1 removed the only
> early_page_nid_uninitialised() ? It further doesn't make sense if deferred
> memory initialisation is not enabled as the pages will always be
> initialised.
>

Right, it should be "up to and including v4.3". Your changes were merged in
v4.4-rc1 and indeed deferred initialization doesn't play a role from then on,
but the behavior remained identical.

>> From what I understood, now order-0 allocation keep no reserve at all.
>
> Watermarks should still be preserved. zone_watermark_ok is still there.
> What might change is the size of reserves for high-order atomic
> allocations only. Fragmentation shouldn't be a factor. I'm missing some
> major part of the picture.
>

I CC'ed you on the last email, as I found out that you are the author of the
relevant patches - please see the problem description:
https://lkml.org/lkml/2016/5/30/1056

Anyway, when using a v4.4.8 baseline, after reverting the patches below:
97a16fc - mm, page_alloc: only enforce watermarks for order-0 allocations
0aaa29a - mm, page_alloc: reserve pageblocks for high-order atomic
allocations on demand
974a786 - mm, page_alloc: remove MIGRATE_RESERVE
+ adding the early_page_nid_uninitialised() modification,

I stop receiving page alloc failure dumps like this one:
http://pastebin.com/FhRW5DsF, and the performance in my test looks very
similar. I'd like to understand this phenomenon and check whether it's
possible to avoid such page-alloc-failure hiccups in a nice way. Once the
dumps stop, the kernel remains stable, but is such behavior expected and
intended?

What interested me in the above-mentioned patches is that the last-resort
fallback to the reserve on allocation failure ('retry_reserve') was removed
from __rmqueue() by:
974a786 - mm, page_alloc: remove MIGRATE_RESERVE
A sentence in the next commit's log (0aaa29a - mm, page_alloc: reserve
pageblocks for high-order atomic allocations on demand) also caught my
attention - it begins with the words: "The reserved pageblocks can not be
used for order-0 allocations." This is why I understood that for this kind
of allocation there is no reserve kept and we need to count on successful
reclaim. However, under heavy stress it seems that this mechanism may not be
sufficient. Am I interpreting it correctly?
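
(For reference, the removed fallback looked roughly like the sketch below in
pre-v4.4 mm/page_alloc.c - reproduced from memory, so treat it as an
approximation rather than the exact code.)

/* pre-v4.4 __rmqueue(), before 974a786 (approximate) */
static struct page *__rmqueue(struct zone *zone, unsigned int order,
			      int migratetype)
{
	struct page *page;

retry_reserve:
	page = __rmqueue_smallest(zone, order, migratetype);

	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
		page = __rmqueue_fallback(zone, order, migratetype);

		/*
		 * Last resort: dip into the MIGRATE_RESERVE pageblocks
		 * rather than fail the allocation outright. This is the
		 * path that disappeared together with MIGRATE_RESERVE.
		 */
		if (!page) {
			migratetype = MIGRATE_RESERVE;
			goto retry_reserve;
		}
	}

	trace_mm_page_alloc_zone_locked(page, order, migratetype);
	return page;
}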

For the record: the newest kernel on which I was able to reproduce the dumps
is v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1, which
comprises a lot of mm changes (mainly yours), and I'm wondering whether a
spot fix is possible or rather a series of improvements is needed. I'm
looking forward to your opinion and would be grateful for any advice.

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-02 19:01             ` Marcin Wojtas
@ 2016-06-03  9:53               ` Mel Gorman
  2016-06-03 11:57                 ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2016-06-03  9:53 UTC (permalink / raw)
  To: Marcin Wojtas
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

On Thu, Jun 02, 2016 at 09:01:55PM +0200, Marcin Wojtas wrote:
> >> From what I understood, now order-0 allocation keep no reserve at all.
> >
> > Watermarks should still be preserved. zone_watermark_ok is still there.
> > What might change is the size of reserves for high-order atomic
> > allocations only. Fragmentation shouldn't be a factor. I'm missing some
> > major part of the picture.
> >
> 
> I CC'ed you in the last email, as I found out your authorship of
> interesting patches - please see problem description
> https://lkml.org/lkml/2016/5/30/1056
> 
> Anyway when using v4.4.8 baseline, after reverting below patches:
> 97a16fc - mm, page_alloc: only enforce watermarks for order-0 allocations
> 0aaa29a - mm, page_alloc: reserve pageblocks for high-order atomic
> allocations on demand
> 974a786 - mm, page_alloc: remove MIGRATE_RESERVE
> + adding early_page_nid_uninitialised() modification
> 

The early_page check is wrong because of the check itself rather than the
function, so that was the bug there.

> I stop receiving page alloc fail dumps like this one
> http://pastebin.com/FhRW5DsF, also performance in my test looks very
> similar. I'd like to understand this phenomenon and check if it's
> possible to avoid such page-alloc-fail hickups in a nice way.
> Afterwards, once the dumps finish, the kernel remain stable, but is
> such behavior expected and intended?
> 

Looking at the pastebin, the page allocation failure appears to be partially
due to CMA. If the free_cma pages are subtracted from the free pages then the
result is very close to the low watermark. I suspect kswapd was already
active but had not acted in time to prevent the first allocation failure. The
effect of MIGRATE_RESERVE was to give kswapd a larger window to do work in,
but that is a coincidence. Relying on it for an order-0 allocation would
fragment that area, which in your particular case may not matter but actually
violates what MIGRATE_RESERVE was for.
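
(For reference, the free_cma subtraction described above happens in the
watermark check itself; in a v4.4-era tree __zone_watermark_ok() does roughly
the following - paraphrased, not an exact copy.)

	/* mm/page_alloc.c, __zone_watermark_ok() (v4.4-era, approximate) */
	long min = mark;

	/* free_pages may go negative - that's OK */
	free_pages -= (1 << order) - 1;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;

#ifdef CONFIG_CMA
	/* If the allocation can't use CMA areas, ignore free CMA pages */
	if (!(alloc_flags & ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

	/* Order-0 check: below this, atomic requests fail, others reclaim */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;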

> For the record: the newest kernel I was able to reproduce the dumps
> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
> which comprise a lot (mainly yours) changes in mm, and I'm wondering
> if there may be a spot fix or rather a series of improvements. I'm
> looking forward to your opinion and would be grateful for any advice.
> 

I don't believe we want to reintroduce the reserve to cope with CMA. One
option would be to widen the gap between the low and min watermarks by the
size of the CMA region. The effect would be to wake kswapd earlier, which
matters considering that the context of the failing allocation was
GFP_ATOMIC.

The GFP_ATOMIC itself is interesting. If I'm reading this correctly,
scsi_get_cmd_from_req() was called from scsi_prep() where it was passing in
GFP_ATOMIC, but in the page allocation failure __GFP_ATOMIC is not set. It
would be worth chasing down whether the allocation site really was GFP_ATOMIC
and, if so, isolating what stripped that flag and seeing whether it was a
mistake.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-03  9:53               ` Mel Gorman
@ 2016-06-03 11:57                 ` Marcin Wojtas
  2016-06-03 12:36                   ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-03 11:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

Hi Mel,


2016-06-03 11:53 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
> On Thu, Jun 02, 2016 at 09:01:55PM +0200, Marcin Wojtas wrote:
>> >> From what I understood, now order-0 allocation keep no reserve at all.
>> >
>> > Watermarks should still be preserved. zone_watermark_ok is still there.
>> > What might change is the size of reserves for high-order atomic
>> > allocations only. Fragmentation shouldn't be a factor. I'm missing some
>> > major part of the picture.
>> >
>>
>> I CC'ed you in the last email, as I found out your authorship of
>> interesting patches - please see problem description
>> https://lkml.org/lkml/2016/5/30/1056
>>
>> Anyway when using v4.4.8 baseline, after reverting below patches:
>> 97a16fc - mm, page_alloc: only enforce watermarks for order-0 allocations
>> 0aaa29a - mm, page_alloc: reserve pageblocks for high-order atomic
>> allocations on demand
>> 974a786 - mm, page_alloc: remove MIGRATE_RESERVE
>> + adding early_page_nid_uninitialised() modification
>>
>
> The early_page check is wrong because of the check itself rather than
> the function so that was the bug there.

Regardless of whether it was reasonable to do this check there, the behavior
for all arches other than x86 was silently changed by 7e18adb4f80b ("mm:
meminit: initialise remaining struct pages in parallel with kswapd"), so I'd
consider that a bug as well.

>
>> I stop receiving page alloc fail dumps like this one
>> http://pastebin.com/FhRW5DsF, also performance in my test looks very
>> similar. I'd like to understand this phenomenon and check if it's
>> possible to avoid such page-alloc-fail hickups in a nice way.
>> Afterwards, once the dumps finish, the kernel remain stable, but is
>> such behavior expected and intended?
>>
>
> Looking at the pastebin, the page allocation failure appears to be partially
> due to CMA. If the free_cma pages are substracted from the free pages then
> it's very close to the low watermark. I suspect kswapd was already active
> but it had not acted in time to prevent the first allocation. The impact
> of MIGRATE_RESERVE was to give a larger window for kswapd to do work in
> but it's a co-incidence. By relying on it for an order-0 allocation it
> would fragment that area which in your particular case may not matter but
> actually violates what MIGRATE_RESERVE was for.

Indeed it's a very fragile problem and seems to depend on coincidences -
e.g. in contrast to buildroot, under Ubuntu the same test does not end up
dumping the failure information, so I suspect certain timings are satisfied
because, for example, more services run in the background. Indeed free_cma is
very close to the overall free pages here, but usually (especially in older
kernels, e.g. v4.4.8: http://pastebin.com/FhRW5DsF) the gap is much bigger.
This may indicate that the root cause has varied over time.

>
>> For the record: the newest kernel I was able to reproduce the dumps
>> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
>> which comprise a lot (mainly yours) changes in mm, and I'm wondering
>> if there may be a spot fix or rather a series of improvements. I'm
>> looking forward to your opinion and would be grateful for any advice.
>>
>
> I don't believe we want to reintroduce the reserve to cope with CMA. One
> option would be to widen the gap between low and min watermark by the
> size of the CMA region. The effect would be to wake kswapd earlier which
> matters considering the context of the failing allocation was
> GFP_ATOMIC.

Of course my intention is not to reintroduce anything that's gone for good,
but just to find a way to overcome the current issues. Do you mean increasing
the CMA size? At the very beginning I played with the CMA size (I even
increased it from 16M to 96M), but it didn't help. Do you think there is any
other way to trigger kswapd earlier?

>
> The GFP_ATOMIC itself is interesting. If I'm reading this correctly,
> scsi_get_cmd_from_req() was called from scsi_prep() where it was passing
> in GFP_ATOMIC but in the page allocation failure, __GFP_ATOMIC is not
> set. It would be worth chasing down if the allocation site really was
> GFP_ATOMIC and if so, isolate what stripped that flag and see if it was
> a mistake.
>

Printing the gfp flags was introduced only recently, and I didn't check them
(apart from playing with __GFP_NOWARN in various places) in older kernels.
Thanks for this observation, I'll try to track it down.

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-03 11:57                 ` Marcin Wojtas
@ 2016-06-03 12:36                   ` Mel Gorman
  2016-06-07 17:36                     ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2016-06-03 12:36 UTC (permalink / raw)
  To: Marcin Wojtas
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

On Fri, Jun 03, 2016 at 01:57:06PM +0200, Marcin Wojtas wrote:
> >> For the record: the newest kernel I was able to reproduce the dumps
> >> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
> >> which comprise a lot (mainly yours) changes in mm, and I'm wondering
> >> if there may be a spot fix or rather a series of improvements. I'm
> >> looking forward to your opinion and would be grateful for any advice.
> >>
> >
> > I don't believe we want to reintroduce the reserve to cope with CMA. One
> > option would be to widen the gap between low and min watermark by the
> > size of the CMA region. The effect would be to wake kswapd earlier which
> > matters considering the context of the failing allocation was
> > GFP_ATOMIC.
> 
> Of course my intention is not reintroducing anything that's gone
> forever, but just to find out way to overcome current issues. Do you
> mean increasing CMA size?

No. There is a gap between the low and min watermarks. At the low point,
kswapd is woken up, and at the min point allocation requests either enter
direct reclaim or fail if they are atomic. What I'm suggesting is that you
adjust the low watermark, adding the size of the CMA area to it, so that
kswapd is woken earlier. The watermarks are calculated in
__setup_per_zone_wmarks().
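
(For reference, the v4.4-era calculation is roughly the fragment below, and
the suggested experiment amounts to something like the commented addition -
a sketch only, not a proposed fix.)

	/*
	 * mm/page_alloc.c, __setup_per_zone_wmarks() (approximate); 'tmp' is
	 * the zone's proportional share of min_free_kbytes, in pages.
	 */
	zone->watermark[WMARK_MIN]  = tmp;
	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

	/*
	 * Hypothetical experiment along the lines suggested above: widen the
	 * low/high marks by the free CMA pages so kswapd is woken before the
	 * CMA-inflated free count masks real pressure on non-CMA free lists.
	 */
	zone->watermark[WMARK_LOW]  += zone_page_state(zone, NR_FREE_CMA_PAGES);
	zone->watermark[WMARK_HIGH] += zone_page_state(zone, NR_FREE_CMA_PAGES);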

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-03 12:36                   ` Mel Gorman
@ 2016-06-07 17:36                     ` Marcin Wojtas
  2016-06-08 10:09                       ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-07 17:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

Hi Mel,



2016-06-03 14:36 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
> On Fri, Jun 03, 2016 at 01:57:06PM +0200, Marcin Wojtas wrote:
>> >> For the record: the newest kernel I was able to reproduce the dumps
>> >> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
>> >> which comprise a lot (mainly yours) changes in mm, and I'm wondering
>> >> if there may be a spot fix or rather a series of improvements. I'm
>> >> looking forward to your opinion and would be grateful for any advice.
>> >>
>> >
>> > I don't believe we want to reintroduce the reserve to cope with CMA. One
>> > option would be to widen the gap between low and min watermark by the
>> > size of the CMA region. The effect would be to wake kswapd earlier which
>> > matters considering the context of the failing allocation was
>> > GFP_ATOMIC.
>>
>> Of course my intention is not reintroducing anything that's gone
>> forever, but just to find out way to overcome current issues. Do you
>> mean increasing CMA size?
>
> No. There is a gap between the low and min watermarks. At the low point,
> kswapd is woken up and at the min point allocation requests either
> either direct reclaim or fail if they are atomic. What I'm suggesting
> is that you adjust the low watermark and add the size of the CMA area
> to it so that kswapd is woken earlier. The watermarks are calculated in
> __setup_per_zone_wmarks
>

I printed the settings of all zones whose watermarks are configured in
__setup_per_zone_wmarks(). There are three - DMA, Normal and Movable - and
only the first one's watermarks have non-zero values. Increasing the DMA min
watermark didn't help. I also tried increasing /proc/sys/vm/min_free_kbytes
from ~2560 to 16000 (__setup_per_zone_wmarks() recalculates the watermarks
after that) - no effect either.

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-07 17:36                     ` Marcin Wojtas
@ 2016-06-08 10:09                       ` Mel Gorman
  2016-06-09 18:13                         ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2016-06-08 10:09 UTC (permalink / raw)
  To: Marcin Wojtas
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

On Tue, Jun 07, 2016 at 07:36:57PM +0200, Marcin Wojtas wrote:
> Hi Mel,
> 
> 
> 
> 2016-06-03 14:36 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
> > On Fri, Jun 03, 2016 at 01:57:06PM +0200, Marcin Wojtas wrote:
> >> >> For the record: the newest kernel I was able to reproduce the dumps
> >> >> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
> >> >> which comprise a lot (mainly yours) changes in mm, and I'm wondering
> >> >> if there may be a spot fix or rather a series of improvements. I'm
> >> >> looking forward to your opinion and would be grateful for any advice.
> >> >>
> >> >
> >> > I don't believe we want to reintroduce the reserve to cope with CMA. One
> >> > option would be to widen the gap between low and min watermark by the
> >> > size of the CMA region. The effect would be to wake kswapd earlier which
> >> > matters considering the context of the failing allocation was
> >> > GFP_ATOMIC.
> >>
> >> Of course my intention is not reintroducing anything that's gone
> >> forever, but just to find out way to overcome current issues. Do you
> >> mean increasing CMA size?
> >
> > No. There is a gap between the low and min watermarks. At the low point,
> > kswapd is woken up and at the min point allocation requests either
> > either direct reclaim or fail if they are atomic. What I'm suggesting
> > is that you adjust the low watermark and add the size of the CMA area
> > to it so that kswapd is woken earlier. The watermarks are calculated in
> > __setup_per_zone_wmarks
> >
> 
> I printed all zones' settings, whose watermarks are configured within
> __setup_per_zone_wmarks(). There are three DMA, Normal and Movable -
> only first one's watermarks have non-zero values. Increasing DMA min
> watermark didn't help. I also played with increasing

Patch?

Did you establish why GFP_ATOMIC (assuming that's the failing site) had
not specified __GFP_ATOMIC at the time of the allocation failure?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-08 10:09                       ` Mel Gorman
@ 2016-06-09 18:13                         ` Marcin Wojtas
  2016-06-10 16:08                           ` Marcin Wojtas
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-09 18:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

Hi Mel,

My last email got cut in half.

2016-06-08 12:09 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
> On Tue, Jun 07, 2016 at 07:36:57PM +0200, Marcin Wojtas wrote:
>> Hi Mel,
>>
>>
>>
>> 2016-06-03 14:36 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
>> > On Fri, Jun 03, 2016 at 01:57:06PM +0200, Marcin Wojtas wrote:
>> >> >> For the record: the newest kernel I was able to reproduce the dumps
>> >> >> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
>> >> >> which comprise a lot (mainly yours) changes in mm, and I'm wondering
>> >> >> if there may be a spot fix or rather a series of improvements. I'm
>> >> >> looking forward to your opinion and would be grateful for any advice.
>> >> >>
>> >> >
>> >> > I don't believe we want to reintroduce the reserve to cope with CMA. One
>> >> > option would be to widen the gap between low and min watermark by the
>> >> > size of the CMA region. The effect would be to wake kswapd earlier which
>> >> > matters considering the context of the failing allocation was
>> >> > GFP_ATOMIC.
>> >>
>> >> Of course my intention is not reintroducing anything that's gone
>> >> forever, but just to find out way to overcome current issues. Do you
>> >> mean increasing CMA size?
>> >
>> > No. There is a gap between the low and min watermarks. At the low point,
>> > kswapd is woken up and at the min point allocation requests either
>> > either direct reclaim or fail if they are atomic. What I'm suggesting
>> > is that you adjust the low watermark and add the size of the CMA area
>> > to it so that kswapd is woken earlier. The watermarks are calculated in
>> > __setup_per_zone_wmarks
>> >
>>
>> I printed all zones' settings, whose watermarks are configured within
>> __setup_per_zone_wmarks(). There are three DMA, Normal and Movable -
>> only first one's watermarks have non-zero values. Increasing DMA min
>> watermark didn't help. I also played with increasing
>
> Patch?
>

I played with increasing min_free_kbytes from ~2600 to 16000. It shifted the
watermark levels calculated in __setup_per_zone_wmarks(), however only for
the DMA zone; Normal and Movable remained at 0. No progress with avoiding the
page alloc failures - the gap between 'free' and 'free_cma' was huge, so I
don't think CMA itself is the root cause.

> Did you establish why GFP_ATOMIC (assuming that's the failing site) had
> not specified __GFP_ATOMIC at the time of the allocation failure?
>

Yes. It happens in new_slab(), in the following line:
return allocate_slab(s, flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
I added "| GFP_ATOMIC" and in that case I got the same dumps but with one
more bit set in gfp_mask, so I don't think this is the issue.
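
(A minimal illustration of the stripping mechanism being discussed, with the
mask contents paraphrased from a v4.4-era tree - treat the exact flag list as
an approximation.)

/* mm/internal.h (v4.4-era, approximate): note that __GFP_ATOMIC is absent */
#define GFP_RECLAIM_MASK (__GFP_RECLAIM | __GFP_HIGH | __GFP_IO | __GFP_FS | \
			  __GFP_NOWARN | __GFP_REPEAT | __GFP_NOFAIL |	      \
			  __GFP_NORETRY | __GFP_MEMALLOC | __GFP_NOMEMALLOC)

/* mm/slub.c, new_slab(): any caller-supplied __GFP_ATOMIC is masked off here */
return allocate_slab(s, flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);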

The latest patches in v4.7-rc1 seem to boost page alloc performance enough
to avoid the problems observed between v4.2 and v4.6. Hence, until we rebase
from v4.4 to another LTS (>v4.7) in the future, we decided as a workaround to
return to using MIGRATE_RESERVE plus the fix for
early_page_nid_uninitialised(). Operation now seems stable on all our SoCs
during the tests.

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] Page allocation failures with newest kernels
  2016-06-09 18:13                         ` Marcin Wojtas
@ 2016-06-10 16:08                           ` Marcin Wojtas
  0 siblings, 0 replies; 15+ messages in thread
From: Marcin Wojtas @ 2016-06-10 16:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Will Deacon, Yehuda Yitschak, Robin Murphy, linux-mm,
	linux-kernel, linux-arm-kernel, Lior Amsalem, Thomas Petazzoni,
	Catalin Marinas, Arnd Bergmann, Grzegorz Jaszczyk, Nadav Haklai,
	Tomasz Nowicki, Gregory Clément

Hi Mel,

Thanks for posting the patch. I tested it on kernel v4.4.8. Although
"mode:0x2284020" shows that __GFP_ATOMIC is no longer stripped, the issue
remains:
http://pastebin.com/DmezUJSc

Best regards,
Marcin

2016-06-09 20:13 GMT+02:00 Marcin Wojtas <mw@semihalf.com>:
> Hi Mel,
>
> My last email got cut in half.
>
> 2016-06-08 12:09 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
>> On Tue, Jun 07, 2016 at 07:36:57PM +0200, Marcin Wojtas wrote:
>>> Hi Mel,
>>>
>>>
>>>
>>> 2016-06-03 14:36 GMT+02:00 Mel Gorman <mgorman@techsingularity.net>:
>>> > On Fri, Jun 03, 2016 at 01:57:06PM +0200, Marcin Wojtas wrote:
>>> >> >> For the record: the newest kernel I was able to reproduce the dumps
>>> >> >> was v4.6: http://pastebin.com/ekDdACn5. I've just checked v4.7-rc1,
>>> >> >> which comprise a lot (mainly yours) changes in mm, and I'm wondering
>>> >> >> if there may be a spot fix or rather a series of improvements. I'm
>>> >> >> looking forward to your opinion and would be grateful for any advice.
>>> >> >>
>>> >> >
>>> >> > I don't believe we want to reintroduce the reserve to cope with CMA. One
>>> >> > option would be to widen the gap between low and min watermark by the
>>> >> > size of the CMA region. The effect would be to wake kswapd earlier which
>>> >> > matters considering the context of the failing allocation was
>>> >> > GFP_ATOMIC.
>>> >>
>>> >> Of course my intention is not reintroducing anything that's gone
>>> >> forever, but just to find out way to overcome current issues. Do you
>>> >> mean increasing CMA size?
>>> >
>>> > No. There is a gap between the low and min watermarks. At the low point,
>>> > kswapd is woken up and at the min point allocation requests either
>>> > either direct reclaim or fail if they are atomic. What I'm suggesting
>>> > is that you adjust the low watermark and add the size of the CMA area
>>> > to it so that kswapd is woken earlier. The watermarks are calculated in
>>> > __setup_per_zone_wmarks
>>> >
>>>
>>> I printed all zones' settings, whose watermarks are configured within
>>> __setup_per_zone_wmarks(). There are three DMA, Normal and Movable -
>>> only first one's watermarks have non-zero values. Increasing DMA min
>>> watermark didn't help. I also played with increasing
>>
>> Patch?
>>
>
> I played with increasing min_free_kbytes from ~2600 to 16000. It
> resulted in shifting watermarks levels in __setup_per_zone_wmarks(),
> however only for zone DMA. Normal and Movable remained at 0. No
> progress with avoiding page alloc failures - a gap between 'free' and
> 'free_cma' was huge, so I don't think that CMA itself would be a root
> cause.
>
>> Did you establish why GFP_ATOMIC (assuming that's the failing site) had
>> not specified __GFP_ATOMIC at the time of the allocation failure?
>>
>
> Yes. It happens in new_slab() in following lines:
> return allocate_slab(s, flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> I added "| GFP_ATOMIC" and in such case I got same dumps but with one
> bit set more in gfp_mask, so I don't think it's an issue.
>
> Latest patches in v4.7-rc1 seem to boost page alloc performance enough
> to avoid problems observed between v4.2 and v4.6. Hence before
> rebasing from v4.4 to another LTS >v4.7 in future, we decided as a WA
> to return to using MIGRATE_RESERVE + adding fix for
> early_page_nid_uninitialised(). Now operation seems stable on all our
> SoC's during the tests.
>
> Best regards,
> Marcin

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-06-10 16:08 UTC | newest]

Thread overview: 15+ messages
2016-05-31  3:02 [BUG] Page allocation failures with newest kernels Marcin Wojtas
2016-05-31 10:17 ` Robin Murphy
2016-05-31 10:29   ` Marcin Wojtas
2016-05-31 13:10     ` Yehuda Yitschak
2016-05-31 13:15       ` Will Deacon
2016-06-02  5:48         ` Marcin Wojtas
2016-06-02 13:52           ` Mel Gorman
2016-06-02 19:01             ` Marcin Wojtas
2016-06-03  9:53               ` Mel Gorman
2016-06-03 11:57                 ` Marcin Wojtas
2016-06-03 12:36                   ` Mel Gorman
2016-06-07 17:36                     ` Marcin Wojtas
2016-06-08 10:09                       ` Mel Gorman
2016-06-09 18:13                         ` Marcin Wojtas
2016-06-10 16:08                           ` Marcin Wojtas
