* THP broken on OCTEON?
@ 2016-05-23 15:13 Aaro Koskinen
  2016-05-23 15:20 ` Ralf Baechle
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-23 15:13 UTC
  To: David Daney, Ralf Baechle, linux-mips

Hi,

I'm getting kernel crashes (see below) reliably when building Perl in
parallel (make -j16) on an OCTEON EBH5600 board (8 cores, 4 GB RAM) with
Linux 4.6.

It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
issue - disabling it makes the build go through fine.

Any ideas?

A.

[ 2457.467155] Got mcheck at 00000001200a82b4
[ 2457.479447] CPU: 6 PID: 15916 Comm: lib/unicore/mkt Not tainted 4.6.0-octeon-distro.git-v2.16-1-gfc3b10e-dirty-00001-g16a7aa0 #1
[ 2457.514121] task: 80000000eccf2b80 ti: 80000000ecda4000 task.ti: 80000000ecda4000
[ 2457.536551] $ 0   : 0000000000000000 3e000000105bc006 0000000000000000 ffffffff957e4728
[ 2457.560686] $ 4   : 00000000000000f2 0000000000000067 000000012015e8ab 00000000332295cf
[ 2457.584822] $ 8   : 0000000000000000 0000000000000000 0000000000000001 0000000000000003
[ 2457.608957] $12   : 00000001204e04d8 0000000000000008 0000000000000001 ffffffffffffffff
[ 2457.633093] $16   : 0000000120383d60 00000001203a3828 00000000332295cf 000000000000000b
[ 2457.657228] $20   : 000000012015e8a0 0000000000000000 000000000000000c 0000000000000000
[ 2457.681363] $24   : 0000000000000010 00000001200a80e8                                  
[ 2457.705496] $28   : 00000001201a0300 000000ffffda82a0 000000012019b9b8 0000000120039f5c
[ 2457.729631] Hi    : 0000000000000000
[ 2457.740341] Lo    : 0000000000000008
[ 2457.751055] epc   : 00000001200a82b4 0x1200a82b4
[ 2457.764891] ra    : 0000000120039f5c 0x120039f5c
[ 2457.778726] Status: 00308cf3	KX SX UX USER EXL IE 
[ 2457.793284] Cause : 00800060 (ExcCode 18)
[ 2457.805296] PrId  : 000d0409 (Cavium Octeon+)
[ 2457.818350] Index    : 80000000
[ 2457.827759] PageMask : 1fe000
[ 2457.836646] EntryHi  : 00000001203820f4
[ 2457.848136] EntryLo0 : 00000000105b8006
[ 2457.859628] EntryLo1 : 00000000105bc006
[ 2457.871119] Wired    : 0
[ 2457.878704] PageGrain: e0000000
[ 2457.888111] 
[ 2457.892573] Index: 25 pgmask=4kb va=00120456000 asid=f4
[ 2457.908256] 	[ri=0 xi=0 pa=000e47d3000 c=0 d=1 v=1 g=0] [ri=0 xi=0 pa=000c31bc000 c=0 d=1 v=1 g=0]
[ 2457.935230] Index: 26 pgmask=4kb va=001200a8000 asid=f4
[ 2457.950915] 	[ri=0 xi=0 pa=000e0e1c000 c=0 d=0 v=1 g=0] [ri=0 xi=0 pa=000c50ed000 c=0 d=0 v=1 g=0]
[ 2457.977888] Index: 27 pgmask=4kb va=001203a2000 asid=f4
[ 2457.993574] 	[ri=0 xi=0 pa=00000000000 c=0 d=0 v=0 g=0] [ri=0 xi=1 pa=0009005a000 c=1 d=0 v=1 g=0]
[ 2458.020548] 
[ 2458.025008] 
Code: de100000  1200001c  00000000 <de110008> 8e220000  1452fffa  00000000  8e220004  1453fff7 
[ 2458.054470] Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
[ 2458.087614] ---[ end Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
[ 2458.122835] 
do_page_fault(): sending SIGSEGV to make for invalid write access to 0000000000000012[ 2458.149565] 
[ 2458.149565] do_page_fault(): sending SIGSEGV to miniperl for invalid write access to 0000000000000010epc = 0000000120089500 in miniperl[120000000+181000]ra  = 00000001200c18a4 in miniperl[120000000+181000][ 2458.149590] 

[ 2458.212999] epc = 0000000120015400 in make[120000000+35000]
[ 2458.229780] ra  = 000000ffeca7f570 in[ 2458.240797] 

*** NMI Watchdog interrupt on Core 0x0 ***

A.

* Re: THP broken on OCTEON?
  2016-05-23 15:13 THP broken on OCTEON? Aaro Koskinen
@ 2016-05-23 15:20 ` Ralf Baechle
  2016-05-23 16:21   ` David Daney
  2016-05-23 18:57   ` Joshua Kinard
  2016-05-25 13:41 ` Aaro Koskinen
  2016-06-22 22:05 ` David Daney
  2 siblings, 2 replies; 28+ messages in thread
From: Ralf Baechle @ 2016-05-23 15:20 UTC
  To: Aaro Koskinen, Joshua Kinard; +Cc: David Daney, linux-mips

On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:

> I'm getting kernel crashes (see below) reliably when building Perl in
> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> Linux 4.6.
> 
> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> issue - disabling it makes build go through fine.
> 
> Any ideas?

I thought it was working except on SGI Origin 200/2000 aka IP27 where
Joshua Kinard (added to cc) was hitting issues as well.

Joshua, does that look similar to the issues you were hitting?

  Ralf

* Re: THP broken on OCTEON?
  2016-05-23 15:20 ` Ralf Baechle
@ 2016-05-23 16:21   ` David Daney
  2016-05-23 18:52     ` Aaro Koskinen
  2016-05-23 18:57   ` Joshua Kinard
  1 sibling, 1 reply; 28+ messages in thread
From: David Daney @ 2016-05-23 16:21 UTC
  To: Ralf Baechle; +Cc: Aaro Koskinen, Joshua Kinard, linux-mips, Hill, Steven

On 05/23/2016 08:20 AM, Ralf Baechle wrote:
> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>
>> I'm getting kernel crashes (see below) reliably when building Perl in
>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>> Linux 4.6.
>>
>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>> issue - disabling it makes build go through fine.
>>
>> Any ideas?
>
> I thought it was working except on SGI Origin 200/2000 aka IP27 where
> Joshua Kinard (added to cc) was hitting issues as well.
>
> Joshua, does that similar to the issues you were hitting?


There is nothing OCTEON specific in the THP code, or huge pages in general.

That said, we have seen other THP related failures, and have never been 
able to find the cause.

If someone can come up with a reproducible test case that triggers 
quickly, we can run it in our simulator and easily find the problem.

There are THP tweaking knobs in /sys/kernel/mm/transparent_hugepage.  If 
you reduce the time in khugepaged/scan_sleep_millisecs, it often makes 
things fail much more quickly.
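
A minimal sketch of lowering that knob from a script, assuming it runs as
root on a kernel built with CONFIG_TRANSPARENT_HUGEPAGE. The helper name
set_khugepaged_scan_sleep and its base parameter are made up for this
illustration; only the sysfs path comes from the message above.

```python
from pathlib import Path

# Directory named in the message above.
THP_SYSFS = Path("/sys/kernel/mm/transparent_hugepage")


def set_khugepaged_scan_sleep(millisecs, base=THP_SYSFS):
    """Write a new khugepaged scan interval and return the previous one.

    `base` is a hypothetical parameter so the helper can be pointed at a
    scratch directory when trying it out without touching the real knob.
    """
    knob = Path(base) / "khugepaged" / "scan_sleep_millisecs"
    previous = int(knob.read_text().strip())
    knob.write_text(f"{millisecs}\n")
    return previous
```

Calling set_khugepaged_scan_sleep(100) would make khugepaged scan far more
often; writing the returned value back restores the earlier setting.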

David.

* Re: THP broken on OCTEON?
  2016-05-23 16:21   ` David Daney
@ 2016-05-23 18:52     ` Aaro Koskinen
  2016-05-23 19:03         ` David Daney
  2016-05-23 19:08       ` Joshua Kinard
  0 siblings, 2 replies; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-23 18:52 UTC
  To: David Daney
  Cc: Ralf Baechle, Aaro Koskinen, Joshua Kinard, linux-mips, Hill, Steven

On Mon, May 23, 2016 at 09:21:22AM -0700, David Daney wrote:
> On 05/23/2016 08:20 AM, Ralf Baechle wrote:
> >On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> >>I'm getting kernel crashes (see below) reliably when building Perl in
> >>parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> >>Linux 4.6.
> >>
> >>It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> >>issue - disabling it makes build go through fine.
> >>
> >>Any ideas?
> >
> >I thought it was working except on SGI Origin 200/2000 aka IP27 where
> >Joshua Kinard (added to cc) was hitting issues as well.
> >
> >Joshua, does that similar to the issues you were hitting?
> 
> There is nothing OCTEON specific in the THP code, or huge pages in general.
> 
> That said, we have seen other THP related failures, and have never been able
> to find the cause.
> 
> If someone can come up with a reproducible test case that triggers quickly,
> we can run it in our simulator and easily find the problem.

Trying to build Perl is a reliable reproducer. Is that too heavyweight
for your simulator?

I was also able to reproduce this on an EdgeRouter Pro, but there the kernel
does not fail; only the compiler dies with SIGBUS:

[  315.095264] Data bus error, epc == 0000000000a801c4, ra == 0000000000a80624

And without THP the build is fine.

I also tried a CN68XX board with 16 GB RAM, and there too I get a SIGBUS
failure instead of a Machine Check.

A.

* Re: THP broken on OCTEON?
  2016-05-23 15:20 ` Ralf Baechle
  2016-05-23 16:21   ` David Daney
@ 2016-05-23 18:57   ` Joshua Kinard
  2016-05-23 19:22     ` Ralf Baechle
  1 sibling, 1 reply; 28+ messages in thread
From: Joshua Kinard @ 2016-05-23 18:57 UTC
  To: linux-mips

On 05/23/2016 11:20, Ralf Baechle wrote:
> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> 
>> I'm getting kernel crashes (see below) reliably when building Perl in
>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>> Linux 4.6.
>>
>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>> issue - disabling it makes build go through fine.
>>
>> Any ideas?
> 
> I thought it was working except on SGI Origin 200/2000 aka IP27 where
> Joshua Kinard (added to cc) was hitting issues as well.
> 
> Joshua, does that similar to the issues you were hitting?
> 
>   Ralf

NAK, this issue looks completely different to IP30/IP27.  In this case, it
looks like the hardware is detecting the case where multiple TLB entries match
and it's killing the machine to avoid hardware damage.  I don't want to know
how the SGI systems handle this scenario (does the R10000 do a TLB shutdown??).
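
One way to read the register dump in the original report is to check which
of the listed TLB entries overlap the entry described by EntryHi and
PageMask. The sketch below does that arithmetic with values copied from the
dump. The function names are made up for this illustration, and the result
is only a reading of the log, not a confirmed diagnosis.

```python
# Values copied from the register dump in the original report.
ENTRY_HI = 0x00000001203820F4
PAGE_MASK = 0x1FE000
# (index, va, asid) of the 4 KB entries listed in the TLB dump.
DUMPED_4K_ENTRIES = [
    (25, 0x00120456000, 0xF4),
    (26, 0x001200A8000, 0xF4),
    (27, 0x001203A2000, 0xF4),
]

# A MIPS TLB entry maps an even/odd pair of pages; with 4 KB pages the
# pair spans 8 KB, so the low 13 bits are always offset (or ASID) bits.
PAIR_4K_MASK = 0x1FFF


def page_size(page_mask):
    """Size in bytes of one page of the pair selected by PageMask."""
    return (page_mask + 0x2000) // 2


def entry_range(va, page_mask):
    """Half-open virtual range [start, end) covered by one TLB entry."""
    span = page_mask | PAIR_4K_MASK
    start = va & ~span
    return start, start + span + 1


def overlapping_entries(entry_hi, page_mask, dumped):
    """Indices of dumped 4 KB entries that overlap the EntryHi/PageMask entry."""
    start, end = entry_range(entry_hi, page_mask)
    asid = entry_hi & 0xFF
    hits = []
    for index, va, entry_asid in dumped:
        e_start, e_end = entry_range(va, 0)
        if entry_asid == asid and e_start < end and start < e_end:
            hits.append(index)
    return hits
```

With these inputs, PageMask decodes to 1 MB pages, so the entry spans the
2 MB range 0x120200000 to 0x1203fffff, and of the three dumped entries only
index 27 (va=001203a2000) falls inside it.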

On IP30, using THP usually results in instruction bus errors (IBE), after a set
time, depending on the machine's configuration (<2GB RAM, virtually instant on
userland init; >2GB RAM, might survive for a few minutes, even getting all the
way to runlevel 3 randomly).

IP27 was somewhat similar to IP30, in that THP usually results in IBEs after a
few seconds of hitting userland bringup (bash is pretty quick at triggering an
IBE), but I haven't tried experimenting with varying the amount of RAM in that
machine, due to the fragility of pulling the nodeboards out constantly.  I also
haven't tried THP since refactoring/rewriting the IP27 code back in Feb to see
if I magically fixed it...

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: THP broken on OCTEON?
@ 2016-05-23 19:03         ` David Daney
  0 siblings, 0 replies; 28+ messages in thread
From: David Daney @ 2016-05-23 19:03 UTC
  To: Aaro Koskinen
  Cc: Ralf Baechle, Aaro Koskinen, Joshua Kinard, linux-mips, Hill, Steven

On 05/23/2016 11:52 AM, Aaro Koskinen wrote:
> On Mon, May 23, 2016 at 09:21:22AM -0700, David Daney wrote:
>> On 05/23/2016 08:20 AM, Ralf Baechle wrote:
>>> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>>>> I'm getting kernel crashes (see below) reliably when building Perl in
>>>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>>>> Linux 4.6.
>>>>
>>>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>>>> issue - disabling it makes build go through fine.
>>>>
>>>> Any ideas?
>>>
>>> I thought it was working except on SGI Origin 200/2000 aka IP27 where
>>> Joshua Kinard (added to cc) was hitting issues as well.
>>>
>>> Joshua, does that similar to the issues you were hitting?
>>
>> There is nothing OCTEON specific in the THP code, or huge pages in general.
>>
>> That said, we have seen other THP related failures, and have never been able
>> to find the cause.
>>
>> If someone can come up with a reproducible test case that triggers quickly,
>> we can run it in our simulator and easily find the problem.
>
> Trying to build Perl is a reliable reproducer. Is that too heavyweight
> for your simulator?
>
> I was able to reproduce this also on EdgeRouter Pro, but there the kernel
> does not fail, only compiler dies with SIGBUS:
>
> [  315.095264] Data bus error, epc == 0000000000a801c4, ra == 0000000000a80624
>
> And without THP the build is fine.
>
> I also tried CN68XX board with 16 GB RAM and also there I get SIGBUS failure
> instead of Machine Check.
>

Yes.  I think the problem is some sort of corruption of the page tables.
This may show up as Machine Check errors, bus errors, or SIGSEGV.

David.

* Re: THP broken on OCTEON?
  2016-05-23 18:52     ` Aaro Koskinen
  2016-05-23 19:03         ` David Daney
@ 2016-05-23 19:08       ` Joshua Kinard
  2016-05-23 20:02         ` Alastair Bridgewater
  1 sibling, 1 reply; 28+ messages in thread
From: Joshua Kinard @ 2016-05-23 19:08 UTC
  To: Aaro Koskinen, David Daney
  Cc: Ralf Baechle, Aaro Koskinen, linux-mips, Hill, Steven,
	Alastair Bridgewater

On 05/23/2016 14:52, Aaro Koskinen wrote:
> On Mon, May 23, 2016 at 09:21:22AM -0700, David Daney wrote:
>> On 05/23/2016 08:20 AM, Ralf Baechle wrote:
>>> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>>>> I'm getting kernel crashes (see below) reliably when building Perl in
>>>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>>>> Linux 4.6.
>>>>
>>>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>>>> issue - disabling it makes build go through fine.
>>>>
>>>> Any ideas?
>>>
>>> I thought it was working except on SGI Origin 200/2000 aka IP27 where
>>> Joshua Kinard (added to cc) was hitting issues as well.
>>>
>>> Joshua, does that similar to the issues you were hitting?
>>
>> There is nothing OCTEON specific in the THP code, or huge pages in general.
>>
>> That said, we have seen other THP related failures, and have never been able
>> to find the cause.
>>
>> If someone can come up with a reproducible test case that triggers quickly,
>> we can run it in our simulator and easily find the problem.
> 
> Trying to build Perl is a reliable reproducer. Is that too heavyweight
> for your simulator?
> 
> I was able to reproduce this also on EdgeRouter Pro, but there the kernel
> does not fail, only compiler dies with SIGBUS:
> 
> [  315.095264] Data bus error, epc == 0000000000a801c4, ra == 0000000000a80624
> 
> And without THP the build is fine.
> 
> I also tried CN68XX board with 16 GB RAM and also there I get SIGBUS failure
> instead of Machine Check.

SIGBUS is closer to what I was seeing on IP30/IP27, but there are two different
bus errors on MIPS: a Data Bus Error (DBE) and an Instruction Bus Error (IBE).
I've only seen IBEs result from using THP on Octane/IP30 and Origin/Onyx2/IP27.

Also CC'ing Alastair Bridgewater (nyef), who was working on bringing up the
IP35 hardware (Origin 300/350), as he had been working on tracing down some
possible issues in the TLB code.  He had a small test case at the below address
(use annotation #2), but I don't know if he got any further on debugging it.

http://paste.lisp.org/display/305809#2

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: THP broken on OCTEON?
  2016-05-23 18:57   ` Joshua Kinard
@ 2016-05-23 19:22     ` Ralf Baechle
  2016-05-23 19:40       ` Joshua Kinard
  0 siblings, 1 reply; 28+ messages in thread
From: Ralf Baechle @ 2016-05-23 19:22 UTC
  To: Joshua Kinard; +Cc: linux-mips

On Mon, May 23, 2016 at 02:57:30PM -0400, Joshua Kinard wrote:

> NAK, this issue looks completely different to IP30/IP27.  In this case, it
> looks like the hardware is detecting the case where multiple TLB entries match
> and it's killing the machine to avoid hardware damage.  I don't want to know
> how the SGI systems handle this scenario (does the R10000 do a TLB shutdown??).

The R10000 detects duplicate entries when writing to the TLB and
invalidates the previous entry.  That is, there will never be duplicate
entries in the TLB and of course no TLB shutdown.

That's the theory.  I'm wondering how well that is going to work if
the entries have different page sizes.

And Aaro doesn't always get machine checks, so it's not as if a
duplicate entry is always written.

> On IP30, using THP usually results in instruction bus errors (IBE), after a set
> time, depending on the machine's configuration (<2GB RAM, virtually instant on
> userland init; >2GB RAM, might survive for a few minutes, even getting all the
> way to runlevel 3 randomly).
> 
> IP27 was somewhat similar to IP30, in that THP usually results in IBEs after a
> few seconds of hitting userland bringup (bash is pretty quick at triggering an
> IBE), but I haven't tried experimenting with varying the amount of RAM in that
> machine, due to the fragility of pulling the nodeboards out constantly.  I also
> haven't tried THP since refactoring/rewriting the IP27 code back in Feb to see
> if I magically fixed it...

  Ralf

* Re: THP broken on OCTEON?
  2016-05-23 19:22     ` Ralf Baechle
@ 2016-05-23 19:40       ` Joshua Kinard
  2016-05-23 20:01         ` Ralf Baechle
  2016-05-24 21:21         ` Aaro Koskinen
  0 siblings, 2 replies; 28+ messages in thread
From: Joshua Kinard @ 2016-05-23 19:40 UTC
  To: Ralf Baechle; +Cc: linux-mips

On 05/23/2016 15:22, Ralf Baechle wrote:
> On Mon, May 23, 2016 at 02:57:30PM -0400, Joshua Kinard wrote:
> 
>> NAK, this issue looks completely different to IP30/IP27.  In this case, it
>> looks like the hardware is detecting the case where multiple TLB entries match
>> and it's killing the machine to avoid hardware damage.  I don't want to know
>> how the SGI systems handle this scenario (does the R10000 do a TLB shutdown??).
> 
> The R10000 detects if duplicate entries when writing to the TLB and
> invalidates the previous entry.  That is, there will never be duplicate
> entries in the TLB and of course no TLB shutdown.
> 
> That's the theory.  I'm wondering how well that is going to work if
> the entries are having a different page size.
> 
> And Aaro doesn't always get machine checks so it's not like always a
> duplicate entry is written.
> 
>> On IP30, using THP usually results in instruction bus errors (IBE), after a set
>> time, depending on the machine's configuration (<2GB RAM, virtually instant on
>> userland init; >2GB RAM, might survive for a few minutes, even getting all the
>> way to runlevel 3 randomly).
>>
>> IP27 was somewhat similar to IP30, in that THP usually results in IBEs after a
>> few seconds of hitting userland bringup (bash is pretty quick at triggering an
>> IBE), but I haven't tried experimenting with varying the amount of RAM in that
>> machine, due to the fragility of pulling the nodeboards out constantly.  I also
>> haven't tried THP since refactoring/rewriting the IP27 code back in Feb to see
>> if I magically fixed it...

For IP30, I created a BUGS file in my local source (also in the IP30 patch I
still maintain) that documents some combinations of settings that affect THP
on the platform.  Most importantly, using a PAGE_SIZE other than 4KB also
requires setting MAX_ZONE_ORDER to a decent value, else on Octane, it'd
hit IBEs as soon as the kernel executed /sbin/init.  It also depends on the
amount of RAM in the system:

> >2GB RAM:
>  - In order to use more than 2GB RAM in IP30/Octane requires selecting
>    VERY specific values for certain Kconfig options.  Specifically,
>    the following options under the "Kernel type" submenu:
>      - PAGE_SIZE
>      - Maximum Zone Order
>      - Transparent Hugepages (THP)
> 
>    A table of the specific settings is below:
>     PAGE_SIZE | Zone Order | THP
>    -----------|------------|-----
>        4KB    | 11 to 13   |  N
>       16KB    | 12 Only    |  Y
>       64KB*   | 14 Only    |  Y
> 
>    Any other configuration of these three options will likely lead to
>    Instruction Bus Errors (IBEs) when the kernel loads userland up (when it
>    execve()'s /sbin/init).  Even then, however, the machine will still be
>    very unstable (depending on the operations it does).  Heavy disk I/O
>    still seems capable of crashing the machine due to either NULL pointer
>    dereferences, unhandled kernel unaligned accesses, or Instruction Bus
>    Errors.
> 
>    * Impact users cannot currently use an Impact board with 64KB PAGE_SIZE,
>      THP, and >2GB RAM.  This will trigger a NULL pointer dereference in
>      impact_resize_kpool() (when called initially from impact_common_probe()
>      to set the initial 64KB kpool on pool '0') due to (possibly) vzalloc()
>      returning a NULL pointer when allocating kpool_virt[pool].
> 
>    * THP still has issues on R1x000 CPUs, so user beware.  YMMV.
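
The lower bounds in the table above can be related to the huge page size by
simple arithmetic. The sketch below assumes a 64-bit kernel with 8-byte
page-table entries, PMD-level huge pages, and a buddy allocator whose
largest block is order (zone order - 1), as in kernels of that era. The
function names are made up for this illustration, and the sketch says
nothing about why values above the minimum would fail.

```python
# Assumption: 8-byte page-table entries, as on a 64-bit MIPS kernel.
PTE_SIZE = 8


def huge_page_size(page_size):
    """Bytes covered by one PMD-level huge page.

    One page-table page holds page_size / PTE_SIZE entries, and a huge
    page replaces everything that page-table page could map.
    """
    return page_size * (page_size // PTE_SIZE)


def min_zone_order(page_size):
    """Smallest zone order that still allows allocating one huge page.

    Assumes the largest buddy allocation is order (zone order - 1).
    """
    pages_per_huge_page = huge_page_size(page_size) // page_size
    huge_page_order = pages_per_huge_page.bit_length() - 1
    return huge_page_order + 1
```

For 4 KB, 16 KB, and 64 KB pages this gives minimum orders of 10, 12, and
14, which lines up with the lower end of each row in the table.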


Might try some of those combinations and see if things improve on the Octeon?
IP27 was equally affected by this, minus the bits about RAM and Impact Gfx.
With THP turned off, IP30 can run 64KB PAGE_SIZE without issue (compiles of
packages are actually sped up quite significantly under >4KB PAGE_SIZE).

IP27 has a bug in it somewhere that causes an immediate Oops on 64KB PAGE_SIZE
that I haven't traced down yet (I have the Oops saved somewhere if needed).  So
I use 16KB on that system.

An O2 w/ an RM7000 has virtually no issues at all with 64KB or 16KB PAGE_SIZE
and THP, though it's been several months since I last booted my O2.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: THP broken on OCTEON?
  2016-05-23 19:40       ` Joshua Kinard
@ 2016-05-23 20:01         ` Ralf Baechle
  2016-05-24 21:21         ` Aaro Koskinen
  1 sibling, 0 replies; 28+ messages in thread
From: Ralf Baechle @ 2016-05-23 20:01 UTC
  To: Joshua Kinard; +Cc: linux-mips

On Mon, May 23, 2016 at 03:40:36PM -0400, Joshua Kinard wrote:

> >    * THP still has issues on R1x000 CPUs, so user beware.  YMMV.

As of now I think you can remove the "on R1x000 CPUs, " part, it seems.

> Might try some of those combinations and see if things improve on the Octeon?
> IP27 was equally affected by this, minus the bits about RAM and Impact Gfx.
> turning off THP, IP30 can run 64KB PAGE_SIZE without issue (compiles of
> packages is actually sped up quite significantly under >4KB PAGE_SIZE).
> 
> IP27 has a bug in it somewhere that causes an immediate Oops on 64KB PAGE_SIZE
> that I haven't traced down yet (I have the Oops saved somewhere if needed).  So
> I use 16KB on that system.

> An O2 w/ an RM7000 has virtually no issues at all with 64KB or 16KB PAGE_SIZE
> and THP, though it's been several months since I last booted my O2.

Rather different CPU and notably one that doesn't have any fancy
anti-alias protection afair.

  Ralf

* Re: THP broken on OCTEON?
  2016-05-23 19:08       ` Joshua Kinard
@ 2016-05-23 20:02         ` Alastair Bridgewater
  0 siblings, 0 replies; 28+ messages in thread
From: Alastair Bridgewater @ 2016-05-23 20:02 UTC
  To: Joshua Kinard
  Cc: Aaro Koskinen, David Daney, Ralf Baechle, Aaro Koskinen,
	linux-mips, Hill, Steven

On Mon, May 23, 2016 at 3:08 PM, Joshua Kinard <kumba@gentoo.org> wrote:

> On 05/23/2016 14:52, Aaro Koskinen wrote:
> > On Mon, May 23, 2016 at 09:21:22AM -0700, David Daney wrote:
> >> On 05/23/2016 08:20 AM, Ralf Baechle wrote:
> >>> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> >>>> I'm getting kernel crashes (see below) reliably when building Perl in
> >>>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> >>>> Linux 4.6.
> >>>>
> >>>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> >>>> issue - disabling it makes build go through fine.
> >>>>
> >>>> Any ideas?
> >>>
> >>> I thought it was working except on SGI Origin 200/2000 aka IP27 where
> >>> Joshua Kinard (added to cc) was hitting issues as well.
> >>>
> >>> Joshua, does that similar to the issues you were hitting?
> >>
> >> There is nothing OCTEON specific in the THP code, or huge pages in general.
> >>
> >> That said, we have seen other THP related failures, and have never been able
> >> to find the cause.
> >>
> >> If someone can come up with a reproducible test case that triggers quickly,
> >> we can run it in our simulator and easily find the problem.
> >
> > Trying to build Perl is a reliable reproducer. Is that too heavyweight
> > for your simulator?
> >
> > I was able to reproduce this also on EdgeRouter Pro, but there the kernel
> > does not fail, only compiler dies with SIGBUS:
> >
> > [  315.095264] Data bus error, epc == 0000000000a801c4, ra == 0000000000a80624
> >
> > And without THP the build is fine.
> >
> > I also tried CN68XX board with 16 GB RAM and also there I get SIGBUS failure
> > instead of Machine Check.
>
> SIGBUS is closer to what I was seeing on IP30/IP27, but there's two different
> SIGBUS errors in MIPS, a Data Bus Error (DBE) and Instruction Bus Error (IBE).
>  I've only seen IBEs result from using THP on Octane/IP30 and Origin/Onyx2/IP27.
>
> Also CC'ing Alastair Bridgewater (nyef), who was working on bringing up the
> IP35 hardware (Origin 300/350), as he had been working on tracing down some
> possible issues in the TLB code.  He had a small test case at the below address
> (use annotation #2), but I don't know if he got any further on debugging it.
>
> http://paste.lisp.org/display/305809#2
>

I still haven't gotten anywhere with this. FWIW, we confirmed that it
affects at least some of the R1x000 CPUs, and it doesn't seem to affect
OCTEON (tested on an ERlite-3). And, at least for any IP35 testing (R14000
/ R16000), my kernel configuration does not include
CONFIG_TRANSPARENT_HUGEPAGE. One of the scary parts of this one is the
cases where it DOESN'T fault. It manages to read SOMETHING, but it's not
the right data (typically zeros, but it's still scary, and what if it was a
write instead of a read?). I don't really have any other systems to test
with, so I don't know if it's R10k-specific or somewhat more general.

-- Alastair

* Re: THP broken on OCTEON?
  2016-05-23 19:40       ` Joshua Kinard
  2016-05-23 20:01         ` Ralf Baechle
@ 2016-05-24 21:21         ` Aaro Koskinen
  2016-05-24 22:39           ` David Daney
  1 sibling, 1 reply; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-24 21:21 UTC
  To: Joshua Kinard; +Cc: Ralf Baechle, linux-mips

Hi,

On Mon, May 23, 2016 at 03:40:36PM -0400, Joshua Kinard wrote:
> Might try some of those combinations and see if things improve on the Octeon?
> IP27 was equally affected by this, minus the bits about RAM and Impact Gfx.
> turning off THP, IP30 can run 64KB PAGE_SIZE without issue (compiles of
> packages is actually sped up quite significantly under >4KB PAGE_SIZE).

I think with 64KB page size, huge pages (512MB) are never allocated
unless you have insane amounts of memory? Today I tried some builds
with 64KB pages on a 4GB system and it was stable, but AnonHugePages
also stayed constantly at zero. With 4KB pages it changes frequently,
and the system crashes in minutes.
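
A minimal sketch of watching that counter, assuming the usual
"AnonHugePages:  <n> kB" line in /proc/meminfo. The function names are made
up for this illustration.

```python
def anon_huge_pages_kb(meminfo_text):
    """Return the AnonHugePages value in kB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("AnonHugePages:"):
            # Line format: "AnonHugePages:      2048 kB"
            return int(line.split()[1])
    return None


def read_anon_huge_pages_kb(path="/proc/meminfo"):
    """Read the current AnonHugePages value from the running kernel."""
    with open(path) as f:
        return anon_huge_pages_kb(f.read())
```

Polling read_anon_huge_pages_kb() during a build shows whether any
transparent huge pages are in use at all; a value that never leaves zero
means the THP paths are probably not being exercised.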

A.

* Re: THP broken on OCTEON?
  2016-05-24 21:21         ` Aaro Koskinen
@ 2016-05-24 22:39           ` David Daney
  0 siblings, 0 replies; 28+ messages in thread
From: David Daney @ 2016-05-24 22:39 UTC
  To: Aaro Koskinen; +Cc: Joshua Kinard, Ralf Baechle, linux-mips

On 05/24/2016 02:21 PM, Aaro Koskinen wrote:
> Hi,
>
> On Mon, May 23, 2016 at 03:40:36PM -0400, Joshua Kinard wrote:
>> Might try some of those combinations and see if things improve on the Octeon?
>> IP27 was equally affected by this, minus the bits about RAM and Impact Gfx.
>> turning off THP, IP30 can run 64KB PAGE_SIZE without issue (compiles of
>> packages is actually sped up quite significantly under >4KB PAGE_SIZE).
>
> I think with 64KB page size, huge pages (512MB) are never allocated
> unless you have insane amounts of memory?

Yes, you need a lot of memory, but more importantly the replacement of 
pages with a THP cannot happen unless there are 512MB aligned chunks of 
VMAs, and that is rare.
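
To illustrate the alignment point, the sketch below counts how many
naturally aligned huge-page-sized chunks fit inside a virtual address
range. The function name and the example addresses are made up for this
illustration; it models only the alignment constraint, not the rest of the
THP eligibility checks.

```python
MB = 1024 * 1024


def aligned_huge_slots(vma_start, vma_end, huge_size):
    """Count huge_size-aligned chunks fully contained in [vma_start, vma_end)."""
    first = -(-vma_start // huge_size) * huge_size  # round start up
    last = (vma_end // huge_size) * huge_size       # round end down
    return max(0, (last - first) // huge_size)


# Example: an 8 MB mapping starting at 0x120000000 (hypothetical address).
start, end = 0x120000000, 0x120000000 + 8 * MB
slots_4k_pages = aligned_huge_slots(start, end, 2 * MB)     # 2 MB huge pages
slots_64k_pages = aligned_huge_slots(start, end, 512 * MB)  # 512 MB huge pages
```

The same 8 MB mapping offers four candidate slots with 2 MB huge pages and
none with 512 MB huge pages, which fits the observation that AnonHugePages
stays at zero with 64KB pages.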

> I tried today some builds
> with 64KB pages on 4GB system and it was stable, but also AnonHugePages
> stayed constantly at zero. But with 4KB pages it is frequently changing,
> and crashes in minutes.
>
> A.
>
>
>

* Re: THP broken on OCTEON?
  2016-05-23 15:13 THP broken on OCTEON? Aaro Koskinen
  2016-05-23 15:20 ` Ralf Baechle
@ 2016-05-25 13:41 ` Aaro Koskinen
  2016-05-26  9:33   ` Joshua Kinard
  2016-05-26 17:59   ` David Daney
  2016-06-22 22:05 ` David Daney
  2 siblings, 2 replies; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-25 13:41 UTC
  To: Ralf Baechle, linux-mips; +Cc: David Daney

Hi,

On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> I'm getting kernel crashes (see below) reliably when building Perl in
> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> Linux 4.6.
> 
> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> issue - disabling it makes build go through fine.

Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
happened during Perl's Configure script on the first try:

...

killpg() found.
 
lchown() found.
 
LDBL_DIG found.
 
<math.h> found.
 
Checking to see if your libm supports _LIB_VERSION...
[ 1180.488704] Data bus error, epc == 000000fff4c0ae10, ra == 000000fff4df5d3c
[ 1180.650437] Unhandled kernel unaligned access[#1]:
[ 1180.651021] CPU: 0 PID: 3213 Comm: ld Not tainted 4.6.0-mipsqemu-distro.git-v2.16-3-g8f2e042-dirty-00002-g97bf1a1 #1
[ 1180.651619] task: 98000000fdc6e300 ti: 98000000fdd98000 task.ti: 98000000fdd98000
[ 1180.652049] $ 0   : 0000000000000000 ffffffff8021b2c8 9800000001000600 00000000f1a005bf
[ 1180.652928] $ 4   : 00000000f1a005bf 0000000120200000 00000000000f1a00 0000000000100077
[ 1180.653417] $ 8   : 000000000000001c 98000000fdd9ba60 98000000fdd9ba68 0000000000000000
[ 1180.653852] $12   : 98000000fdd9ba58 000000000000a400 0000000000000000 0000000000000000
[ 1180.654309] $16   : 0000000120200000 0000000120200000 0000000120200000 98000000fdcfd500
[ 1180.654764] $20   : 0000000000000000 ffffffff80e10000 0000000000000003 00000001206f5000
[ 1180.655220] $24   : 0000000000000000 ffffffff801629d0                                  
[ 1180.655725] $28   : 98000000fdd98000 98000000fdd9ba20 0000000000000000 ffffffff8021b2c8
[ 1180.656219] Hi    : 00000000002d4e00
[ 1180.656453] Lo    : 00000000000f1a00
[ 1180.657115] epc   : ffffffff8012c990 r4k_flush_cache_page+0x80/0x4f0
[ 1180.657529] ra    : ffffffff8021b2c8 get_dump_page+0x90/0xb8
[ 1180.657809] Status: 1400a4e3	KX SX UX KERNEL EXL IE 
[ 1180.658268] Cause : 00800010 (ExcCode 04)
[ 1180.658500] BadVA : 00000000f1a005bf
[ 1180.658703] PrId  : 000182a0 (MIPS 20Kc)
[ 1180.658931] Modules linked in: autofs4
[ 1180.659360] Process ld (pid: 3213, threadinfo=98000000fdd98000, task=98000000fdc6e300, tls=000000fff4eba700)
[ 1180.659821] Stack : 0000000120200000 98000000fdee2480 0000000120200000 98000000fdee2a80
	  0000000000000000 98000000fdc95580 0000000000000003 ffffffff8021b2c8
	  98000000044db600 98000000fdc95580 98000000fe602100 ffffffff802ad418
	  98000000fe636800 ffffffff806c0188 0000000300000088 98000000fe602300
	  ffffffff806c0188 5349474900000080 98000000fdd9bae8 ffffffff806c0188
	  0000000600000120 98000000fdcfd630 ffffffff806c0188 46494c45000004c7
	  c000000000171000 0000000a00000080 0000000000000000 0000000000000000
	  0000000000000000 0000000000000000 0000000000000000 0000000000000000
	  0000000000000000 0000000000000000 0000000000000000 0000000000000000
	  0000000000000000 0000000000000000 0000000000000000 0000000000000000
	  ...
[ 1180.664677] Call Trace:
[ 1180.665054] [<ffffffff8012c990>] r4k_flush_cache_page+0x80/0x4f0
[ 1180.666422] [<ffffffff8021b2c8>] get_dump_page+0x90/0xb8
[ 1180.667099] [<ffffffff802ad418>] elf_core_dump+0x11a0/0x1350
[ 1180.667813] [<ffffffff802b1b30>] do_coredump+0x5a0/0xdf0
[ 1180.668500] [<ffffffff80143bfc>] get_signal+0x2bc/0x688
[ 1180.669172] [<ffffffff8010a874>] do_signal+0x24/0x1e8
[ 1180.669808] [<ffffffff8010b740>] do_notify_resume+0xa8/0xc8
[ 1180.670504] [<ffffffff80105d00>] work_notifysig+0x10/0x18
[ 1180.671514] 
[ 1180.671806] 
Code: 00111a7a  30630ff8  0064182d <dc630000> 30640001  1080005b  00000000  df840000  dc8402b8 
[ 1180.674251] ---[ end trace b03bb9be4922a576 ]---
[ 1180.675178] Fatal exception: panic in 5 seconds
[ 1185.681555] Kernel panic - not syncing: Fatal exception
[ 1185.682849] ---[ end Kernel panic - not syncing: Fatal exception

Used kernel config:

CONFIG_MIPS_MALTA=y
CONFIG_CPU_MIPS64_R1=y
CONFIG_64BIT=y
# CONFIG_BOUNCE is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_HZ_100=y
CONFIG_KEXEC=y
# CONFIG_SECCOMP is not set
CONFIG_LOCALVERSION="-mipsqemu"
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_AUDIT=y
CONFIG_NO_HZ_IDLE=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_NAMESPACES=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_KALLSYMS_ALL=y
CONFIG_EMBEDDED=y
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_PARTITION_ADVANCED=y
# CONFIG_EFI_PARTITION is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_PCI=y
CONFIG_MIPS32_O32=y
CONFIG_MIPS32_N32=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_MD5SIG=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_GRE=m
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_NETFILTER=y
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_NAT=m
CONFIG_IP6_NF_TARGET_MASQUERADE=m
CONFIG_IP6_NF_TARGET_NPT=m
CONFIG_VLAN_8021Q=m
# CONFIG_WIRELESS is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_DEVTMPFS=y
# CONFIG_FW_LOADER is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_DRBD=m
CONFIG_VIRTIO_BLK=y
CONFIG_BLK_DEV_SD=y
# CONFIG_SCSI_LOWLEVEL is not set
CONFIG_ATA=y
CONFIG_ATA_PIIX=y
CONFIG_MD=y
CONFIG_BLK_DEV_DM=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_VXLAN=m
CONFIG_NETCONSOLE=m
CONFIG_VIRTIO_NET=y
# CONFIG_ETHERNET is not set
# CONFIG_WLAN is not set
# CONFIG_INPUT is not set
# CONFIG_SERIO is not set
# CONFIG_VT is not set
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_NR_UARTS=3
CONFIG_SERIAL_8250_RUNTIME_UARTS=3
# CONFIG_HW_RANDOM is not set
# CONFIG_HWMON is not set
# CONFIG_VGA_ARB is not set
CONFIG_VIRTIO_PCI=y
# CONFIG_IOMMU_SUPPORT is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_ISO9660_FS=m
CONFIG_VFAT_FS=m
CONFIG_PROC_KCORE=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NFS_FS=m
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=m
CONFIG_NFSD=m
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_9P_FS=y
CONFIG_PRINTK_TIME=y
CONFIG_DEBUG_INFO=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_SCHEDSTATS=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_CRYPTO_HW is not set

A.
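
For readers following the oops above, the Cause value, the marked instruction word in the "Code:" line, and BadVA can be cross-checked by hand. The sketch below only decodes the numbers printed in the log; the helper names are made up for illustration, and it relies on the standard MIPS I-type encoding (opcode in bits 31..26, base register in bits 25..21) and on ExcCode occupying bits 6..2 of Cause:

```python
def exc_code(cause: int) -> int:
    """ExcCode is bits 6..2 of the CP0 Cause register."""
    return (cause >> 2) & 0x1F


def decode_itype(word: int) -> dict:
    """Split a MIPS I-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,  # base register for loads/stores
        "rt": (word >> 16) & 0x1F,  # destination register for loads
        "imm": word & 0xFFFF,
    }


cause = 0x00800010         # "Cause :" line from the oops
insn = decode_itype(0xDC630000)  # word marked <...> in the "Code:" line
bad_va = 0x00000000F1A005BF      # "BadVA :" line from the oops

print("ExcCode:", exc_code(cause))        # 4 = address error on load
print("fields:", insn)                    # opcode 0x37 is LD (64-bit load)
print("BadVA mod 8:", bad_va % 8)         # non-zero: not 8-byte aligned
```

The faulting instruction decodes as a 64-bit load through $3, and the register dump shows $3 holding the same value as BadVA, so the "unaligned access" is a doubleword load from an address that is not a multiple of 8.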


* Re: THP broken on OCTEON?
  2016-05-25 13:41 ` Aaro Koskinen
@ 2016-05-26  9:33   ` Joshua Kinard
  2016-05-26 13:36     ` Aaro Koskinen
  2016-05-26 17:59   ` David Daney
  1 sibling, 1 reply; 28+ messages in thread
From: Joshua Kinard @ 2016-05-26  9:33 UTC (permalink / raw)
  To: Aaro Koskinen, Ralf Baechle, linux-mips; +Cc: David Daney

On 05/25/2016 09:41, Aaro Koskinen wrote:
> Hi,
> 
> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>> I'm getting kernel crashes (see below) reliably when building Perl in
>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>> Linux 4.6.
>>
>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>> issue - disabling it makes build go through fine.
> 
> Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
> happened during Perl's Configure script on the first try:
> 
> ...

What do you have for CONFIG_FORCE_MAX_ZONEORDER?  I don't see that in your
config.  Also, what do you have for CONFIG_PAGE_SIZE_*?



> killpg() found.
>  
> lchown() found.
>  
> LDBL_DIG found.
>  
> <math.h> found.
>  
> Checking to see if your libm supports _LIB_VERSION...
> [ 1180.488704] Data bus error, epc == 000000fff4c0ae10, ra == 000000fff4df5d3c
> [ 1180.650437] Unhandled kernel unaligned access[#1]:

Unhandled kernel unaligned access is one of the errors I got before under THP
on the IP27 as well.  This can't be coincidental.



[snip]

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


* Re: THP broken on OCTEON?
  2016-05-26  9:33   ` Joshua Kinard
@ 2016-05-26 13:36     ` Aaro Koskinen
  0 siblings, 0 replies; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-26 13:36 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: Ralf Baechle, linux-mips, David Daney

Hi,

On Thu, May 26, 2016 at 05:33:13AM -0400, Joshua Kinard wrote:
> On 05/25/2016 09:41, Aaro Koskinen wrote:
> > On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> >> I'm getting kernel crashes (see below) reliably when building Perl in
> >> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> >> Linux 4.6.
> >>
> >> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> >> issue - disabling it makes build go through fine.
> > 
> > Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
> > happened during Perl's Configure script on the first try:
> 
> What do you have for CONFIG_FORCE_MAX_ZONEORDER?  I don't see that in your
> config?  Also, what do you have for CONFIG_PAGE_SIZE_*?

The config was generated with "savedefconfig", so those values are at
their defaults (4 KB page size, max zoneorder 11).
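
As a rough check of why these two values matter together: a huge page can only be allocated if its order is below the maximum zone order. The sketch below assumes 8-byte PTEs and the convention of kernels of this era that valid allocation orders run from 0 to max zoneorder minus 1; both function names are hypothetical.

```python
def max_block_bytes(page_size: int, max_zoneorder: int) -> int:
    """Largest buddy-allocator block: orders run from 0 to max_zoneorder - 1."""
    return page_size << (max_zoneorder - 1)


def thp_order(page_size: int, pte_size: int = 8) -> int:
    """Allocation order of one huge page (assumes 8-byte PTEs)."""
    ptes_per_page = page_size // pte_size
    return ptes_per_page.bit_length() - 1


page_size, zoneorder = 4096, 11  # the defaults reported above
print("largest block:", max_block_bytes(page_size, zoneorder) // (1024 * 1024), "MB")
print("THP order:", thp_order(page_size))
print("huge page fits:", thp_order(page_size) < zoneorder)
```

With the defaults reported here, the largest block is 4 MB and a huge page is order 9, so the zone order is not what prevents or breaks huge page allocation in this configuration.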

A.


* Re: THP broken on OCTEON?
  2016-05-25 13:41 ` Aaro Koskinen
  2016-05-26  9:33   ` Joshua Kinard
@ 2016-05-26 17:59   ` David Daney
  2016-05-26 19:23     ` Aaro Koskinen
  1 sibling, 1 reply; 28+ messages in thread
From: David Daney @ 2016-05-26 17:59 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: Ralf Baechle, linux-mips

On 05/25/2016 06:41 AM, Aaro Koskinen wrote:
> Hi,
>
> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>> I'm getting kernel crashes (see below) reliably when building Perl in
>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>> Linux 4.6.
>>
>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>> issue - disabling it makes build go through fine.
>
> Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
> happened during Perl's Configure script on the first try:
>

Are you sure this failure is THP related?

Also, what is the source of your rootfs (compilers, tools etc.)?

Will a similar config fail this way on OCTEON booted with numcores=1?






* Re: THP broken on OCTEON?
  2016-05-26 17:59   ` David Daney
@ 2016-05-26 19:23     ` Aaro Koskinen
  2016-05-26 22:13       ` David Daney
  0 siblings, 1 reply; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-26 19:23 UTC (permalink / raw)
  To: David Daney; +Cc: Aaro Koskinen, Ralf Baechle, linux-mips

Hi,

On Thu, May 26, 2016 at 10:59:50AM -0700, David Daney wrote:
> On 05/25/2016 06:41 AM, Aaro Koskinen wrote:
> >On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
> >>I'm getting kernel crashes (see below) reliably when building Perl in
> >>parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> >>Linux 4.6.
> >>
> >>It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> >>issue - disabling it makes build go through fine.
> >
> >Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
> >happened during Perl's Configure script on the first try:
> 
> Are you sure this failure is THP related?

We have used MIPS64 Malta QEMU for regular package builds for many months,
and it has never failed with THP disabled. With THP enabled, builds
fail reliably.

> Also, what is the source of your rootfs (compilers, tools etc.)?

Compiler is mainline GCC 4.9.3 and binutils 2.26. The whole rootfs is
compiled from scratch (64-bit ABI).

On ER Pro, I use O32 userspace and kernel compiled with GCC 6.1.0.

> Will a similar config fail this way on OCTEON booted with numcores=1?

Yes, just tested with ER Pro using a single core, and also with just "make"
without any parallel jobs. And it failed with SIGSEGV this time:

[  744.268063] 
do_page_fault(): sending SIGSEGV to miniperl for invalid read access from 000000000000000c
[  744.277418] epc = 00000000004ca8e8 in miniperl[400000+19e000]
[  744.283202] ra  = 00000000004c9e8c in miniperl[400000+19e000]
[  744.289005] 

A.
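
The "epc" and "ra" lines in the SIGSEGV report above give an address together with the mapping it falls in, printed as name[base+size]. A short sketch that extracts the offset into the binary, which is the number a symbol lookup tool would need for a position-dependent executable; the function name and regular expression are illustrative assumptions based only on the two lines shown:

```python
import re

LINE = "epc = 00000000004ca8e8 in miniperl[400000+19e000]"


def mapping_offset(line: str):
    """Return (binary name, offset of the address within its mapping)."""
    m = re.search(r"= ([0-9a-f]+) in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", line)
    if m is None:
        raise ValueError("unrecognised line format")
    addr, name = int(m.group(1), 16), m.group(2)
    base, size = int(m.group(3), 16), int(m.group(4), 16)
    if not base <= addr < base + size:
        raise ValueError("address lies outside the reported mapping")
    return name, addr - base


name, off = mapping_offset(LINE)
print(name, hex(off))
```

The faulting data address in the report, 0xc, is a small offset from zero, which is what a field access through a NULL pointer typically looks like.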


* Re: THP broken on OCTEON?
  2016-05-26 19:23     ` Aaro Koskinen
@ 2016-05-26 22:13       ` David Daney
  2016-05-27 17:14         ` Aaro Koskinen
  0 siblings, 1 reply; 28+ messages in thread
From: David Daney @ 2016-05-26 22:13 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: Aaro Koskinen, Ralf Baechle, linux-mips

On 05/26/2016 12:23 PM, Aaro Koskinen wrote:
> Hi,
>
> On Thu, May 26, 2016 at 10:59:50AM -0700, David Daney wrote:
>> On 05/25/2016 06:41 AM, Aaro Koskinen wrote:
>>> On Mon, May 23, 2016 at 06:13:46PM +0300, Aaro Koskinen wrote:
>>>> I'm getting kernel crashes (see below) reliably when building Perl in
>>>> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
>>>> Linux 4.6.
>>>>
>>>> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
>>>> issue - disabling it makes build go through fine.
>>>
>>> Seems to be also reproducible on MIPS64/Malta/QEMU (UP, 2 GB RAM). This
>>> happened during Perl's Configure script on the first try:
>>
>> Are you sure this failure is THP related?
>
> We have used MIPS64 Malta QEMU for regular package builds many months,
> and it has never failed with THP disabled. With THP enabled builds
> fail reliably.
>
>> Also, what is the source of your rootfs (compilers, tools etc.)?
>
> Compiler is mainline GCC 4.9.3 and binutils 2.26. The whole rootfs is
> compiled from scratch (64-bit ABI).
>
> On ER Pro, I use O32 userspace and kernel compiled with GCC 6.1.0.
>
>> Will a similar config fail this way on OCTEON booted with numcores=1?
>
> Yes, just tested with ER Pro using single core, and also with just "make"
> without any parallel threads. And it failed with SIGSEGV this time:

Is it possible for you to create a root file system that substitutes 
/sbin/init with a script that does the minimal amount of system 
initialization and then runs the "make" command?

The idea being that the system without any user input of any kind would 
boot directly to the failure case.  Ideally this would be for the single 
CPU case.

With an ext2 image of that and the vmlinux file, it would be child's 
play to run in our simulator and find the cause.



>
> [  744.268063]
> do_page_fault(): sending SIGSEGV to miniperl for invalid read access from 000000000000000c
> [  744.277418] epc = 00000000004ca8e8 in miniperl[400000+19e000]
> [  744.283202] ra  = 00000000004c9e8c in miniperl[400000+19e000]
> [  744.289005]
>
> A.
>
>


* Re: THP broken on OCTEON?
  2016-05-26 22:13       ` David Daney
@ 2016-05-27 17:14         ` Aaro Koskinen
  2016-05-27 21:03           ` Joshua Kinard
  0 siblings, 1 reply; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-27 17:14 UTC (permalink / raw)
  To: David Daney; +Cc: Aaro Koskinen, Ralf Baechle, linux-mips

Hi,

On Thu, May 26, 2016 at 03:13:34PM -0700, David Daney wrote:
> On 05/26/2016 12:23 PM, Aaro Koskinen wrote:
> >Yes, just tested with ER Pro using single core, and also with just "make"
> >without any parallel threads. And it failed with SIGSEGV this time:
> 
> Is it possible for you to create a root file system that substitutes
> /sbin/init with a script that does the minimal amount of system
> initialization and then runs the "make" command?
> 
> The idea being that the system without any user input of any kind would boot
> directly to the failure case.  Ideally this would be for the single CPU
> case.
> 
> With an ext2 image of that and the vmlinux file, it would be child's play to
> run in our simulator and find the cause.

OK, I have prepared such image and will send you the URL off-list.

One thing, however, is that you will probably need 2 GB RAM, and I think
the mainline kernel limits the memory to 64 MB on simulator?

When testing my image, I came across a new type of failure. This time
the kernel reported a bug:

[ 1918.046230] BUG: Bad page state in process khugepaged  pfn:6896e
[ 1918.052256] page:8000000005000010 count:-1610612736 mapcount:1 mapping:          (null) index:0x0
[ 1918.061189] flags: 0x0()
[ 1918.063735] page dumped because: nonzero _count
[ 1918.068271] Modules linked in: broadcom bcm_phy_lib
[ 1918.073191] CPU: 0 PID: 124 Comm: khugepaged Not tainted 4.6.0-octeon-los_a568fa0+ #1
[ 1918.081030] Stack : ffffffff81790000 ffffffff81784d48 0000000000000000 ffffffff82070000
	  00000000000003e0 0044b82fa09b0000 ffffffff8206efd0 0000000000000004
	  0000000000000049 ffffffff81790000 0000000000000000 0000000000000000
	  0000000000000049 0000000000000006 0000000000000000 ffffffff8118a588
	  0000000000000000 ffffffff82070000 0000000000000000 ffffffff82060000
	  ffffffff8178ad07 ffffffff816dd238 800000008ec2ce00 0000000000000000
	  000000000000007c ffffffff82066878 ffffffff817b8a00 0000000000000001
	  8000000004ffb000 800000008f557850 800000008f557968 ffffffff813648d4
	  ffffffff817b8a00 ffffffff8118b678 000000000000009e ffffffff816dd238
	  0000000000000000 ffffffff811242f8 0000000000000000 0000000000000000
	  ...
[ 1918.146630] Call Trace:
[ 1918.149089] [<ffffffff811242f8>] show_stack+0x88/0xa8
[ 1918.154154] [<ffffffff813648d4>] dump_stack+0x94/0xd0
[ 1918.159213] [<ffffffff811d9dbc>] bad_page+0x134/0x1a0
[ 1918.164273] [<ffffffff811ddd90>] get_page_from_freelist+0x3b0/0xac8
[ 1918.170550] [<ffffffff811de79c>] __alloc_pages_nodemask+0xcc/0x998
[ 1918.176740] [<ffffffff8122c274>] khugepaged+0x7e4/0x1878
[ 1918.182063] [<ffffffff81162284>] kthread+0xd4/0xf0
[ 1918.186861] [<ffffffff8111ec18>] ret_from_kernel_thread+0x14/0x1c
[ 1918.192961] 
[ 1918.194465] Disabling lock debugging due to kernel taint

However, the build went through apparently OK.
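
The "count:-1610612736" in the bad page report is easier to read as a bit pattern than as a reference count. A minimal sketch that reinterprets it as an unsigned 32-bit value and converts the reported pfn to a physical address, assuming the 4 KB page size discussed in this thread (helper names are hypothetical):

```python
def as_u32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


def pfn_to_phys(pfn: int, page_shift: int = 12) -> int:
    """Physical address of a page frame (assumes 4 KB pages: shift of 12)."""
    return pfn << page_shift


print(hex(as_u32(-1610612736)))   # 0xa0000000
print(hex(pfn_to_phys(0x6896e)))  # 0x6896e000
```

A count of 0xa0000000 does not look like a plausible number of references, which hints that the field was overwritten rather than incremented or decremented incorrectly; that reading is an inference from the number alone, not something the log states.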

Much later, Perl got stuck in the test suite. SysRq showed it was
apparently busy-looping in the page fault handler:

[ 8567.431460] perl            R  running task        0 19437  19408 0x08d00000
[ 8567.438538] Stack : ffffffff816e0fdd ffffffff8163bed8 ffffffff816e0fdc ffffffff816e1cf8
	  ffffffff82070000 0000000000000001 ffffffff820682e0 ffffffff82070000
	  ffffffff81790000 ffffffff81784d48 0000000000000000 ffffffff82070000
	  0000000000000000 0044b82fa09b0000 ffffffff8206efd0 0000000000000004
	  000000000000001d ffffffff81790000 0000000000000000 0000000000000000
	  000000000000001d 0000000000000002 0000000000000000 ffffffff8118a588
	  0000000000000000 0000000000000001 0000000000000000 ffffffff8206a6f8
	  800000008eca0000 800000008ebb76f0 800000008eca02e0 ffffffff8116d568
	  ffffffff8178ae80 ffffffff817a0000 ffffffff8179fd00 ffffffff817c0000
	  0000000000000000 ffffffff811242f8 0000000000000000 0000000000000000
	  ...
[ 8567.504144] Call Trace:
[ 8567.506596] [<ffffffff811242f8>] show_stack+0x88/0xa8
[ 8567.511656] [<ffffffff8116d568>] show_state_filter+0x88/0xc8
[ 8567.517325] [<ffffffff813cda28>] sysrq_handle_showstate+0x10/0x20
[ 8567.523427] [<ffffffff813cde60>] __handle_sysrq+0xb8/0x1f0
[ 8567.528921] [<ffffffff813d4378>] serial8250_rx_chars+0x110/0x228
[ 8567.534938] [<ffffffff813d7550>] serial8250_handle_irq.part.14+0x80/0x110
[ 8567.541736] [<ffffffff813db460>] dw8250_handle_irq+0x38/0xb0
[ 8567.547405] [<ffffffff813d2da8>] serial8250_interrupt+0x60/0xf8
[ 8567.553335] [<ffffffff8118c1f8>] handle_irq_event_percpu+0x78/0x1b8
[ 8567.559611] [<ffffffff8118c390>] handle_irq_event+0x58/0x98
[ 8567.565194] [<ffffffff81190074>] handle_level_irq+0xd4/0x198
[ 8567.570862] [<ffffffff8118b780>] generic_handle_irq+0x40/0x58
[ 8567.576617] [<ffffffff81120b70>] do_IRQ+0x18/0x28
[ 8567.581329] [<ffffffff811052bc>] plat_irq_dispatch+0xc4/0x130
[ 8567.587083] [<ffffffff8111ebd0>] ret_from_irq+0x0/0x4
[ 8567.592143] [<ffffffff8160eed0>] _raw_spin_lock+0x18/0x30
[ 8567.597550] [<ffffffff8122d8f0>] huge_pmd_set_accessed+0x40/0xd8
[ 8567.603566] [<ffffffff81205220>] handle_mm_fault+0x110/0x1bb0
[ 8567.609320] [<ffffffff811344b8>] __do_page_fault+0x158/0x508
[ 8567.614988] [<ffffffff8111ebc0>] ret_from_exception+0x0/0x10

A.


* Re: THP broken on OCTEON?
  2016-05-27 17:14         ` Aaro Koskinen
@ 2016-05-27 21:03           ` Joshua Kinard
  2016-05-27 22:05             ` Aaro Koskinen
  0 siblings, 1 reply; 28+ messages in thread
From: Joshua Kinard @ 2016-05-27 21:03 UTC (permalink / raw)
  To: Aaro Koskinen, David Daney; +Cc: Aaro Koskinen, Ralf Baechle, linux-mips

On 05/27/2016 13:14, Aaro Koskinen wrote:
> Hi,
> 
> On Thu, May 26, 2016 at 03:13:34PM -0700, David Daney wrote:
>> On 05/26/2016 12:23 PM, Aaro Koskinen wrote:
>>> Yes, just tested with ER Pro using single core, and also with just "make"
>>> without any parallel threads. And it failed with SIGSEGV this time:
>>
>> Is it possible for you to create a root file system that substitutes
>> /sbin/init with a script that does the minimal amount of system
>> initialization and then runs the "make" command?
>>
>> The idea being that the system without any user input of any kind would boot
>> directly to the failure case.  Ideally this would be for the single CPU
>> case.
>>
>> With an ext2 image of that and the vmlinux file, it would be child's play to
>> run in our simulator and find the cause.
> 
> OK, I have prepared such image and will send you the URL off-list.
> 
> One thing however is that you will probably need 2 GB RAM, and I think
> the mainline kernel limits the memory to 64 MB on simulator?

If the binaries on the initramfs are built to any of the MIPS-I to MIPS-IV
ISAs, I can test this on my IP27/Onyx2 system as well, though I'll have to
build an IP27 kernel and just use the initramfs.



> When testing my image, I came across a new type of failure. This time
> kernel reported a bug:
> 
> [ 1918.046230] BUG: Bad page state in process khugepaged  pfn:6896e
> [ 1918.052256] page:8000000005000010 count:-1610612736 mapcount:1 mapping:          (null) index:0x0
> [ 1918.061189] flags: 0x0()
> [ 1918.063735] page dumped because: nonzero _count
> [ 1918.068271] Modules linked in: broadcom bcm_phy_lib
> [ 1918.073191] CPU: 0 PID: 124 Comm: khugepaged Not tainted 4.6.0-octeon-los_a568fa0+ #1
> [ 1918.081030] Stack : ffffffff81790000 ffffffff81784d48 0000000000000000 ffffffff82070000
> 	  00000000000003e0 0044b82fa09b0000 ffffffff8206efd0 0000000000000004
> 	  0000000000000049 ffffffff81790000 0000000000000000 0000000000000000
> 	  0000000000000049 0000000000000006 0000000000000000 ffffffff8118a588
> 	  0000000000000000 ffffffff82070000 0000000000000000 ffffffff82060000
> 	  ffffffff8178ad07 ffffffff816dd238 800000008ec2ce00 0000000000000000
> 	  000000000000007c ffffffff82066878 ffffffff817b8a00 0000000000000001
> 	  8000000004ffb000 800000008f557850 800000008f557968 ffffffff813648d4
> 	  ffffffff817b8a00 ffffffff8118b678 000000000000009e ffffffff816dd238
> 	  0000000000000000 ffffffff811242f8 0000000000000000 0000000000000000
> 	  ...
> [ 1918.146630] Call Trace:
> [ 1918.149089] [<ffffffff811242f8>] show_stack+0x88/0xa8
> [ 1918.154154] [<ffffffff813648d4>] dump_stack+0x94/0xd0
> [ 1918.159213] [<ffffffff811d9dbc>] bad_page+0x134/0x1a0
> [ 1918.164273] [<ffffffff811ddd90>] get_page_from_freelist+0x3b0/0xac8
> [ 1918.170550] [<ffffffff811de79c>] __alloc_pages_nodemask+0xcc/0x998
> [ 1918.176740] [<ffffffff8122c274>] khugepaged+0x7e4/0x1878
> [ 1918.182063] [<ffffffff81162284>] kthread+0xd4/0xf0
> [ 1918.186861] [<ffffffff8111ec18>] ret_from_kernel_thread+0x14/0x1c
> [ 1918.192961] 
> [ 1918.194465] Disabling lock debugging due to kernel taint
> 
> However, the build went through apparently OK.
> 
> Much later, Perl got stuck in the test suite. SysRq showed it was
> apparently busylooping in the page fault handler:
> 
> [ 8567.431460] perl            R  running task        0 19437  19408 0x08d00000
> [ 8567.438538] Stack : ffffffff816e0fdd ffffffff8163bed8 ffffffff816e0fdc ffffffff816e1cf8
> 	  ffffffff82070000 0000000000000001 ffffffff820682e0 ffffffff82070000
> 	  ffffffff81790000 ffffffff81784d48 0000000000000000 ffffffff82070000
> 	  0000000000000000 0044b82fa09b0000 ffffffff8206efd0 0000000000000004
> 	  000000000000001d ffffffff81790000 0000000000000000 0000000000000000
> 	  000000000000001d 0000000000000002 0000000000000000 ffffffff8118a588
> 	  0000000000000000 0000000000000001 0000000000000000 ffffffff8206a6f8
> 	  800000008eca0000 800000008ebb76f0 800000008eca02e0 ffffffff8116d568
> 	  ffffffff8178ae80 ffffffff817a0000 ffffffff8179fd00 ffffffff817c0000
> 	  0000000000000000 ffffffff811242f8 0000000000000000 0000000000000000
> 	  ...
> [ 8567.504144] Call Trace:
> [ 8567.506596] [<ffffffff811242f8>] show_stack+0x88/0xa8
> [ 8567.511656] [<ffffffff8116d568>] show_state_filter+0x88/0xc8
> [ 8567.517325] [<ffffffff813cda28>] sysrq_handle_showstate+0x10/0x20
> [ 8567.523427] [<ffffffff813cde60>] __handle_sysrq+0xb8/0x1f0
> [ 8567.528921] [<ffffffff813d4378>] serial8250_rx_chars+0x110/0x228
> [ 8567.534938] [<ffffffff813d7550>] serial8250_handle_irq.part.14+0x80/0x110
> [ 8567.541736] [<ffffffff813db460>] dw8250_handle_irq+0x38/0xb0
> [ 8567.547405] [<ffffffff813d2da8>] serial8250_interrupt+0x60/0xf8
> [ 8567.553335] [<ffffffff8118c1f8>] handle_irq_event_percpu+0x78/0x1b8
> [ 8567.559611] [<ffffffff8118c390>] handle_irq_event+0x58/0x98
> [ 8567.565194] [<ffffffff81190074>] handle_level_irq+0xd4/0x198
> [ 8567.570862] [<ffffffff8118b780>] generic_handle_irq+0x40/0x58
> [ 8567.576617] [<ffffffff81120b70>] do_IRQ+0x18/0x28
> [ 8567.581329] [<ffffffff811052bc>] plat_irq_dispatch+0xc4/0x130
> [ 8567.587083] [<ffffffff8111ebd0>] ret_from_irq+0x0/0x4
> [ 8567.592143] [<ffffffff8160eed0>] _raw_spin_lock+0x18/0x30
> [ 8567.597550] [<ffffffff8122d8f0>] huge_pmd_set_accessed+0x40/0xd8
> [ 8567.603566] [<ffffffff81205220>] handle_mm_fault+0x110/0x1bb0
> [ 8567.609320] [<ffffffff811344b8>] __do_page_fault+0x158/0x508
> [ 8567.614988] [<ffffffff8111ebc0>] ret_from_exception+0x0/0x10
> 
> A.
> 
> 


-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: THP broken on OCTEON?
  2016-05-27 21:03           ` Joshua Kinard
@ 2016-05-27 22:05             ` Aaro Koskinen
  2016-05-27 22:22               ` Joshua Kinard
  0 siblings, 1 reply; 28+ messages in thread
From: Aaro Koskinen @ 2016-05-27 22:05 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: David Daney, Aaro Koskinen, Ralf Baechle, linux-mips

Hi,

On Fri, May 27, 2016 at 05:03:06PM -0400, Joshua Kinard wrote:
> If the binaries on the initramfs are built to any of the MIPS-I to MIPS-IV
> ISAs, I can test this on my IP27/Onyx2 system as well, though I'll have to
> build an IP27 kernel and just use the initramfs.

I built them using --with-arch=octeon+ so they won't work on other HW.

But there isn't any magic in my binaries. If you have a working 64-bit
Linux MIPS system (with GCC and make) you can easily try my test case:

- compile Linux 4.6 with THP (always enabled) and 4KB page size

	(preferably using GCC >= 4.9.3)

- boot with the new kernel & log in

- execute the following commands:

	curl -O http://www.cpan.org/src/5.0/perl-5.22.2.tar.gz
	tar xf perl-5.22.2.tar.gz
	cd perl-5.22.2
	sh Configure -de -Dprefix=/usr -Dcc=gcc && make && make test

If this passes without odd crashes or hangs (which I highly doubt),
please post the output of:

	grep thp /proc/vmstat

A.
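
The first step of the test case above asks for THP to be "always enabled". As a minimal sketch, assuming the running kernel exposes the usual sysfs file /sys/kernel/mm/transparent_hugepage/enabled (which marks the active mode with brackets), the mode could be confirmed before starting the build. The function names thp_mode and thp_is_always are hypothetical helpers, not part of the original message:

```shell
# Print the active THP mode from the contents of the sysfs "enabled" file,
# which marks the current choice with brackets, e.g. "[always] madvise never".
# $1 is the path of the file to read.
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$1"
}

# Hypothetical wrapper: succeed only if the active mode is "always".
thp_is_always() {
    [ "$(thp_mode "$1")" = "always" ]
}

# Usage on a live system:
#   thp_is_always /sys/kernel/mm/transparent_hugepage/enabled \
#       || echo "THP is not in 'always' mode"
```

Taking the path as an argument keeps the helpers usable on a saved copy of the file as well as on the live system.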

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: THP broken on OCTEON?
  2016-05-27 22:05             ` Aaro Koskinen
@ 2016-05-27 22:22               ` Joshua Kinard
  0 siblings, 0 replies; 28+ messages in thread
From: Joshua Kinard @ 2016-05-27 22:22 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: David Daney, Aaro Koskinen, Ralf Baechle, linux-mips

On 05/27/2016 18:05, Aaro Koskinen wrote:
> Hi,
> 
> On Fri, May 27, 2016 at 05:03:06PM -0400, Joshua Kinard wrote:
>> If the binaries on the initramfs are built to any of the MIPS-I to MIPS-IV
>> ISAs, I can test this on my IP27/Onyx2 system as well, though I'll have to
>> build an IP27 kernel and just use the initramfs.
> 
> I built them using --with-arch=octeon+ so they won't work on other HW.
> 
> But there isn't any magic in my binaries. If you have a working 64-bit
> Linux MIPS system (with GCC and make) you can easily try my test case:
> 
> - compile Linux 4.6 with THP (always enabled) and 4KB page size
> 
> 	(preferably using GCC >= 4.9.3)
> 
> - boot with the new kernel & log in
> 
> - execute the following commands:
> 
> 	curl -O http://www.cpan.org/src/5.0/perl-5.22.2.tar.gz
> 	tar xf perl-5.22.2.tar.gz
> 	cd perl-5.22.2
> 	sh Configure -de -Dprefix=/usr -Dcc=gcc && make && make test
> 
> If this passes without odd crashes or hangs (which I highly doubt),
> please post the output of:
> 
> 	grep thp /proc/vmstat
> 

Perl-5.22 is already built on this platform (it's Gentoo), but 5.24 is out, so
I can run that build w/ THP on.  I doubt it will get that far, though.
Assuming it even survives the boot process and gets to runlevel 3, then just
running "emerge" (Gentoo's default package manager) and letting it calculate
dependencies usually trips things up.  The whole system dies in that instance,
but maybe if I boot to single-user and dork around, I can trigger a bus error
or five and still have enough time to grep /proc/vmstat...

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: THP broken on OCTEON?
  2016-05-23 15:13 THP broken on OCTEON? Aaro Koskinen
  2016-05-23 15:20 ` Ralf Baechle
  2016-05-25 13:41 ` Aaro Koskinen
@ 2016-06-22 22:05 ` David Daney
  2016-06-23 12:08     ` Aaro Koskinen
  2 siblings, 1 reply; 28+ messages in thread
From: David Daney @ 2016-06-22 22:05 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: Ralf Baechle, linux-mips

This is caused by a config bug.

For THP to work you must have both:

CONFIG_TRANSPARENT_HUGEPAGE=y
and
CONFIG_HUGETLBFS=y

Please try testing with both of those set as well as applying:

https://www.linux-mips.org/archives/linux-mips/2016-06/msg00397.html

I will look into either a Kconfig fix, or fixing the code that currently 
depends on CONFIG_HUGETLBFS, but is needed for all huge pages.

The faults I saw are caused by:

    #define pmd_huge(x)	0

In include/linux/hugetlb.h

Really we need to replace all occurrences of pmd_huge() under arch/mips
with something like pte_huge(), but I don't know if that is sufficient.
There may be other things gated by CONFIG_HUGETLBFS that I didn't see.

David.
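
The requirement that both options be set can be checked against a kernel .config before building. This is only a sketch: the function name thp_config_ok is hypothetical, and it assumes both options are built in with "=y" exactly as written in the message above:

```shell
# Succeed only if both options named in the message are set to "y" in the
# given kernel .config file ($1). The function name is hypothetical.
thp_config_ok() {
    grep -q '^CONFIG_TRANSPARENT_HUGEPAGE=y$' "$1" &&
        grep -q '^CONFIG_HUGETLBFS=y$' "$1"
}

# Usage from the top of a configured kernel tree:
#   thp_config_ok .config \
#       || echo "need both CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_HUGETLBFS=y"
```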

On 05/23/2016 08:13 AM, Aaro Koskinen wrote:
> Hi,
>
> I'm getting kernel crashes (see below) reliably when building Perl in
> parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> Linux 4.6.
>
> It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> issue - disabling it makes build go through fine.
>
> Any ideas?
>
> A.
>
> [ 2457.467155] Got mcheck at 00000001200a82b4
> [ 2457.479447] CPU: 6 PID: 15916 Comm: lib/unicore/mkt Not tainted 4.6.0-octeon-distro.git-v2.16-1-gfc3b10e-dirty-00001-g16a7aa0 #1
> [ 2457.514121] task: 80000000eccf2b80 ti: 80000000ecda4000 task.ti: 80000000ecda4000
> [ 2457.536551] $ 0   : 0000000000000000 3e000000105bc006 0000000000000000 ffffffff957e4728
> [ 2457.560686] $ 4   : 00000000000000f2 0000000000000067 000000012015e8ab 00000000332295cf
> [ 2457.584822] $ 8   : 0000000000000000 0000000000000000 0000000000000001 0000000000000003
> [ 2457.608957] $12   : 00000001204e04d8 0000000000000008 0000000000000001 ffffffffffffffff
> [ 2457.633093] $16   : 0000000120383d60 00000001203a3828 00000000332295cf 000000000000000b
> [ 2457.657228] $20   : 000000012015e8a0 0000000000000000 000000000000000c 0000000000000000
> [ 2457.681363] $24   : 0000000000000010 00000001200a80e8
> [ 2457.705496] $28   : 00000001201a0300 000000ffffda82a0 000000012019b9b8 0000000120039f5c
> [ 2457.729631] Hi    : 0000000000000000
> [ 2457.740341] Lo    : 0000000000000008
> [ 2457.751055] epc   : 00000001200a82b4 0x1200a82b4
> [ 2457.764891] ra    : 0000000120039f5c 0x120039f5c
> [ 2457.778726] Status: 00308cf3	KX SX UX USER EXL IE
> [ 2457.793284] Cause : 00800060 (ExcCode 18)
> [ 2457.805296] PrId  : 000d0409 (Cavium Octeon+)
> [ 2457.818350] Index    : 80000000
> [ 2457.827759] PageMask : 1fe000
> [ 2457.836646] EntryHi  : 00000001203820f4
> [ 2457.848136] EntryLo0 : 00000000105b8006
> [ 2457.859628] EntryLo1 : 00000000105bc006
> [ 2457.871119] Wired    : 0
> [ 2457.878704] PageGrain: e0000000
> [ 2457.888111]
> [ 2457.892573] Index: 25 pgmask=4kb va=00120456000 asid=f4
> [ 2457.908256] 	[ri=0 xi=0 pa=000e47d3000 c=0 d=1 v=1 g=0] [ri=0 xi=0 pa=000c31bc000 c=0 d=1 v=1 g=0]
> [ 2457.935230] Index: 26 pgmask=4kb va=001200a8000 asid=f4
> [ 2457.950915] 	[ri=0 xi=0 pa=000e0e1c000 c=0 d=0 v=1 g=0] [ri=0 xi=0 pa=000c50ed000 c=0 d=0 v=1 g=0]
> [ 2457.977888] Index: 27 pgmask=4kb va=001203a2000 asid=f4
> [ 2457.993574] 	[ri=0 xi=0 pa=00000000000 c=0 d=0 v=0 g=0] [ri=0 xi=1 pa=0009005a000 c=1 d=0 v=1 g=0]
> [ 2458.020548]
> [ 2458.025008]
> Code: de100000  1200001c  00000000 <de110008> 8e220000  1452fffa  00000000  8e220004  1453fff7
> [ 2458.054470] Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
> [ 2458.087614] ---[ end Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
> [ 2458.122835]
> do_page_fault(): sending SIGSEGV to make for invalid write access to 0000000000000012[ 2458.149565]
> [ 2458.149565] do_page_fault(): sending SIGSEGV to miniperl for invalid write access to 0000000000000010epc = 0000000120089500 in miniperl[120000000+181000]ra  = 00000001200c18a4 in miniperl[120000000+181000][ 2458.149590]
>
> [ 2458.212999] epc = 0000000120015400 in make[120000000+35000]
> [ 2458.229780] ra  = 000000ffeca7f570 in[ 2458.240797]
>
> *** NMI Watchdog interrupt on Core 0x0 ***
>
> A.
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: THP broken on OCTEON?
@ 2016-06-23 12:08     ` Aaro Koskinen
  0 siblings, 0 replies; 28+ messages in thread
From: Aaro Koskinen @ 2016-06-23 12:08 UTC (permalink / raw)
  To: David Daney; +Cc: Ralf Baechle, linux-mips, Kirill A. Shutemov

Hi,

On Wed, Jun 22, 2016 at 03:05:05PM -0700, David Daney wrote:
> This is caused by a config bug.
> 
> For THP to work you must have both:
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> and
> CONFIG_HUGETLBFS=y

Oh... I guess this is with MIPS only?

> Please try testing with both of those set as well as applying:
> 
> https://www.linux-mips.org/archives/linux-mips/2016-06/msg00397.html

Works! Now the system is stable. The EBH5600 built a dozen different packages
without any issues, with THP in use:

root@localhost:~$ grep thp /proc/vmstat 
thp_fault_alloc 2271
thp_fault_fallback 0
thp_collapse_alloc 2049
thp_collapse_alloc_failed 0
thp_split_page 0
thp_split_page_failed 0
thp_deferred_split_page 3996
thp_split_pmd 186
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0

Thanks a lot,

A. 
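
The counters above show no failed or fallback allocations. A minimal sketch of checking that mechanically follows; the function name thp_failures is hypothetical, and it assumes the "name value" line format shown in the output above:

```shell
# Print THP counters whose names end in "_failed" or "_fallback" and whose
# value is nonzero. Exit status is 0 only when no such counter is found.
# $1 is a file in /proc/vmstat format ("name value" per line).
thp_failures() {
    awk '/^thp_.*(_failed|_fallback) / && $2 != 0 { print; bad = 1 }
         END { exit bad + 0 }' "$1"
}

# Usage on a live system:
#   thp_failures /proc/vmstat && echo "no THP allocation failures recorded"
```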

> I will look into either a Kconfig fix, or fixing the code that currently
> depends on CONFIG_HUGETLBFS, but is needed for all huge pages.
> 
> The faults I saw are caused by:
> 
>    #define pmd_huge(x)	0
> 
> In include/linux/hugetlb.h
> 
> Really we need to replace all occurrences of pmd_huge() under arch/mips with
> something like pte_huge(), but I don't know if that is sufficient.  There
> may be other things gated by CONFIG_HUGETLBFS that I didn't see.
> 
> David.
> 
> On 05/23/2016 08:13 AM, Aaro Koskinen wrote:
> >Hi,
> >
> >I'm getting kernel crashes (see below) reliably when building Perl in
> >parallel (make -j16) on OCTEON EBH5600 board (8 cores, 4 GB RAM) with
> >Linux 4.6.
> >
> >It seems that CONFIG_TRANSPARENT_HUGEPAGE has something to do with the
> >issue - disabling it makes build go through fine.
> >
> >Any ideas?
> >
> >A.
> >
> >[ 2457.467155] Got mcheck at 00000001200a82b4
> >[ 2457.479447] CPU: 6 PID: 15916 Comm: lib/unicore/mkt Not tainted 4.6.0-octeon-distro.git-v2.16-1-gfc3b10e-dirty-00001-g16a7aa0 #1
> >[ 2457.514121] task: 80000000eccf2b80 ti: 80000000ecda4000 task.ti: 80000000ecda4000
> >[ 2457.536551] $ 0   : 0000000000000000 3e000000105bc006 0000000000000000 ffffffff957e4728
> >[ 2457.560686] $ 4   : 00000000000000f2 0000000000000067 000000012015e8ab 00000000332295cf
> >[ 2457.584822] $ 8   : 0000000000000000 0000000000000000 0000000000000001 0000000000000003
> >[ 2457.608957] $12   : 00000001204e04d8 0000000000000008 0000000000000001 ffffffffffffffff
> >[ 2457.633093] $16   : 0000000120383d60 00000001203a3828 00000000332295cf 000000000000000b
> >[ 2457.657228] $20   : 000000012015e8a0 0000000000000000 000000000000000c 0000000000000000
> >[ 2457.681363] $24   : 0000000000000010 00000001200a80e8
> >[ 2457.705496] $28   : 00000001201a0300 000000ffffda82a0 000000012019b9b8 0000000120039f5c
> >[ 2457.729631] Hi    : 0000000000000000
> >[ 2457.740341] Lo    : 0000000000000008
> >[ 2457.751055] epc   : 00000001200a82b4 0x1200a82b4
> >[ 2457.764891] ra    : 0000000120039f5c 0x120039f5c
> >[ 2457.778726] Status: 00308cf3	KX SX UX USER EXL IE
> >[ 2457.793284] Cause : 00800060 (ExcCode 18)
> >[ 2457.805296] PrId  : 000d0409 (Cavium Octeon+)
> >[ 2457.818350] Index    : 80000000
> >[ 2457.827759] PageMask : 1fe000
> >[ 2457.836646] EntryHi  : 00000001203820f4
> >[ 2457.848136] EntryLo0 : 00000000105b8006
> >[ 2457.859628] EntryLo1 : 00000000105bc006
> >[ 2457.871119] Wired    : 0
> >[ 2457.878704] PageGrain: e0000000
> >[ 2457.888111]
> >[ 2457.892573] Index: 25 pgmask=4kb va=00120456000 asid=f4
> >[ 2457.908256] 	[ri=0 xi=0 pa=000e47d3000 c=0 d=1 v=1 g=0] [ri=0 xi=0 pa=000c31bc000 c=0 d=1 v=1 g=0]
> >[ 2457.935230] Index: 26 pgmask=4kb va=001200a8000 asid=f4
> >[ 2457.950915] 	[ri=0 xi=0 pa=000e0e1c000 c=0 d=0 v=1 g=0] [ri=0 xi=0 pa=000c50ed000 c=0 d=0 v=1 g=0]
> >[ 2457.977888] Index: 27 pgmask=4kb va=001203a2000 asid=f4
> >[ 2457.993574] 	[ri=0 xi=0 pa=00000000000 c=0 d=0 v=0 g=0] [ri=0 xi=1 pa=0009005a000 c=1 d=0 v=1 g=0]
> >[ 2458.020548]
> >[ 2458.025008]
> >Code: de100000  1200001c  00000000 <de110008> 8e220000  1452fffa  00000000  8e220004  1453fff7
> >[ 2458.054470] Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
> >[ 2458.087614] ---[ end Kernel panic - not syncing: Caught Machine Check exception - caused by multiple matching entries in the TLB.
> >[ 2458.122835]
> >do_page_fault(): sending SIGSEGV to make for invalid write access to 0000000000000012[ 2458.149565]
> >[ 2458.149565] do_page_fault(): sending SIGSEGV to miniperl for invalid write access to 0000000000000010epc = 0000000120089500 in miniperl[120000000+181000]ra  = 00000001200c18a4 in miniperl[120000000+181000][ 2458.149590]
> >
> >[ 2458.212999] epc = 0000000120015400 in make[120000000+35000]
> >[ 2458.229780] ra  = 000000ffeca7f570 in[ 2458.240797]
> >
> >*** NMI Watchdog interrupt on Core 0x0 ***
> >
> >A.
> >
> >
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: THP broken on OCTEON?
  2016-06-23 12:08     ` Aaro Koskinen
  (?)
@ 2016-06-24 11:38     ` Joshua Kinard
  -1 siblings, 0 replies; 28+ messages in thread
From: Joshua Kinard @ 2016-06-24 11:38 UTC (permalink / raw)
  To: Aaro Koskinen, David Daney; +Cc: Ralf Baechle, linux-mips, Kirill A. Shutemov

On 06/23/2016 08:08, Aaro Koskinen wrote:
> Hi,
> 
> On Wed, Jun 22, 2016 at 03:05:05PM -0700, David Daney wrote:
>> This is caused by a config bug.
>>
>> For THP to work you must have both:
>>
>> CONFIG_TRANSPARENT_HUGEPAGE=y
>> and
>> CONFIG_HUGETLBFS=y
> 
> Oh... I guess this is with MIPS only?
> 
>> Please try testing with both of those set as well as applying:
>>
>> https://www.linux-mips.org/archives/linux-mips/2016-06/msg00397.html
> 
> Works! Now the system is stable. The EBH5600 built a dozen different packages
> without any issues, with THP in use:
> 
> root@localhost:~$ grep thp /proc/vmstat 
> thp_fault_alloc 2271
> thp_fault_fallback 0
> thp_collapse_alloc 2049
> thp_collapse_alloc_failed 0
> thp_split_page 0
> thp_split_page_failed 0
> thp_deferred_split_page 3996
> thp_split_pmd 186
> thp_zero_page_alloc 0
> thp_zero_page_alloc_failed 0
> 
> Thanks a lot,
> 
> A. 
> 

The IP27 case still seems to be broken with this patch.  It now triggers a HUB
error interrupt instead of a bus error, so I guess that's an improvement in a
sense.  I am still reworking the entire IP27 code base, though, so I'll add
this to the list of things to hunt down.  I'll have to add some code to read
HUB's cause register and work out which error bit got set.

Have not tried the IP30/Octane case yet.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-06-24 11:39 UTC | newest]

Thread overview: 28+ messages
2016-05-23 15:13 THP broken on OCTEON? Aaro Koskinen
2016-05-23 15:20 ` Ralf Baechle
2016-05-23 16:21   ` David Daney
2016-05-23 18:52     ` Aaro Koskinen
2016-05-23 19:03       ` David Daney
2016-05-23 19:03         ` David Daney
2016-05-23 19:08       ` Joshua Kinard
2016-05-23 20:02         ` Alastair Bridgewater
2016-05-23 18:57   ` Joshua Kinard
2016-05-23 19:22     ` Ralf Baechle
2016-05-23 19:40       ` Joshua Kinard
2016-05-23 20:01         ` Ralf Baechle
2016-05-24 21:21         ` Aaro Koskinen
2016-05-24 22:39           ` David Daney
2016-05-25 13:41 ` Aaro Koskinen
2016-05-26  9:33   ` Joshua Kinard
2016-05-26 13:36     ` Aaro Koskinen
2016-05-26 17:59   ` David Daney
2016-05-26 19:23     ` Aaro Koskinen
2016-05-26 22:13       ` David Daney
2016-05-27 17:14         ` Aaro Koskinen
2016-05-27 21:03           ` Joshua Kinard
2016-05-27 22:05             ` Aaro Koskinen
2016-05-27 22:22               ` Joshua Kinard
2016-06-22 22:05 ` David Daney
2016-06-23 12:08   ` Aaro Koskinen
2016-06-23 12:08     ` Aaro Koskinen
2016-06-24 11:38     ` Joshua Kinard