linux-kernel.vger.kernel.org archive mirror
* mm, vmscan: commit makes PAE kernel crash nightly (bisected)
@ 2017-01-11 10:32 Trevor Cordes
From: Trevor Cordes @ 2017-01-11 10:32 UTC
  To: linux-kernel
  Cc: Mel Gorman, Joonsoo Kim, Michal Hocko, Minchan Kim, Rik van Riel,
	Srikar Dronamraju

Hi!  I have bisected a nightly oom-killer flood and crash/hang on one of 
the boxes I admin.  It doesn't crash on the Fedora 23/24 4.7.10 kernel but 
does on any 4.8 Fedora kernel.  I did a vanilla bisect and the bug is 
here:

commit b2e18757f2c9d1cdd746a882e9878852fdec9501
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu Jul 28 15:45:37 2016 -0700

    mm, vmscan: begin reclaiming pages on a per-node basis

I bisected between:
# bad: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
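
For anyone wanting to repeat a bisect over the same range, the session can 
be driven roughly as below.  This is a minimal sketch, not the exact 
commands used for the bisect above; the helper names and the repo_dir 
argument are hypothetical, and only the two commit hashes come from this 
report.

```python
import subprocess

GOOD = "523d939ef98fd712632d93a5a2b588e477a7565e"  # Linux 4.7
BAD = "69973b830859bc6529a7a0468ba0d80ee5117826"   # Linux 4.9


def bisect_start_commands(good, bad):
    """Return the git invocations that open a bisect session."""
    return [
        ["git", "bisect", "start"],
        ["git", "bisect", "bad", bad],
        ["git", "bisect", "good", good],
    ]


def bisect_mark_command(crashed):
    """Return the git invocation that records one night's result."""
    return ["git", "bisect", "bad" if crashed else "good"]


def run(commands, repo_dir):
    """Run each command inside the kernel tree (repo_dir is hypothetical)."""
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Calling run(bisect_start_commands(GOOD, BAD), repo_dir) opens the session; 
after each test night, run([bisect_mark_command(crashed)], repo_dir) 
records the result and git checks out the next candidate to build.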

I have not tried anything newer than the 4.8.13 Fedora kernel, but if 
someone thinks this bug is already fixed in HEAD I could try that next.  
It took 3 weeks to bisect because the crash only seems to happen in the 
middle of the night, on most nights but not every night.

It does not occur on most of my other boxes, just this one.  The box is a 
bit unique in that it's running 32-bit PAE on a 64-bit capable CPU, and I 
have the memory tuned down to mem=6G in the kernel command line (I think 
it has 16GB actual).  I tuned the RAM down because around 8GB the PAE 
kernel has massive IO speed issues.
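
To confirm which mem= limit a running kernel was actually booted with, 
something like the following could be used.  It is a rough sketch under 
assumptions: the function names are hypothetical, and it only handles the 
plain mem=<number>[K|M|G] form, not any other variant of the parameter.

```python
import re

# Size suffixes handled by this sketch (binary multiples).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_mem_param(cmdline):
    """Return the mem= limit in bytes, or None if the parameter is absent."""
    match = re.search(r"(?:^|\s)mem=(\d+)([KMG]?)(?=\s|$)", cmdline, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On the affected box, parse_mem_param(read_cmdline()) would be expected to 
report the 6G limit described above, expressed in bytes.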

It is a relatively new Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz on an 
Intel S1200BTL board.  I will eventually change it to 64-bit Fedora which 
I'm sure will solve this bug, but since there's no easy upgrade path, 
that's on the backburner on this production box.

I'm sure this will be another "PAE sucks, don't use it" issue, but like I 
said, I'm currently stuck with it, and in theory the kernel shouldn't 
crash like this (I'm guessing/hoping).

I think I pinned the trigger down to big dir scans (like 
"find /bigdir-foo") running at around 3am.  It's either a remote box doing 
indexing via smbd, or rsync or rdiff-backup also doing big dir scans, or 
both.  But when I do "find /" manually I can't trigger the bug.  Very weird.

The commit notes make it sound like the author thought perhaps there could 
be a problem in some scenarios?  I guess I found the scenario.

The only discussion I found on the net regarding this commit is
https://lkml.org/lkml/2016/8/29/154
It may be somewhat relevant, but it's a bit over my head.

I'm available for testing, etc, and can usually rule out a bad kernel 
within 24 hours by just waiting for 3am to roll around.  I also have 
copious logs I can provide and screenshots of the crashes.

The box is extremely lightly loaded, and RAM use is almost always under 
1GB, and swap is 0-20k used most of the time with GBs free.  Everything 
looks great until all of a sudden oom-killer starts running and goes 
through 10-260 iterations before the system just dies.  I wrote a script 
to watch for oom-killer and issue "reboot" immediately, but 80% of the 
time the box will hang before the reboot actually manages to shut down.
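
A watcher of that general kind could look roughly like the sketch below.  
This is not the script running on the box, just an illustration; it 
assumes the kernel log reports each event with the text "invoked 
oom-killer" and that "dmesg --follow" is available, and all function names 
are hypothetical.

```python
import subprocess

# Text assumed to appear in the kernel log line for each OOM event.
OOM_MARKER = "invoked oom-killer"


def first_oom_line(lines, marker=OOM_MARKER):
    """Return the first log line containing the marker, or None."""
    for line in lines:
        if marker in line:
            return line.rstrip("\n")
    return None


def watch(lines, on_oom):
    """Call on_oom(line) at the first OOM report; return True if it fired."""
    hit = first_oom_line(lines)
    if hit is None:
        return False
    on_oom(hit)
    return True


def reboot(_line):
    """Ask the system to reboot (needs root)."""
    subprocess.run(["reboot"], check=False)


def follow_kernel_log():
    """Yield kernel log lines as they arrive, using `dmesg --follow`."""
    proc = subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE, text=True)
    yield from proc.stdout
```

Running watch(follow_kernel_log(), reboot) as root would block until the 
first report and then request a reboot.  Keeping the action as a callback 
means the matching logic can be checked against saved logs without 
actually rebooting anything.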

Any information/help I can provide, please just holler.  Thanks!


