* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
       [not found] <bug-203715-27@https.bugzilla.kernel.org/>
@ 2019-05-29 23:04 ` Andrew Morton
  2019-06-04 11:05   ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2019-05-29 23:04 UTC (permalink / raw)
  To: Mel Gorman; +Cc: bugzilla-daemon, linux-mm, gabriele balducci



(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Mel, we may have a regression from e332f741a8dd1 ("mm, compaction: be
selective about what pageblocks to clear skip hints").  The crash sure
looks like the one which 60fce36afa9c77c7 ("mm/compaction.c: correct
zone boundary handling when isolating pages from a pageblock") fixed,
but Gabriele can reproduce it with 5.1.5.  I've confirmed that 5.1.5
has 60fce36afa9c77c7.

Thanks.

On Mon, 27 May 2019 10:12:30 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=203715
> 
>             Bug ID: 203715
>            Summary: BUG: unable to handle kernel NULL pointer dereference
>                     under stress (possibly related to
>                     https://lkml.org/lkml/2019/5/24/292 ?)
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 5.1+
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: balducci@units.it
>         Regression: No
> 
> Created attachment 282949
>   --> https://bugzilla.kernel.org/attachment.cgi?id=282949&action=edit
> crash log n.1
> 
> hello
> 
> since 5.1 I'm getting machine freezes like:
> 
>     May  7 18:00:10 dschgrazlin3 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>     May  7 18:00:10 dschgrazlin3 kernel: #PF error: [normal kernel read fault]
>     May  7 18:00:10 dschgrazlin3 kernel: PGD 0 P4D 0 
>     May  7 18:00:10 dschgrazlin3 kernel: Oops: 0000 [#1] SMP
>     May  7 18:00:10 dschgrazlin3 kernel: CPU: 3 PID: 44 Comm: kswapd0 Not tainted 5.1.0 #1
>     May  7 18:00:10 dschgrazlin3 kernel: Hardware name: System manufacturer System Product Name/F2A85-M PRO, BIOS 5104 09/14/2012
>     May  7 18:00:10 dschgrazlin3 kernel: RIP: 0010:__reset_isolation_pfn+0x2cb/0x410
>     [...]
>     May  7 18:00:10 dschgrazlin3 kernel: Call Trace:
>     May  7 18:00:10 dschgrazlin3 kernel:  __reset_isolation_suitable+0x95/0x110
>     May  7 18:00:10 dschgrazlin3 kernel:  ? __wake_up_common_lock+0xd0/0xd0
>     May  7 18:00:10 dschgrazlin3 kernel:  reset_isolation_suitable+0x34/0x40
>     May  7 18:00:10 dschgrazlin3 kernel:  kswapd+0xad/0x2c0
>     May  7 18:00:10 dschgrazlin3 kernel:  ? __wake_up_common_lock+0xd0/0xd0
>     May  7 18:00:10 dschgrazlin3 kernel:  ? balance_pgdat+0x440/0x440
>     May  7 18:00:10 dschgrazlin3 kernel:  kthread+0xff/0x120
>     May  7 18:00:10 dschgrazlin3 kernel:  ? __kthread_create_on_node+0x1b0/0x1b0
>     May  7 18:00:10 dschgrazlin3 kernel:  ret_from_fork+0x1f/0x30
>     May  7 18:00:10 dschgrazlin3 kernel: CR2: 0000000000000000
>     May  7 18:00:10 dschgrazlin3 kernel: ---[ end trace 075fb7a28df7d1d4 ]---
>     May  7 18:00:10 dschgrazlin3 kernel: RIP: 0010:__reset_isolation_pfn+0x2cb/0x410
>     [...]
> 
> (complete logs attached)
> 
> I started having this during firefox build, but experienced it during
> other build processes (mesa, gcc). The problem always appears under
> heavy load of the machine.
> 
> Unfortunately, the problem cannot be triggered with probability=1,
> although firefox build triggers the machine freeze almost always (at
> random points of the build, though)
> 
> I experience the problem on two twin boxes, which makes me exclude HW
> issues.
> 
> Absolutely no problems when running kernels <5.1 (<=5.0.15)
> 
> In some cases, I got the kernel screams without complete machine freeze,
> but with heavily reduced functionality of the whole system (eg ls
> command hanging)
> 
> Due to the issue not being always reproducible, bisection isn't 100%
> reliable; however the first bad commit seems to be
> e332f741a8dd1ec9a6dc8aa997296ecbfe64323e
> 
> I'll be happy to provide any other file/information which might be
> useful
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-05-29 23:04 ` [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?) Andrew Morton
@ 2019-06-04 11:05   ` Mel Gorman
  2019-06-04 11:43     ` balducci
  2019-06-05 12:38     ` balducci
  0 siblings, 2 replies; 10+ messages in thread
From: Mel Gorman @ 2019-06-04 11:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: bugzilla-daemon, linux-mm, gabriele balducci

On Wed, May 29, 2019 at 04:04:23PM -0700, Andrew Morton wrote:
> 
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> Mel, we may have a regression from e332f741a8dd1 ("mm, compaction: be
> selective about what pageblocks to clear skip hints").  The crash sure
> looks like the one which 60fce36afa9c77c7 ("mm/compaction.c: correct
> zone boundary handling when isolating pages from a pageblock") fixed,
> but Gabriele can reproduce it with 5.1.5.  I've confirmed that 5.1.5
> has 60fce36afa9c77c7.
> 

Sorry, I was on holidays and only playing catchup now. Does this happen
to trigger with 5.2-rc3? I ask because there were other fixes in there
with stable cc'd that have not been picked up yet. They are a poor match
for this particular bug but it would be nice to confirm.

-- 
Mel Gorman
SUSE Labs



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-04 11:05   ` Mel Gorman
@ 2019-06-04 11:43     ` balducci
  2019-06-05 12:38     ` balducci
  1 sibling, 0 replies; 10+ messages in thread
From: balducci @ 2019-06-04 11:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: bugzilla-daemon, linux-mm, akpm

> Sorry, I was on holidays and only playing catchup now. Does this happen
> to trigger with 5.2-rc3? I ask because there were other fixes in there
> with stable cc'd that have not been picked up yet. They are a poor match
> for this particular bug but it would be nice to confirm.

I'll test 5.2-rc3 as soon as possible and report the results

thanks
ciao
-g



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-04 11:05   ` Mel Gorman
  2019-06-04 11:43     ` balducci
@ 2019-06-05 12:38     ` balducci
  2019-06-05 12:48       ` Mel Gorman
  2019-06-05 17:21       ` Mel Gorman
  1 sibling, 2 replies; 10+ messages in thread
From: balducci @ 2019-06-05 12:38 UTC (permalink / raw)
  To: Mel Gorman; +Cc: bugzilla-daemon, linux-mm, akpm

hello

> Sorry, I was on holidays and only playing catchup now. Does this happen
> to trigger with 5.2-rc3? I ask because there were other fixes in there
> with stable cc'd that have not been picked up yet. They are a poor match
> for this particular bug but it would be nice to confirm.

I have built v5.2-rc3 from git (stable/linux-stable.git) and tested it
against firefox-67.0.1 build: no joy. 

I'm going to upload the kernel log and the config I used for v5.2-rc3
(there were a couple of new opts) to bugzilla, if that can help

thanks
ciao
-g



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-05 12:38     ` balducci
@ 2019-06-05 12:48       ` Mel Gorman
  2019-06-05 17:21       ` Mel Gorman
  1 sibling, 0 replies; 10+ messages in thread
From: Mel Gorman @ 2019-06-05 12:48 UTC (permalink / raw)
  To: balducci; +Cc: bugzilla-daemon, linux-mm, akpm

On Wed, Jun 05, 2019 at 02:38:55PM +0200, balducci@units.it wrote:
> hello
> 
> > Sorry, I was on holidays and only playing catchup now. Does this happen
> > to trigger with 5.2-rc3? I ask because there were other fixes in there
> > with stable cc'd that have not been picked up yet. They are a poor match
> > for this particular bug but it would be nice to confirm.
> 
> I have built v5.2-rc3 from git (stable/linux-stable.git) and tested it
> against firefox-67.0.1 build: no joy. 
> 
> I'm going to upload the kernel log and the config I used for v5.2-rc3
> (there were a couple of new opts) to bugzilla, if that can help
> 

Yes, that would be helpful. Thanks.

-- 
Mel Gorman
SUSE Labs



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-05 12:38     ` balducci
  2019-06-05 12:48       ` Mel Gorman
@ 2019-06-05 17:21       ` Mel Gorman
  2019-06-06 13:20         ` balducci
  1 sibling, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2019-06-05 17:21 UTC (permalink / raw)
  To: balducci; +Cc: bugzilla-daemon, linux-mm, akpm

On Wed, Jun 05, 2019 at 02:38:55PM +0200, balducci@units.it wrote:
> hello
> 
> > Sorry, I was on holidays and only playing catchup now. Does this happen
> > to trigger with 5.2-rc3? I ask because there were other fixes in there
> > with stable cc'd that have not been picked up yet. They are a poor match
> > for this particular bug but it would be nice to confirm.
> 
> I have built v5.2-rc3 from git (stable/linux-stable.git) and tested it
> against firefox-67.0.1 build: no joy. 
> 
> I'm going to upload the kernel log and the config I used for v5.2-rc3
> (there were a couple of new opts) to bugzilla, if that can help
> 

Can you try the following compile-tested only patch please?

diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..b3f18084866c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -277,8 +277,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
 	}
 
 	/* Ensure the end of the pageblock or zone is online and valid */
-	block_pfn += pageblock_nr_pages;
-	block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
+	block_pfn = min(pageblock_end_pfn(block_pfn), zone_end_pfn(zone) - 1);
 	end_page = pfn_to_online_page(block_pfn);
 	if (!end_page)
 		return false;
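
A note on the arithmetic in this hunk, as a minimal userspace sketch
rather than kernel code. It assumes pageblock_start_pfn() and
pageblock_end_pfn() round down and up to a pageblock boundary (which is
how the helpers in mm/compaction.c are understood to behave) and assumes
an order-9 pageblock of 512 pages. The names end_pfn_old(),
end_pfn_new(), min_ul() and check_aligned_case() are hypothetical and
exist only for this illustration.

```c
#include <assert.h>

/* Assumed for illustration: order-9 pageblocks, i.e. 512 pages each. */
#define PAGEBLOCK_ORDER    9UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

/* Userspace stand-in: round pfn down to the start of its pageblock. */
static inline unsigned long pageblock_start_pfn(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

/* Userspace stand-in: round (pfn + 1) up to the next pageblock boundary. */
static inline unsigned long pageblock_end_pfn(unsigned long pfn)
{
	return (pfn + 1 + PAGEBLOCK_NR_PAGES - 1) & ~(PAGEBLOCK_NR_PAGES - 1);
}

static inline unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Removed lines: add one pageblock, then clamp to the last pfn of the zone. */
static inline unsigned long end_pfn_old(unsigned long block_pfn,
					unsigned long zone_end)
{
	return min_ul(block_pfn + PAGEBLOCK_NR_PAGES, zone_end - 1);
}

/* Added line: round up to the pageblock end, then clamp the same way. */
static inline unsigned long end_pfn_new(unsigned long block_pfn,
					unsigned long zone_end)
{
	return min_ul(pageblock_end_pfn(block_pfn), zone_end - 1);
}

/* For a pageblock-aligned input the two forms agree. */
static inline void check_aligned_case(void)
{
	assert(end_pfn_old(1024, 1UL << 20) == end_pfn_new(1024, 1UL << 20));
}
```

If block_pfn is already pageblock-aligned when this hunk runs (an
assumption; the hunk does not show where it is set), both forms produce
the same end pfn, for example 1536 for block_pfn 1024. They differ only
for an unaligned input: 1030 gives 1542 with the old form and 1536 with
the new one.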

-- 
Mel Gorman
SUSE Labs



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-05 17:21       ` Mel Gorman
@ 2019-06-06 13:20         ` balducci
  2019-06-06 14:44           ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: balducci @ 2019-06-06 13:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: bugzilla-daemon, linux-mm, akpm

> Can you try the following compile-tested only patch please?
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..b3f18084866c 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -277,8 +277,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
>  	}
>  
>  	/* Ensure the end of the pageblock or zone is online and valid */
> -	block_pfn += pageblock_nr_pages;
> -	block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
> +	block_pfn = min(pageblock_end_pfn(block_pfn), zone_end_pfn(zone) - 1);
>  	end_page = pfn_to_online_page(block_pfn);
>  	if (!end_page)
>  		return false;
>

Unfortunately it doesn't help: the test firefox build crashed very soon,
as before; this time the machine froze completely (I had to do a hardware
reboot) and I couldn't find any kernel log in the log files (however the
screen of the frozen console looked pretty much the same as the previous
times)

(I applied the patch on top of e577c8b64d58fe307ea4d5149d31615df2d90861,
right?)



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-06 13:20         ` balducci
@ 2019-06-06 14:44           ` Mel Gorman
  2019-06-11  9:03             ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2019-06-06 14:44 UTC (permalink / raw)
  To: balducci; +Cc: bugzilla-daemon, linux-mm, akpm

On Thu, Jun 06, 2019 at 03:20:49PM +0200, balducci@units.it wrote:
> > Can you try the following compile-tested only patch please?
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 9e1b9acb116b..b3f18084866c 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -277,8 +277,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
> >  	}
> >  
> >  	/* Ensure the end of the pageblock or zone is online and valid */
> > -	block_pfn += pageblock_nr_pages;
> > -	block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
> > +	block_pfn = min(pageblock_end_pfn(block_pfn), zone_end_pfn(zone) - 1);
> >  	end_page = pfn_to_online_page(block_pfn);
> >  	if (!end_page)
> >  		return false;
> >
> 
> Unfortunately it doesn't help: the test firefox build crashed very soon,
> as before; this time the machine froze completely (I had to do a hardware
> reboot) and I couldn't find any kernel log in the log files (however the
> screen of the frozen console looked pretty much the same as the previous
> times)
> 

Thanks.

> (I applied the patch on top of e577c8b64d58fe307ea4d5149d31615df2d90861,
> right?)

Please try the following on top of 5.2-rc3

diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..69f4ddfddfa4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -277,8 +277,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
 	}
 
 	/* Ensure the end of the pageblock or zone is online and valid */
-	block_pfn += pageblock_nr_pages;
-	block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
+	block_pfn = min(pageblock_end_pfn(block_pfn), zone_end_pfn(zone) - 1);
 	end_page = pfn_to_online_page(block_pfn);
 	if (!end_page)
 		return false;
@@ -289,7 +288,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
 	 * is necessary for the block to be a migration source/target.
 	 */
 	do {
-		if (pfn_valid_within(pfn)) {
+		if (pfn_valid(pfn)) {
 			if (check_source && PageLRU(page)) {
 				clear_pageblock_skip(page);
 				return true;



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-06 14:44           ` Mel Gorman
@ 2019-06-11  9:03             ` Mel Gorman
  2019-06-11  9:30               ` balducci
  0 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2019-06-11  9:03 UTC (permalink / raw)
  To: balducci; +Cc: bugzilla-daemon, linux-mm, akpm

On Thu, Jun 06, 2019 at 03:44:24PM +0100, Mel Gorman wrote:
> > (I applied the patch on top of e577c8b64d58fe307ea4d5149d31615df2d90861,
> > right?)
> 
> Please try the following on top of 5.2-rc3
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..69f4ddfddfa4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -277,8 +277,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
>  	}
>  
>  	/* Ensure the end of the pageblock or zone is online and valid */
> -	block_pfn += pageblock_nr_pages;
> -	block_pfn = min(block_pfn, zone_end_pfn(zone) - 1);
> +	block_pfn = min(pageblock_end_pfn(block_pfn), zone_end_pfn(zone) - 1);
>  	end_page = pfn_to_online_page(block_pfn);
>  	if (!end_page)
>  		return false;
> @@ -289,7 +288,7 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
>  	 * is necessary for the block to be a migration source/target.
>  	 */
>  	do {
> -		if (pfn_valid_within(pfn)) {
> +		if (pfn_valid(pfn)) {
>  			if (check_source && PageLRU(page)) {
>  				clear_pageblock_skip(page);
>  				return true;

Any news with this patch?

-- 
Mel Gorman
SUSE Labs



* Re: [Bug 203715] New: BUG: unable to handle kernel NULL pointer dereference under stress (possibly related to https://lkml.org/lkml/2019/5/24/292 ?)
  2019-06-11  9:03             ` Mel Gorman
@ 2019-06-11  9:30               ` balducci
  0 siblings, 0 replies; 10+ messages in thread
From: balducci @ 2019-06-11  9:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: bugzilla-daemon, linux-mm, akpm

Mel Gorman writes:
>
> Any news with this patch?
>

oops: I ran the patch and reported by email (CC'ing bugzilla): either
I botched something with the reporting mail or you missed the email...
(my report is on bugzilla, though, comment 20)

I reproduce the report here:

> no joy; I left the FF build running and found the machine frozen this
> morning; however, firefox build could apparently complete successfully;
> I can't say when exactly the problem happened, as I haven't found any
> message in the logs

I can add that since the last attempt, after rebooting into 5.0.15, I
have built a lot of software (including FF) without any problem; this
reinforces my conviction that there must be some problem in
kernels >=5.1

Does anybody else reproduce this?

thanks a lot
ciao
-g



