* MIPS: BUG() in isolate_lru_pages in mm/vmscan.c?
@ 2015-04-25 15:56 Joshua Kinard
  2015-04-25 18:53 ` Joshua Kinard
  0 siblings, 1 reply; 2+ messages in thread
From: Joshua Kinard @ 2015-04-25 15:56 UTC (permalink / raw)
  To: LKML, Linux MIPS List

I keep tripping a BUG() in isolate_lru_pages() at mm/vmscan.c:1345:

	switch (__isolate_lru_page(page, mode)) {
	case 0:
		nr_pages = hpage_nr_pages(page);
		mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
		list_move(&page->lru, dst);
		nr_taken += nr_pages;
		break;

	case -EBUSY:
		/* else it is being freed elsewhere */
		list_move(&page->lru, src);
		continue;

	default:
		BUG();
	}

This is on an SGI Onyx2 platform (MIPS, IP27), two node boards (4x R14000
CPUs), and 8G of RAM.  The problem appears tied to heavy disk I/O, typically
writes.  I can reproduce sometimes with a long bonnie++ run, but I haven't
gotten a recent panic() message under 4.0 yet.  Most of the time, it silently
hard-locks.  I only have serial console access at 9600bps, so the machine may
lock up before the serial driver can finish dumping the panic.

Is there any information on the purpose or triggers of this BUG()?  I went
back in git all the way to the initial 2006 commit that added this function,
but could not find any comment or explanation of just what it's protecting
against.  That makes it hard to know where to start debugging.

I've already tried switching filesystems, first ext4, now XFS.  Enabling
CONFIG_NUMA seems to make it harder to trigger, but that's only a subjective
impression.  An md RAID resync doesn't appear to trigger it either.

Help?


* Re: MIPS: BUG() in isolate_lru_pages in mm/vmscan.c?
  2015-04-25 15:56 MIPS: BUG() in isolate_lru_pages in mm/vmscan.c? Joshua Kinard
@ 2015-04-25 18:53 ` Joshua Kinard
  0 siblings, 0 replies; 2+ messages in thread
From: Joshua Kinard @ 2015-04-25 18:53 UTC (permalink / raw)
  To: LKML, Linux MIPS List

This patch, dated 2007-03-16, seems to explain things a little:
http://marc.info/?l=linux-mm-commits&m=117401513810763&w=2

> Subject: lumpy: back out removal of active check in isolate_lru_pages
> From: Andy Whitcroft <apw@shadowen.org>
> 
> As pointed out by Christop Lameter it should not be possible for a page to
> change its active/inactive state without taking the lru_lock.  Reinstate this
> safety net.
> 
> Signed-off-by: Andy Whitcroft <apw@shadowen.org>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/vmscan.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff -puN mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages mm/vmscan.c
> --- a/mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages
> +++ a/mm/vmscan.c
> @@ -686,10 +686,13 @@ static unsigned long isolate_lru_pages(u
>  			nr_taken++;
>  			break;
>  
> -		default:
> -			/* page is being freed, or is a missmatch */
> +		case -EBUSY:
> +			/* else it is being freed elsewhere */
>  			list_move(&page->lru, src);
>  			continue;
> +
> +		default:
> +			BUG();
>  		}
>  
>  		if (!order)

So if my reading is correct, the BUG() fires when a page changes its
active/inactive state without the lru_lock being held.  Given that the SGI
IP27 platform is an early NUMA machine whose nodes can sit some physical
distance apart (and thus see some latency between them), could this be a
sign of an SMP race condition specific to this platform?
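
To make the three-way dispatch in that switch concrete, here is a minimal
userspace sketch.  It is not kernel code: struct fake_page, fake_isolate()
and classify() are hypothetical names invented for illustration, and the use
of -EINVAL as the "third" return value is an assumption about how a page
that is unexpectedly not on the LRU would be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the state the check needs. */
struct fake_page {
	bool on_lru;	/* models the page still being on an LRU list */
	bool busy;	/* models "being freed elsewhere" */
};

/* The three outcomes of the switch in isolate_lru_pages(). */
enum scan_action { SCAN_TAKE, SCAN_PUT_BACK, SCAN_BUG };

/* Models the return contract: 0, -EBUSY, or some other error. */
static int fake_isolate(const struct fake_page *page)
{
	assert(page != NULL);
	if (!page->on_lru)
		return -EINVAL;	/* assumption: the value that reaches default: */
	if (page->busy)
		return -EBUSY;
	return 0;
}

/* Mirrors the switch: only 0 and -EBUSY are expected under the lock. */
static enum scan_action classify(int ret)
{
	switch (ret) {
	case 0:
		return SCAN_TAKE;	/* moved to dst, counted in nr_taken */
	case -EBUSY:
		return SCAN_PUT_BACK;	/* moved back to src, scan continues */
	default:
		return SCAN_BUG;	/* the kernel calls BUG() here */
	}
}
```

In this model the only way to reach SCAN_BUG is for the page's state to have
changed between being found on the list and being checked, which is the
situation the commit message says should be impossible with lru_lock held.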

--J
