From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
To: Andrea Arcangeli <andrea@novell.com>
Cc: Chris Ross <chris@tebibyte.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Nick Piggin <piggin@cyberone.com.au>,
	Rik van Riel <riel@redhat.com>,
	Martin MOKREJŠ <mmokrejs@ribosome.natur.cuni.cz>,
	tglx@linutronix.de, akpm@osdl.org
Subject: Re: [PATCH] fix spurious OOM kills
Date: Sun, 14 Nov 2004 18:21:55 -0200
Message-ID: <20041114202155.GB2764@logos.cnet>
In-Reply-To: <20041114170339.GB13733@dualathlon.random>


Hi,

On Sun, Nov 14, 2004 at 06:03:39PM +0100, Andrea Arcangeli wrote:
> On Sun, Nov 14, 2004 at 07:44:17AM -0200, Marcelo Tosatti wrote:
> > Well, I'll wait for your correct and definitive approach.
> 
> I doubt my patch will be definitive and you're welcome to keep hacking
> without waiting ;). There are various problems: one issue is the
> try_to_free_pages side; the other severe and obvious bug is the
> invocation of the oom killer in vmscan.c, which cannot know if enough
> memory has already been freed by racing tasks (other contexts, and
> kswapd).
> 
> I'm looking at the latter first since that is easy to fix, but as you
> said the zone->all_unreclaimable side is hard and it may be buggy too, so
> I doubt my fix will be definitive ;) but it'll be worth a try. Feel free
> to work on the zone->all_unreclaimable side, for example.
> 
> Especially the fact that your patch didn't help makes me think my first
> patch also won't fix it, and we'll have to look into the
> try_to_free_pages internals to fix it completely.

I'm not sure about Chris's case - Nick mentions that the per-zone free page
accounting corruption bug might be interfering with the system.

If that's not the case, increasing the all_unreclaimable "timer" to a value
higher than 5 seconds should delay the OOM killer to the point where it is
not triggered until the VM's reclaim efforts make progress.


> My patch compared to yours will only save .text/.data/.bss bloat (i.e.
> the opposite of what Martin was worried about) by avoiding message
> passing via a global variable w/o locks from task context to kswapd.
> 
> Chris, since you can reproduce so easily, could you try a `vmstat 1`
> while the oom killing happens, and can you post it?

Chris, can you change the "500*HZ" in the balance_pgdat() function in
mm/vmscan.c to "1000*HZ" and see what you get, please?
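
(For reference, the check in question is shaped roughly like this - only a
sketch from memory, and the field name is illustrative, it may not match
the actual source:

	/* in balance_pgdat(): clear the flag once the "timer" expires */
	if (zone->all_unreclaimable &&
	    time_after(jiffies, zone->all_unreclaimable_set + 500*HZ))
		zone->all_unreclaimable = 0;

the experiment is simply replacing 500*HZ with 1000*HZ there.)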

> We've also to differentiate between true early-oom kills and genuine
> oom-kills. The oom killer triggering is not always a mistake ;). Chris,
> if you could post the vmstat 1 output, that would help us be sure it's
> really a kernel bug (if you already posted it in another thread just let
> me know and I'll search for such an email ;). Thanks!

This uncompiled/untested patch moves the OOM killer call to page_alloc.c,
as you strongly advocate.

It changes shrink_caches() to return 1 if it has scanned all zones in the
allocation zonelist. That is, it returns 1 only if none of the eligible
zones has zone->all_unreclaimable set.

try_to_free_pages() then uses the return value from shrink_caches() and the
gfp_mask's __GFP_FS capability (meaning the task can write out data) to
decide whether it should return "-2" (just a magic value meaning "we have
scanned all eligible zones down to priority zero without success").

This magic return value from try_to_free_pages() tells the allocator that
it can trigger the OOM killer, because reclaim has tried "hard enough" to
free pages.

This has to work in combination with the all_unreclaimable timer (which
turns zone->all_unreclaimable off again after it has been set for a certain
period).

The problem is that, in the meantime, kswapd can turn all_unreclaimable
back ON for any of the zonelist's zones:

	if (zone->pages_scanned >= (zone->nr_active + zone->nr_inactive) * 4)
		zone->all_unreclaimable = 1;

which in turn can make shrink_caches() not return 1, meaning the OOM killer
will fail to be triggered.

See the problem?
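
To illustrate, here is a hypothetical interleaving (not an observed trace):

	allocating task                       kswapd
	---------------                       ------
	timer clears all_unreclaimable
	shrink_caches() scans the zone,
	reclaims nothing                      pages_scanned crosses the
	                                      threshold, sets
	                                      all_unreclaimable = 1 again
	next pass skips the zone
	("Let kswapd poll it"), so
	scanned_all = 0: no -2, no OOM kill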

We need some way to account for per-zone page reclaim effort globally, not
per-thread, so that per-zone OOM can be detected reliably.

Such accounting would make a reliable OOM decision possible with both
multiple reclaiming threads _and_ the per-zone all_unreclaimable logic.
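
Something along these lines is what I have in mind - purely a sketch, the
field name and the way it is hooked up are invented for illustration:

	/* illustrative only - not part of the patch below */
	struct zone {
		...
		atomic_t nr_scanned_global; /* summed over ALL reclaimers */
	};

	/* every reclaimer (direct reclaim and kswapd) adds its effort
	   in shrink_zone(): */
	atomic_add(sc->nr_scanned, &zone->nr_scanned_global);

	/* the OOM decision then looks at the combined effort, instead
	   of each thread guessing from its own private progress: */
	if (atomic_read(&zone->nr_scanned_global) >
			(zone->nr_active + zone->nr_inactive) * 4)
		zone->all_unreclaimable = 1; /* reliably out of memory */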

The easy alternative is to assume that kswapd will detect OOM reliably
(which is what I've tried). But that cannot always be true, as you have
noted.

It's not trivial, but we will get it right.

Are you guys following me, or am I flying high alone here?

--- vmscan.c.ori	2004-11-14 19:51:30.572496864 -0200
+++ vmscan.c	2004-11-14 20:04:38.022786224 -0200
@@ -854,10 +854,11 @@
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static void
+static int
 shrink_caches(struct zone **zones, struct scan_control *sc)
 {
 	int i;
+	int scanned_all = 1;
 
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -872,11 +873,15 @@
 		if (zone->prev_priority > sc->priority)
 			zone->prev_priority = sc->priority;
 
-		if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
+		if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY) {
+			scanned_all = 0;
 			continue;	/* Let kswapd poll it */
+		}
 
 		shrink_zone(zone, sc);
 	}
+
+	return scanned_all;
 }
  
 /*
@@ -901,7 +906,7 @@
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
 	unsigned long lru_pages = 0;
-	int i;
+	int i, scanned_all;
 
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
@@ -923,7 +928,7 @@
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
 		sc.priority = priority;
-		shrink_caches(zones, &sc);
+		scanned_all = shrink_caches(zones, &sc);
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
 		if (reclaim_state) {
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -952,8 +957,8 @@
 		if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
 			blk_congestion_wait(WRITE, HZ/10);
 	}
-	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
-		out_of_memory(gfp_mask);
+	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY) && scanned_all)
+		ret = -2;
 out:
 	for (i = 0; zones[i] != 0; i++) {
 		struct zone *zone = zones[i];
--- page_alloc.c.orig	2004-11-14 19:50:54.536975096 -0200
+++ page_alloc.c	2004-11-14 20:11:37.254053384 -0200
@@ -773,6 +773,7 @@
 	int alloc_type;
 	int do_retry;
 	int can_try_harder;
+	int reclaim_failure = 0;
 
 	might_sleep_if(wait);
 
@@ -857,7 +858,8 @@
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	try_to_free_pages(zones, gfp_mask, order);
+	if (try_to_free_pages(zones, gfp_mask, order) == -2)
+		reclaim_failure = 1;
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
@@ -892,6 +894,8 @@
 			do_retry = 1;
 	}
 	if (do_retry) {
+		if (reclaim_failure)
+			out_of_memory(gfp_mask);
 		blk_congestion_wait(WRITE, HZ/50);
 		goto rebalance;
 	}

