From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755786Ab0JSBPK (ORCPT ); Mon, 18 Oct 2010 21:15:10 -0400
Received: from mail-iw0-f174.google.com ([209.85.214.174]:40790 "EHLO
        mail-iw0-f174.google.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1753430Ab0JSBPJ convert rfc822-to-8bit
        (ORCPT ); Mon, 18 Oct 2010 21:15:09 -0400
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
        :cc:content-type:content-transfer-encoding;
        b=V27otgTY4TEYEYb6FurzcxA+VmQZEwGXHqxvTjhUneVUBI9CHmrOCns41K720/XVez
        orwR4tgg3rNnFfwN5t/o0HYh036zFRgTj85+h1Dv8K9iuWRnlK4dd+dP6yh9uH8/6mSD
        UKHQul9sl/TmbmUXeQQbH1L8oWA8aTX4M61n0=
MIME-Version: 1.0
In-Reply-To: <20101019095144.A1B0.A69D9226@jp.fujitsu.com>
References: <20101019093142.509d6947@notabene>
        <20101018154137.90f5325f.akpm@linux-foundation.org>
        <20101019095144.A1B0.A69D9226@jp.fujitsu.com>
Date: Tue, 19 Oct 2010 10:15:06 +0900
Message-ID:
Subject: Re: Deadlock possibly caused by too_many_isolated.
From: Minchan Kim
To: KOSAKI Motohiro
Cc: Andrew Morton, Neil Brown, Wu Fengguang, Rik van Riel,
        KAMEZAWA Hiroyuki, "linux-kernel@vger.kernel.org",
        "linux-mm@kvack.org", "Li, Shaohua"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 19, 2010 at 9:57 AM, KOSAKI Motohiro wrote:
>> > I think there are two bugs here.
>> > The raid1 bug that Torsten mentions is certainly real (and has been around
>> > for an embarrassingly long time).
>> > The bug that I identified in too_many_isolated is also a real bug and can be
>> > triggered without md/raid1 in the mix.
>> > So this is not a 'full fix' for every bug in the kernel :-), but it could
>> > well be a full fix for this particular bug.
>> >
>>
>> Can we just delete the too_many_isolated() logic?  (Crappy comment
>> describes what the code does but not why it does it).
>
> If I remember correctly, we got a bug report about 1-2 years ago that LTP
> could trigger mysterious OOM killer invocations. If too many processes are
> in the reclaim path, all of the reclaimable pages can be isolated, so the
> last reclaimer finds that the system has no reclaimable pages and invokes
> the OOM killer. We have a strong motivation to avoid false-positive OOM.
> Some discussion then produced this patch.
>
> If I remember incorrectly, I hope Wu or Rik will correct me.

AFAIR, that's right.

How about this?
It is rather more aggressive throttling than the old code (i.e., it works at
zone granularity rather than LRU-type granularity).
But I think it can prevent the unnecessary OOM problem and solve the deadlock
problem.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..acd6a65 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1961,6 +1961,21 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+/*
+ * Are there way too many processes reclaiming this zone?
+ */
+static int too_many_isolated_zone(struct zone *zone)
+{
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+		zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2054,10 +2069,11 @@ rebalance:
 		goto got_pg;
 
 	/*
-	 * If we failed to make any progress reclaiming, then we are
-	 * running out of options and have to consider going OOM
+	 * If we failed to make any progress reclaiming and there aren't
+	 * many parallel reclaimers, then we are running out of options and
+	 * have to consider going OOM
 	 */
-	if (!did_some_progress) {
+	if (!did_some_progress && !too_many_isolated_zone(preferred_zone)) {
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			if (oom_killer_disabled)
 				goto nopage;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..f2109af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,31 +1129,6 @@ int isolate_lru_page(struct page *page)
 }
 
 /*
- * Are there way too many processes in the direct reclaim path already?
- */
-static int too_many_isolated(struct zone *zone, int file,
-		struct scan_control *sc)
-{
-	unsigned long inactive, isolated;
-
-	if (current_is_kswapd())
-		return 0;
-
-	if (!scanning_global_lru(sc))
-		return 0;
-
-	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
-	}
-
-	return isolated > inactive;
-}
-
-/*
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
@@ -1290,15 +1265,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
-	}
-
-
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
-- 
Kind regards,
Minchan Kim
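
To make the decision logic of the patch above easier to follow outside the kernel tree, here is a minimal user-space sketch of the two checks it combines. It is only an illustration, not the kernel code: struct zone_counters and should_consider_oom() are hypothetical names invented for this sketch, and the plain struct fields stand in for the values the patch reads with zone_page_state().

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the per-zone counters that the patch reads
 * through zone_page_state(); this struct does not exist in the kernel.
 */
struct zone_counters {
	unsigned long nr_inactive_file;
	unsigned long nr_inactive_anon;
	unsigned long nr_isolated_file;
	unsigned long nr_isolated_anon;
};

/*
 * Mirrors the zone-wide check from the patch: file and anon pages are
 * summed, so the comparison is per zone rather than per LRU type.
 */
static bool too_many_isolated_zone(const struct zone_counters *z)
{
	unsigned long inactive = z->nr_inactive_file + z->nr_inactive_anon;
	unsigned long isolated = z->nr_isolated_file + z->nr_isolated_anon;

	return isolated > inactive;
}

/*
 * Hypothetical helper modelling the patched condition in
 * __alloc_pages_slowpath(): OOM is considered only when reclaim made no
 * progress and the zone is not dominated by pages isolated by other
 * reclaimers.
 */
static bool should_consider_oom(bool did_some_progress,
				const struct zone_counters *z)
{
	return !did_some_progress && !too_many_isolated_zone(z);
}
```

With this model, a zone where parallel reclaimers have isolated more pages than remain on the inactive lists never reaches the OOM path, which is the false-positive OOM case described in the quoted text. The sketch leaves out the other conditions the slow path checks after this test (gfp flags, oom_killer_disabled).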