From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966068Ab2EQVu0 (ORCPT ); Thu, 17 May 2012 17:50:26 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:34321 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761163Ab2EQVuY (ORCPT ); Thu, 17 May 2012 17:50:24 -0400 Date: Thu, 17 May 2012 14:50:22 -0700 From: Andrew Morton To: David Rientjes Cc: KOSAKI Motohiro , KAMEZAWA Hiroyuki , Dave Jones , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch] mm, oom: normalize oom scores to oom_score_adj scale only for userspace Message-Id: <20120517145022.a99f41e8.akpm@linux-foundation.org> In-Reply-To: References: <20120426193551.GA24968@redhat.com> <20120503222949.GA13762@redhat.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 17 May 2012 14:33:27 -0700 (PDT) David Rientjes wrote: > The oom_score_adj scale ranges from -1000 to 1000 and represents the > proportion of memory available to the process at allocation time. This > means an oom_score_adj value of 300, for example, will bias a process as > though it was using an extra 30.0% of available memory and a value of > -350 will discount 35.0% of available memory from its usage. > > The oom killer badness heuristic also uses this scale to report the oom > score for each eligible process in determining the "best" process to > kill. Thus, it can only differentiate each process's memory usage by > 0.1% of system RAM. > > On large systems, this can end up being a large amount of memory: 256MB > on 256GB systems, for example. > > This can be fixed by having the badness heuristic to use the actual > memory usage in scoring threads and then normalizing it to the > oom_score_adj scale for userspace. This results in better comparison > between eligible threads for kill and no change from the userspace > perspective. > > ... > > @@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > } > > /* > - * The memory controller may have a limit of 0 bytes, so avoid a divide > - * by zero, if necessary. > - */ > - if (!totalpages) > - totalpages = 1; > - > - /* > * The baseline for the badness score is the proportion of RAM that each > * task's rss, pagetable and swap space use. > */ > - points = get_mm_rss(p->mm) + p->mm->nr_ptes; > - points += get_mm_counter(p->mm, MM_SWAPENTS); > - > - points *= 1000; > - points /= totalpages; > + points = get_mm_rss(p->mm) + p->mm->nr_ptes + > + get_mm_counter(p->mm, MM_SWAPENTS); > task_unlock(p); > > /* > * Root processes get 3% bonus, just like the __vm_enough_memory() > * implementation used by LSMs. > */ > - if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > - points -= 30; > + if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages) There doesn't seem much point in testing totalpages here - it's a micro-optimisation which adds a branch, on a slow path. > + points -= 30 * totalpages / 1000; > > /* > * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may > * either completely disable oom killing or always prefer a certain > * task. > */ > - points += p->signal->oom_score_adj; > + points += p->signal->oom_score_adj * totalpages / 1000; And if we *do* want to add that micro-optimisation, we may as well extend it to cover this expression also: if (totalpages) { /* reason goes here */ if (has_capability_noaudit(...)) points -= 30 * totalpages / 1000; p->signal->oom_score_adj * totalpages / 1000; } > /* > * Never return 0 for an eligible task that may be killed since it's > * possible that no single user task uses more than 0.1% of memory and > * no single admin tasks uses more than 3.0%. > */ > - if (points <= 0) > - return 1; > - return (points < 1000) ? points : 1000; > + return points <= 0 ? 1 : points; `points' is unsigned - testing it for negative looks odd. > } > > /* > @@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > { > struct task_struct *g, *p; > struct task_struct *chosen = NULL; > - *ppoints = 0; > + unsigned long chosen_points = 0; > > do_each_thread(g, p) { > unsigned int points; > @@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > */ > if (p == current) { > chosen = p; > - *ppoints = 1000; > + chosen_points = ULONG_MAX; > } else if (!force_kill) { > /* > * If this task is not being ptraced on exit, > @@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > } > > points = oom_badness(p, memcg, nodemask, totalpages); > - if (points > *ppoints) { > + if (points > chosen_points) { > chosen = p; > - *ppoints = points; > + chosen_points = points; > } > } while_each_thread(g, p); > > + *ppoints = chosen_points * 1000 / totalpages; So it's up to the select_bad_process() callers to prevent the divide-by-zero. It is unobvious that they actually do this, and this important and unobvious caller requirement is undocumented. > return chosen; > } > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 173D16B0082 for ; Thu, 17 May 2012 17:50:24 -0400 (EDT) Date: Thu, 17 May 2012 14:50:22 -0700 From: Andrew Morton Subject: Re: [patch] mm, oom: normalize oom scores to oom_score_adj scale only for userspace Message-Id: <20120517145022.a99f41e8.akpm@linux-foundation.org> In-Reply-To: References: <20120426193551.GA24968@redhat.com> <20120503222949.GA13762@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: KOSAKI Motohiro , KAMEZAWA Hiroyuki , Dave Jones , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Thu, 17 May 2012 14:33:27 -0700 (PDT) David Rientjes wrote: > The oom_score_adj scale ranges from -1000 to 1000 and represents the > proportion of memory available to the process at allocation time. This > means an oom_score_adj value of 300, for example, will bias a process as > though it was using an extra 30.0% of available memory and a value of > -350 will discount 35.0% of available memory from its usage. > > The oom killer badness heuristic also uses this scale to report the oom > score for each eligible process in determining the "best" process to > kill. Thus, it can only differentiate each process's memory usage by > 0.1% of system RAM. > > On large systems, this can end up being a large amount of memory: 256MB > on 256GB systems, for example. > > This can be fixed by having the badness heuristic to use the actual > memory usage in scoring threads and then normalizing it to the > oom_score_adj scale for userspace. This results in better comparison > between eligible threads for kill and no change from the userspace > perspective. > > ... > > @@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg, > } > > /* > - * The memory controller may have a limit of 0 bytes, so avoid a divide > - * by zero, if necessary. > - */ > - if (!totalpages) > - totalpages = 1; > - > - /* > * The baseline for the badness score is the proportion of RAM that each > * task's rss, pagetable and swap space use. > */ > - points = get_mm_rss(p->mm) + p->mm->nr_ptes; > - points += get_mm_counter(p->mm, MM_SWAPENTS); > - > - points *= 1000; > - points /= totalpages; > + points = get_mm_rss(p->mm) + p->mm->nr_ptes + > + get_mm_counter(p->mm, MM_SWAPENTS); > task_unlock(p); > > /* > * Root processes get 3% bonus, just like the __vm_enough_memory() > * implementation used by LSMs. > */ > - if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > - points -= 30; > + if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages) There doesn't seem much point in testing totalpages here - it's a micro-optimisation which adds a branch, on a slow path. > + points -= 30 * totalpages / 1000; > > /* > * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may > * either completely disable oom killing or always prefer a certain > * task. > */ > - points += p->signal->oom_score_adj; > + points += p->signal->oom_score_adj * totalpages / 1000; And if we *do* want to add that micro-optimisation, we may as well extend it to cover this expression also: if (totalpages) { /* reason goes here */ if (has_capability_noaudit(...)) points -= 30 * totalpages / 1000; p->signal->oom_score_adj * totalpages / 1000; } > /* > * Never return 0 for an eligible task that may be killed since it's > * possible that no single user task uses more than 0.1% of memory and > * no single admin tasks uses more than 3.0%. > */ > - if (points <= 0) > - return 1; > - return (points < 1000) ? points : 1000; > + return points <= 0 ? 1 : points; `points' is unsigned - testing it for negative looks odd. > } > > /* > @@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > { > struct task_struct *g, *p; > struct task_struct *chosen = NULL; > - *ppoints = 0; > + unsigned long chosen_points = 0; > > do_each_thread(g, p) { > unsigned int points; > @@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > */ > if (p == current) { > chosen = p; > - *ppoints = 1000; > + chosen_points = ULONG_MAX; > } else if (!force_kill) { > /* > * If this task is not being ptraced on exit, > @@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > } > > points = oom_badness(p, memcg, nodemask, totalpages); > - if (points > *ppoints) { > + if (points > chosen_points) { > chosen = p; > - *ppoints = points; > + chosen_points = points; > } > } while_each_thread(g, p); > > + *ppoints = chosen_points * 1000 / totalpages; So it's up to the select_bad_process() callers to prevent the divide-by-zero. It is unobvious that they actually do this, and this important and unobvious caller requirement is undocumented. > return chosen; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org