* [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
@ 2013-02-27 20:56 ` Andrew Shewmaker
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Shewmaker @ 2013-02-27 20:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

The following patches are against the mmotm git tree as of February 27th.

The first patch only affects OVERCOMMIT_NEVER mode, entirely removing 
the 3% reserve for other user processes.

The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER 
modes, replacing the hardcoded 3% reserve for the root user with a 
tunable knob.

Signed-off-by: Andrew Shewmaker <agshew@gmail.com>

---

__vm_enough_memory reserves 3% of free pages with the default 
overcommit mode and 6% when overcommit is disabled. These hardcoded 
values have become less reasonable as memory sizes have grown.

On scientific clusters, systems are generally dedicated to one user. 
Also, overcommit is sometimes disabled in order to prevent a long 
running job from suddenly failing days or weeks into a calculation.
In this case, a user wishing to allocate as much memory as possible 
to one process may be prevented from using, for example, around 7GB 
out of 128GB.

The effect is smaller, but still significant, when a user starts a job 
with one process per core. I have repeatedly seen a set of processes 
requesting the same amount of memory fail because one of them could 
not allocate the amount of memory a user would expect to be able to 
allocate.
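As a rough userspace illustration (not kernel code) of the accounting this patch touches, the sketch below mirrors the two 1/32 divisors in __vm_enough_memory; the 4KiB page size, an overcommit_ratio of 100, and the 128GB example are assumptions for illustration only:

```c
#include <assert.h>

/* Userspace sketch of the OVERCOMMIT_NEVER accounting in
 * __vm_enough_memory(). The 4 KiB page size, overcommit_ratio of 100,
 * and 128 GiB example are illustrative assumptions, not defaults. */
long enough_memory_sketch(long total_pages, long swap_pages,
                          long process_vm_pages, int cap_sys_admin,
                          int overcommit_ratio)
{
    long allowed = total_pages * overcommit_ratio / 100;

    if (!cap_sys_admin)
        allowed -= allowed / 32;       /* ~3% reserved for root */
    allowed += swap_pages;
    allowed -= process_vm_pages / 32;  /* the reserve this patch removes */
    return allowed;
}
```

On a 128GB box (33554432 4KiB pages) with no swap, a non-root process whose total_vm spanned all of RAM would find about 2 million pages (~8GB) withheld by the two 1/32 reserves combined -- the same order of magnitude as the ~7GB figure above.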

diff --git a/mm/mmap.c b/mm/mmap.c
index d1e4124..5993f33 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= allowed / 32;
 	allowed += total_swap_pages;
 
-	/* Don't let a single process grow too big:
-	   leave 3% of the size of this process for other processes */
-	if (mm)
-		allowed -= mm->total_vm / 32;
-
 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
 		return 0;
 error:

* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-02-28 22:12   ` Andrew Morton
@ 2013-02-28  3:48     ` Andrew Shewmaker
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Shewmaker @ 2013-02-28  3:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Alan Cox

On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
> On Wed, 27 Feb 2013 15:56:30 -0500
> Andrew Shewmaker <agshew@gmail.com> wrote:
> 
> > The following patches are against the mmotm git tree as of February 27th.
> > 
> > The first patch only affects OVERCOMMIT_NEVER mode, entirely removing 
> > the 3% reserve for other user processes.
> > 
> > The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER 
> > modes, replacing the hardcoded 3% reserve for the root user with a 
> > tunable knob.
> > 
> 
> Gee, it's been years since anyone thought about the overcommit code.
> 
> Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
> "Appropriate for some scientific applications", but doesn't say why. 
> You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
> I think?  Is the documentation wrong?

None of my scientists appeared to use sparse arrays as Alan described. 
My users would run jobs that appeared to initialize correctly. However, 
they wouldn't write to every page they malloced (and they wouldn't use 
calloc), so I saw jobs failing well into a computation once the 
simulation tried to access a page and the kernel couldn't give it to them.

I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was 
the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with 
infeasible memory requirements fail early, and the OOM killer 
gets triggered much less often than in guess mode. In guess mode, 
the OOM killer more often than not seemed to kill the wrong thing, 
causing subtle brokenness. 
Disabling overcommit worked so well during the stabilization and 
early user phases that we did the same with other clusters. 

> > __vm_enough_memory reserves 3% of free pages with the default 
> > overcommit mode and 6% when overcommit is disabled. These hardcoded 
> > values have become less reasonable as memory sizes have grown.
> > 
> > On scientific clusters, systems are generally dedicated to one user. 
> > Also, overcommit is sometimes disabled in order to prevent a long 
> > running job from suddenly failing days or weeks into a calculation.
> > In this case, a user wishing to allocate as much memory as possible 
> > to one process may be prevented from using, for example, around 7GB 
> > out of 128GB.
> > 
> > The effect is less, but still significant when a user starts a job 
> > with one process per core. I have repeatedly seen a set of processes 
> > requesting the same amount of memory fail because one of them could  
> > not allocate the amount of memory a user would expect to be able to 
> > allocate.
> > 
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
> >  		allowed -= allowed / 32;
> >  	allowed += total_swap_pages;
> >  
> > -	/* Don't let a single process grow too big:
> > -	   leave 3% of the size of this process for other processes */
> > -	if (mm)
> > -		allowed -= mm->total_vm / 32;
> > -
> >  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
> >  		return 0;
> 
> So what might be the downside for this change?  root can't log in, I
> assume.  Have you actually tested for this scenario and observed the
> effects?
> 
> If there *are* observable risks and/or to preserve back-compatibility,
> I guess we could create a fourth overcommit mode which provides the
> headroom which you desire.
> 
> Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
> as well?

The downside of the first patch, which removes the "other" reserve 
(sorry about the confusing duplicated subject line), is that a user 
may not be able to kill their process, even if they have a shell prompt. 
When testing, I did sometimes get into a spot where I attempted to execute 
kill, but got: "bash: fork: Cannot allocate memory". Of course, a 
user can get in the same predicament with the current 3% reserve--they 
just have to start processes until 3% becomes negligible.

With just the first patch, root still has a 3% reserve, so they can 
still log in.

When I resubmit the second patch, adding a tunable rootuser_reserve_pages 
variable, I'll test both guess and never overcommit modes to see what 
minimum initial values allow root to log in and kill a user's memory- 
hogging process. This will be safer than the current behavior since 
root's reserve will never shrink to something useless in the case where 
a user has grabbed all available memory with many processes.

As an estimate of a useful rootuser_reserve_pages, the rss+share size of 
sshd, bash, and top is about 16MB. Overcommit disabled mode would need 
closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so 
the new tunable would still be a win.
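To make that estimate concrete, the byte-to-page conversion can be sketched as follows; the 4KiB page size is an assumption, and bytes_to_reserve_pages is a hypothetical helper for illustration, not an existing kernel symbol:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096L  /* assumed 4 KiB pages */

/* Hypothetical helper (not a kernel symbol): turn a byte budget into
 * a page count that could seed a tunable like the proposed
 * rootuser_reserve_pages. */
long bytes_to_reserve_pages(long bytes)
{
    /* round up so the reserve never falls short of the byte budget */
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

With these assumptions, 16MB works out to 4096 pages and 360MB to 92160 pages, versus roughly a million pages (~3.8GB) for 3% of a 128GB box.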

I think the tunable would benefit everyone over the current behavior, 
but would you prefer it if I only made it tunable in a fourth overcommit 
mode in order to preserve back-compatibility?

* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-02-27 20:56 ` Andrew Shewmaker
@ 2013-02-28 22:12   ` Andrew Morton
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2013-02-28 22:12 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: linux-mm, linux-kernel, Alan Cox

On Wed, 27 Feb 2013 15:56:30 -0500
Andrew Shewmaker <agshew@gmail.com> wrote:

> The following patches are against the mmotm git tree as of February 27th.
> 
> The first patch only affects OVERCOMMIT_NEVER mode, entirely removing 
> the 3% reserve for other user processes.
> 
> The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER 
> modes, replacing the hardcoded 3% reserve for the root user with a 
> tunable knob.
> 

Gee, it's been years since anyone thought about the overcommit code.

Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
"Appropriate for some scientific applications", but doesn't say why. 
You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
I think?  Is the documentation wrong?

> __vm_enough_memory reserves 3% of free pages with the default 
> overcommit mode and 6% when overcommit is disabled. These hardcoded 
> values have become less reasonable as memory sizes have grown.
> 
> On scientific clusters, systems are generally dedicated to one user. 
> Also, overcommit is sometimes disabled in order to prevent a long 
> running job from suddenly failing days or weeks into a calculation.
> In this case, a user wishing to allocate as much memory as possible 
> to one process may be prevented from using, for example, around 7GB 
> out of 128GB.
> 
> The effect is less, but still significant when a user starts a job 
> with one process per core. I have repeatedly seen a set of processes 
> requesting the same amount of memory fail because one of them could  
> not allocate the amount of memory a user would expect to be able to 
> allocate.
> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  		allowed -= allowed / 32;
>  	allowed += total_swap_pages;
>  
> -	/* Don't let a single process grow too big:
> -	   leave 3% of the size of this process for other processes */
> -	if (mm)
> -		allowed -= mm->total_vm / 32;
> -
>  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>  		return 0;

So what might be the downside for this change?  root can't log in, I
assume.  Have you actually tested for this scenario and observed the
effects?

If there *are* observable risks and/or to preserve back-compatibility,
I guess we could create a fourth overcommit mode which provides the
headroom which you desire.

Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
as well?


* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-02-28  3:48     ` Andrew Shewmaker
@ 2013-03-01  2:40       ` Ric Mason
  -1 siblings, 0 replies; 14+ messages in thread
From: Ric Mason @ 2013-03-01  2:40 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: Andrew Morton, linux-mm, linux-kernel, Alan Cox

On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:
> On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
>> On Wed, 27 Feb 2013 15:56:30 -0500
>> Andrew Shewmaker <agshew@gmail.com> wrote:
>>
>>> The following patches are against the mmotm git tree as of February 27th.
>>>
>>> The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
>>> the 3% reserve for other user processes.
>>>
>>> The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
>>> modes, replacing the hardcoded 3% reserve for the root user with a
>>> tunable knob.
>>>
>> Gee, it's been years since anyone thought about the overcommit code.
>>
>> Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
>> "Appropriate for some scientific applications", but doesn't say why.
>> You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
>> I think?  Is the documentation wrong?
> None of my scientists appeared to use sparse arrays as Alan described.
> My users would run jobs that appeared to initialize correctly. However,
> they wouldn't write to every page they malloced (and they wouldn't use
> calloc), so I saw jobs failing well into a computation once the
> simulation tried to access a page and the kernel couldn't give it to them.
>
> I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
> the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
> infeasible memory requirements fail early and the OOM killer
> gets triggered much less often than in guess mode. More often than not
> the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
> Disabling overcommit worked so well during the stabilization and
> early user phases that we did the same with other clusters.

Do you mean that OVERCOMMIT_NEVER is more suitable for scientific 
applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS, or does it 
depend on the workload? Since your users ran jobs that didn't write to 
every page they malloced, why isn't OVERCOMMIT_GUESS more suitable for you?

>
>>> __vm_enough_memory reserves 3% of free pages with the default
>>> overcommit mode and 6% when overcommit is disabled. These hardcoded
>>> values have become less reasonable as memory sizes have grown.
>>>
>>> On scientific clusters, systems are generally dedicated to one user.
>>> Also, overcommit is sometimes disabled in order to prevent a long
>>> running job from suddenly failing days or weeks into a calculation.
>>> In this case, a user wishing to allocate as much memory as possible
>>> to one process may be prevented from using, for example, around 7GB
>>> out of 128GB.
>>>
>>> The effect is less, but still significant when a user starts a job
>>> with one process per core. I have repeatedly seen a set of processes
>>> requesting the same amount of memory fail because one of them could
>>> not allocate the amount of memory a user would expect to be able to
>>> allocate.
>>>
>>> ...
>>>
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>>>   		allowed -= allowed / 32;
>>>   	allowed += total_swap_pages;
>>>   
>>> -	/* Don't let a single process grow too big:
>>> -	   leave 3% of the size of this process for other processes */
>>> -	if (mm)
>>> -		allowed -= mm->total_vm / 32;
>>> -
>>>   	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>>>   		return 0;
>> So what might be the downside for this change?  root can't log in, I
>> assume.  Have you actually tested for this scenario and observed the
>> effects?
>>
>> If there *are* observable risks and/or to preserve back-compatibility,
>> I guess we could create a fourth overcommit mode which provides the
>> headroom which you desire.
>>
>> Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
>> as well?
> The downside of the first patch, which removes the "other" reserve
> (sorry about the confusing duplicated subject line), is that a user
> may not be able to kill their process, even if they have a shell prompt.
> When testing, I did sometimes get into spot where I attempted to execute
> kill, but got: "bash: fork: Cannot allocate memory". Of course, a
> user can get in the same predicament with the current 3% reserve--they
> just have to start processes until 3% becomes negligible.
>
> With just the first patch, root still has a 3% reserve, so they can
> still log in.
>
> When I resubmit the second patch, adding a tunable rootuser_reserve_pages
> variable, I'll test both guess and never overcommit modes to see what
> minimum initial values allow root to login and kill a user's memory
> hogging process. This will be safer than the current behavior since
> root's reserve will never shrink to something useless in the case where
> a user has grabbed all available memory with many processes.

The idea of two patches looks reasonable to me.

>
> As an estimate of a useful rootuser_reserve_pages, the rss+share size of

Sorry for the silly question, but why isn't the shared size already included in the RSS?

> sshd, bash, and top is about 16MB. Overcommit disabled mode would need
> closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
> the new tunable would still be a win.
>
> I think the tunable would benefit everyone over the current behavior,
> but would you prefer it if I only made it tunable in a fourth overcommit
> mode in order to preserve back-compatibility?
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
@ 2013-03-01  2:40       ` Ric Mason
  0 siblings, 0 replies; 14+ messages in thread
From: Ric Mason @ 2013-03-01  2:40 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: Andrew Morton, linux-mm, linux-kernel, Alan Cox

On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:
> On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
>> On Wed, 27 Feb 2013 15:56:30 -0500
>> Andrew Shewmaker <agshew@gmail.com> wrote:
>>
>>> The following patches are against the mmtom git tree as of February 27th.
>>>
>>> The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
>>> the 3% reserve for other user processes.
>>>
>>> The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
>>> modes, replacing the hardcoded 3% reserve for the root user with a
>>> tunable knob.
>>>
>> Gee, it's been years since anyone thought about the overcommit code.
>>
>> Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
>> "Appropriate for some scientific applications", but doesn't say why.
>> You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
>> I think?  Is the documentation wrong?
> None of my scientists appeared to use sparse arrays as Alan described.
> My users would run jobs that appeared to initialize correctly. However,
> they wouldn't write to every page they malloced (and they wouldn't use
> calloc), so I saw jobs failing well into a computation once the
> simulation tried to access a page and the kernel couldn't give it to them.
>
> I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
> the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
> infeasible memory requirements fail early and the OOM killer
> gets triggered much less often than in guess mode. More often than not
> the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
> Disabling overcommit worked so well during the stabilization and
> early user phases that we did the same with other clusters.

Do you mean OVERCOMMIT_NEVER is more suitable for scientific application 
than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should depend on 
workload? Since your users would run jobs that wouldn't write to every 
page they malloced, so why OVERCOMMIT_GUESS is not more suitable for you?

>
>>> __vm_enough_memory reserves 3% of free pages with the default
>>> overcommit mode and 6% when overcommit is disabled. These hardcoded
>>> values have become less reasonable as memory sizes have grown.
>>>
>>> On scientific clusters, systems are generally dedicated to one user.
>>> Also, overcommit is sometimes disabled in order to prevent a long
>>> running job from suddenly failing days or weeks into a calculation.
>>> In this case, a user wishing to allocate as much memory as possible
>>> to one process may be prevented from using, for example, around 7GB
>>> out of 128GB.
>>>
>>> The effect is less, but still significant when a user starts a job
>>> with one process per core. I have repeatedly seen a set of processes
>>> requesting the same amount of memory fail because one of them could
>>> not allocate the amount of memory a user would expect to be able to
>>> allocate.
>>>
>>> ...
>>>
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>>>   		allowed -= allowed / 32;
>>>   	allowed += total_swap_pages;
>>>   
>>> -	/* Don't let a single process grow too big:
>>> -	   leave 3% of the size of this process for other processes */
>>> -	if (mm)
>>> -		allowed -= mm->total_vm / 32;
>>> -
>>>   	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>>>   		return 0;
>> So what might be the downside for this change?  root can't log in, I
>> assume.  Have you actually tested for this scenario and observed the
>> effects?
>>
>> If there *are* observable risks and/or to preserve back-compatibility,
>> I guess we could create a fourth overcommit mode which provides the
>> headroom which you desire.
>>
>> Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
>> as well?
> The downside of the first patch, which removes the "other" reserve
> (sorry about the confusing duplicated subject line), is that a user
> may not be able to kill their process, even if they have a shell prompt.
> When testing, I did sometimes get into spot where I attempted to execute
> kill, but got: "bash: fork: Cannot allocate memory". Of course, a
> user can get in the same predicament with the current 3% reserve--they
> just have to start processes until 3% becomes negligible.
>
> With just the first patch, root still has a 3% reserve, so they can
> still log in.
>
> When I resubmit the second patch, adding a tunable rootuser_reserve_pages
> variable, I'll test both guess and never overcommit modes to see what
> minimum initial values allow root to login and kill a user's memory
> hogging process. This will be safer than the current behavior since
> root's reserve will never shrink to something useless in the case where
> a user has grabbed all available memory with many processes.

The idea of two patches looks reasonable to me.

>
> As an estimate of a useful rootuser_reserve_pages, the rss+share size of

Sorry for my silly question: do you mean the shared size is not included in the RSS?

> sshd, bash, and top is about 16MB. Overcommit disabled mode would need
> closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
> the new tunable would still be a win.
>
> I think the tunable would benefit everyone over the current behavior,
> but would you prefer it if I only made it tunable in a fourth overcommit
> mode in order to preserve back-compatibility?
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-02-28  3:48     ` Andrew Shewmaker
@ 2013-03-01 17:48       ` Alan Cox
  -1 siblings, 0 replies; 14+ messages in thread
From: Alan Cox @ 2013-03-01 17:48 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: Andrew Morton, linux-mm, linux-kernel

The 3% reserve was added to the original code *because* users kept hitting
problems where they couldn't recover. 

I suspect the tunable should nowadays be something related to min(3%,
someconstant). At the time we did the 3%, I think 1GB was an "enterprise
system" ;)

Alan

* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-03-01  2:40       ` Ric Mason
@ 2013-03-01 22:41         ` Andrew Shewmaker
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Shewmaker @ 2013-03-01 22:41 UTC (permalink / raw)
  To: Ric Mason; +Cc: Andrew Morton, linux-mm, linux-kernel, Alan Cox

On Fri, Mar 01, 2013 at 10:40:43AM +0800, Ric Mason wrote:
> On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:
> >On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
> >>On Wed, 27 Feb 2013 15:56:30 -0500
> >>Andrew Shewmaker <agshew@gmail.com> wrote:
> >>
> >>>The following patches are against the mmotm git tree as of February 27th.
> >>>
> >>>The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
> >>>the 3% reserve for other user processes.
> >>>
> >>>The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
> >>>modes, replacing the hardcoded 3% reserve for the root user with a
> >>>tunable knob.
> >>>
> >>Gee, it's been years since anyone thought about the overcommit code.
> >>
> >>Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
> >>"Appropriate for some scientific applications", but doesn't say why.
> >>You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
> >>I think?  Is the documentation wrong?
> >None of my scientists appeared to use sparse arrays as Alan described.
> >My users would run jobs that appeared to initialize correctly. However,
> >they wouldn't write to every page they malloced (and they wouldn't use
> >calloc), so I saw jobs failing well into a computation once the
> >simulation tried to access a page and the kernel couldn't give it to them.
> >
> >I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
> >the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
> >infeasible memory requirements fail early and the OOM killer
> >gets triggered much less often than in guess mode. More often than not
> >the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
> >Disabling overcommit worked so well during the stabilization and
> >early user phases that we did the same with other clusters.
> 
> Do you mean OVERCOMMIT_NEVER is more suitable for scientific
> applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should it
> depend on the workload? Since your users ran jobs that wouldn't
> write to every page they malloced, why isn't OVERCOMMIT_GUESS
> more suitable for you?

It depends on the workload. They eventually wrote to every page, 
but not early in the life of the process, so they thought they 
were fine until the simulation crashed.

> >
> >>>__vm_enough_memory reserves 3% of free pages with the default
> >>>overcommit mode and 6% when overcommit is disabled. These hardcoded
> >>>values have become less reasonable as memory sizes have grown.
> >>>
> >>>On scientific clusters, systems are generally dedicated to one user.
> >>>Also, overcommit is sometimes disabled in order to prevent a long
> >>>running job from suddenly failing days or weeks into a calculation.
> >>>In this case, a user wishing to allocate as much memory as possible
> >>>to one process may be prevented from using, for example, around 7GB
> >>>out of 128GB.
> >>>
> >>>The effect is less, but still significant when a user starts a job
> >>>with one process per core. I have repeatedly seen a set of processes
> >>>requesting the same amount of memory fail because one of them could
> >>>not allocate the amount of memory a user would expect to be able to
> >>>allocate.
> >>>
> >>>...
> >>>
> >>>--- a/mm/mmap.c
> >>>+++ b/mm/mmap.c
> >>>@@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
> >>>  		allowed -= allowed / 32;
> >>>  	allowed += total_swap_pages;
> >>>-	/* Don't let a single process grow too big:
> >>>-	   leave 3% of the size of this process for other processes */
> >>>-	if (mm)
> >>>-		allowed -= mm->total_vm / 32;
> >>>-
> >>>  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
> >>>  		return 0;
> >>So what might be the downside for this change?  root can't log in, I
> >>assume.  Have you actually tested for this scenario and observed the
> >>effects?
> >>
> >>If there *are* observable risks and/or to preserve back-compatibility,
> >>I guess we could create a fourth overcommit mode which provides the
> >>headroom which you desire.
> >>
> >>Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
> >>as well?
> >The downside of the first patch, which removes the "other" reserve
> >(sorry about the confusing duplicated subject line), is that a user
> >may not be able to kill their process, even if they have a shell prompt.
> >When testing, I did sometimes get into a spot where I attempted to execute
> >kill, but got: "bash: fork: Cannot allocate memory". Of course, a
> >user can get in the same predicament with the current 3% reserve--they
> >just have to start processes until 3% becomes negligible.
> >
> >With just the first patch, root still has a 3% reserve, so they can
> >still log in.
> >
> >When I resubmit the second patch, adding a tunable rootuser_reserve_pages
> >variable, I'll test both guess and never overcommit modes to see what
> >minimum initial values allow root to login and kill a user's memory
> >hogging process. This will be safer than the current behavior since
> >root's reserve will never shrink to something useless in the case where
> >a user has grabbed all available memory with many processes.
> 
> The idea of two patches looks reasonable to me.
> 
> >
> >As an estimate of a useful rootuser_reserve_pages, the rss+share size of
> 
> Sorry for my silly question: do you mean the shared size is not included in the RSS?

For some reason I had it in my head that RSS was just the memory 
private to the process and that I needed to add memory shared for 
libraries. So yeah, it looks like 8MB, or 2000 pages should be 
enough of a reserve.

I'm testing new versions now, where the reserve is min(3%, k) as 
Alan suggested, k being 2000 pages in this case.
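As a rough model of that policy (illustrative userspace code, not the
kernel implementation; the 2000-page constant is simply the value being
tested above):

```c
/* min(3%, k) reserve: proportional on small machines, capped at a fixed
 * page count on large ones.  All sizes are in pages. */
long root_reserve(long free_pages)
{
    long three_percent = free_pages / 32;   /* the historical ~3% */
    long cap = 2000;                        /* k: the value under test */
    return three_percent < cap ? three_percent : cap;
}
```

On a 128GB box (roughly 33.5 million 4KB pages) this yields a 2000-page
(about 8MB) reserve instead of the current reserve of over a million pages.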

> >sshd, bash, and top is about 16MB. Overcommit disabled mode would need
> >closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
> >the new tunable would still be a win.
> >
> >I think the tunable would benefit everyone over the current behavior,
> >but would you prefer it if I only made it tunable in a fourth overcommit
> >mode in order to preserve back-compatibility?

* Re: [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory
  2013-03-01 22:41         ` Andrew Shewmaker
@ 2013-03-02  0:29           ` Ric Mason
  -1 siblings, 0 replies; 14+ messages in thread
From: Ric Mason @ 2013-03-02  0:29 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: Andrew Morton, linux-mm, linux-kernel, Alan Cox

On 03/02/2013 06:41 AM, Andrew Shewmaker wrote:
> On Fri, Mar 01, 2013 at 10:40:43AM +0800, Ric Mason wrote:
>> On 02/28/2013 11:48 AM, Andrew Shewmaker wrote:
>>> On Thu, Feb 28, 2013 at 02:12:00PM -0800, Andrew Morton wrote:
>>>> On Wed, 27 Feb 2013 15:56:30 -0500
>>>> Andrew Shewmaker <agshew@gmail.com> wrote:
>>>>
>>>>> The following patches are against the mmotm git tree as of February 27th.
>>>>>
>>>>> The first patch only affects OVERCOMMIT_NEVER mode, entirely removing
>>>>> the 3% reserve for other user processes.
>>>>>
>>>>> The second patch affects both OVERCOMMIT_GUESS and OVERCOMMIT_NEVER
>>>>> modes, replacing the hardcoded 3% reserve for the root user with a
>>>>> tunable knob.
>>>>>
>>>> Gee, it's been years since anyone thought about the overcommit code.
>>>>
>>>> Documentation/vm/overcommit-accounting says that OVERCOMMIT_ALWAYS is
>>>> "Appropriate for some scientific applications", but doesn't say why.
>>>> You're running a scientific cluster but you're using OVERCOMMIT_NEVER,
>>>> I think?  Is the documentation wrong?
>>> None of my scientists appeared to use sparse arrays as Alan described.
>>> My users would run jobs that appeared to initialize correctly. However,
>>> they wouldn't write to every page they malloced (and they wouldn't use
>>> calloc), so I saw jobs failing well into a computation once the
>>> simulation tried to access a page and the kernel couldn't give it to them.
>>>
>>> I think Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) was
>>> the first cluster I put into OVERCOMMIT_NEVER mode. Jobs with
>>> infeasible memory requirements fail early and the OOM killer
>>> gets triggered much less often than in guess mode. More often than not
>>> the OOM killer seemed to kill the wrong thing causing a subtle brokenness.
>>> Disabling overcommit worked so well during the stabilization and
>>> early user phases that we did the same with other clusters.
>> Do you mean OVERCOMMIT_NEVER is more suitable for scientific
>> applications than OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS? Or should it
>> depend on the workload? Since your users ran jobs that wouldn't
>> write to every page they malloced, why isn't OVERCOMMIT_GUESS
>> more suitable for you?
> It depends on the workload. They eventually wrote to every page,
> but not early in the life of the process, so they thought they
> were fine until the simulation crashed.

Why is overcommit guess not suitable even though they eventually wrote to 
every page? It takes free pages, file pages, available swap pages, and 
reclaimable slab pages into consideration. In other words, all of these 
pages are available, so why is guess mode not suitable? Actually, I'm 
confused about the fundamental difference between overcommit guess and never.
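For context, the two modes differ roughly as follows. This is a
simplified sketch of the policy, not the exact mm/mmap.c logic; in
particular, guess mode's notion of "free" really does include file,
swap, and reclaimable slab pages as noted above.

```c
/* GUESS: each request is checked against memory that looks free right
 * now, so address space that is allocated but never touched is
 * effectively free.
 * NEVER: the full size of every allocation is charged up front against
 * a fixed commit limit, whether or not it is ever touched. */
int guess_allows(long request_pages, long free_pages)
{
    return request_pages <= free_pages;             /* per-request heuristic */
}

int never_allows(long request_pages, long committed_pages, long commit_limit)
{
    return committed_pages + request_pages <= commit_limit;  /* accounting */
}
```

That is why jobs that allocate more than they immediately touch pass
under guess mode and only fail later at page-fault time, while never
mode rejects infeasible requests at allocation time.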

>
>>>>> __vm_enough_memory reserves 3% of free pages with the default
>>>>> overcommit mode and 6% when overcommit is disabled. These hardcoded
>>>>> values have become less reasonable as memory sizes have grown.
>>>>>
>>>>> On scientific clusters, systems are generally dedicated to one user.
>>>>> Also, overcommit is sometimes disabled in order to prevent a long
>>>>> running job from suddenly failing days or weeks into a calculation.
>>>>> In this case, a user wishing to allocate as much memory as possible
>>>>> to one process may be prevented from using, for example, around 7GB
>>>>> out of 128GB.
>>>>>
>>>>> The effect is less, but still significant when a user starts a job
>>>>> with one process per core. I have repeatedly seen a set of processes
>>>>> requesting the same amount of memory fail because one of them could
>>>>> not allocate the amount of memory a user would expect to be able to
>>>>> allocate.
>>>>>
>>>>> ...
>>>>>
>>>>> --- a/mm/mmap.c
>>>>> +++ b/mm/mmap.c
>>>>> @@ -182,11 +182,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>>>>>   		allowed -= allowed / 32;
>>>>>   	allowed += total_swap_pages;
>>>>> -	/* Don't let a single process grow too big:
>>>>> -	   leave 3% of the size of this process for other processes */
>>>>> -	if (mm)
>>>>> -		allowed -= mm->total_vm / 32;
>>>>> -
>>>>>   	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>>>>>   		return 0;
>>>> So what might be the downside for this change?  root can't log in, I
>>>> assume.  Have you actually tested for this scenario and observed the
>>>> effects?
>>>>
>>>> If there *are* observable risks and/or to preserve back-compatibility,
>>>> I guess we could create a fourth overcommit mode which provides the
>>>> headroom which you desire.
>>>>
>>>> Also, should we be looking at removing root's 3% from OVERCOMMIT_GUESS
>>>> as well?
>>> The downside of the first patch, which removes the "other" reserve
>>> (sorry about the confusing duplicated subject line), is that a user
>>> may not be able to kill their process, even if they have a shell prompt.
>>> When testing, I did sometimes get into a spot where I attempted to execute
>>> kill, but got: "bash: fork: Cannot allocate memory". Of course, a
>>> user can get in the same predicament with the current 3% reserve--they
>>> just have to start processes until 3% becomes negligible.
>>>
>>> With just the first patch, root still has a 3% reserve, so they can
>>> still log in.
>>>
>>> When I resubmit the second patch, adding a tunable rootuser_reserve_pages
>>> variable, I'll test both guess and never overcommit modes to see what
>>> minimum initial values allow root to login and kill a user's memory
>>> hogging process. This will be safer than the current behavior since
>>> root's reserve will never shrink to something useless in the case where
>>> a user has grabbed all available memory with many processes.
>> The idea of two patches looks reasonable to me.
>>
>>> As an estimate of a useful rootuser_reserve_pages, the rss+share size of
>> Sorry for my silly question: do you mean the shared size is not included in the RSS?
> For some reason I had it in my head that RSS was just the memory
> private to the process and that I needed to add memory shared for
> libraries. So yeah, it looks like 8MB, or 2000 pages should be
> enough of a reserve.
>
> I'm testing new versions now, where the reserve is min(3%, k) as
> Alan suggested, k being 2000 pages in this case.
>
>>> sshd, bash, and top is about 16MB. Overcommit disabled mode would need
>>> closer to 360MB for the same processes. On a 128GB box 3% is 3.8GB, so
>>> the new tunable would still be a win.
>>>
>>> I think the tunable would benefit everyone over the current behavior,
>>> but would you prefer it if I only made it tunable in a fourth overcommit
>>> mode in order to preserve back-compatibility?

end of thread, other threads:[~2013-03-02  0:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-27 20:56 [RFC PATCH v2 1/2] mm: tuning hardcoded reserved memory Andrew Shewmaker
2013-02-27 20:56 ` Andrew Shewmaker
2013-02-28 22:12 ` Andrew Morton
2013-02-28 22:12   ` Andrew Morton
2013-02-28  3:48   ` Andrew Shewmaker
2013-02-28  3:48     ` Andrew Shewmaker
2013-03-01  2:40     ` Ric Mason
2013-03-01  2:40       ` Ric Mason
2013-03-01 22:41       ` Andrew Shewmaker
2013-03-01 22:41         ` Andrew Shewmaker
2013-03-02  0:29         ` Ric Mason
2013-03-02  0:29           ` Ric Mason
2013-03-01 17:48     ` Alan Cox
2013-03-01 17:48       ` Alan Cox
