* [PATCH 0/8][for -mm] mem_notify v6
@ 2008-02-09 15:19 ` KOSAKI Motohiro
  0 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-09 15:19 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: kosaki.motohiro, Marcelo Tosatti, Daniel Spang, Rik van Riel,
	Andrew Morton, Alan Cox, linux-fsdevel, Pavel Machek, Al Boldi,
	Jon Masters, Zan Lynx

Hi

/dev/mem_notify is a low memory notification device.
It can avoid swapping and OOM by cooperating with user processes.

The Linux Today article is a very nice description (great work by Jake Edge):
http://www.linuxworld.com/news/2008/020508-kernel.html

<quoted>
When memory gets tight, it is quite possible that applications have memory
allocated—often caches for better performance—that they could free.
After all, it is generally better to lose some performance than to face the
consequences of being chosen by the OOM killer.
But, currently, there is no way for a process to know that the kernel is
feeling memory pressure.
The patch provides a way for interested programs to monitor the /dev/mem_notify
 file to be notified if memory starts to run low.
</quoted>
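
Here is a minimal user-space sketch of the intended usage pattern. It assumes
only what the article says above: the device signals POLLIN when the kernel
feels memory pressure. Error handling and the exact read semantics are left
to patch 3/8.

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>

    static void drop_private_caches(void)
    {
            /* application specific: free caches, shrink pools, ... */
    }

    int main(void)
    {
            struct pollfd pfd;

            pfd.fd = open("/dev/mem_notify", O_RDONLY);
            if (pfd.fd < 0) {
                    perror("open /dev/mem_notify");
                    return 1;
            }
            pfd.events = POLLIN;

            for (;;) {
                    /* block until the kernel reports memory pressure;
                     * depending on the device semantics a read() may be
                     * needed before polling again */
                    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                            drop_private_caches();
            }
    }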


You need not be annoyed by OOM any longer :)
Please send any comments!

patch list
       [1/8] introduce poll_wait_exclusive() new API
       [2/8] introduce wake_up_locked_nr() new API
       [3/8] introduce /dev/mem_notify new device (the core of this patch series)
       [4/8] memory_pressure_notify() caller
       [5/8] add new mem_notify field to /proc/zoneinfo
       [6/8] (optional) fix incorrect shrink_zone
       [7/8] ignore very small zones to prevent incorrect low memory notifications
       [8/8] support fasync feature


related discussion:
--------------------------------------------------------------
 LKML OOM notifications requirement discussion
    http://www.gossamer-threads.com/lists/linux/kernel/832802?nohighlight=1#832802
 OOM notifications patch [Marcelo Tosatti]
    http://marc.info/?l=linux-kernel&m=119273914027743&w=2
 mem notifications v3 [Marcelo Tosatti]
    http://marc.info/?l=linux-mm&m=119852828327044&w=2
 Thrashing notification patch  [Daniel Spang]
    http://marc.info/?l=linux-mm&m=119427416315676&w=2
 mem notification v4
    http://marc.info/?l=linux-mm&m=120035840523718&w=2
 mem notification v5
    http://marc.info/?l=linux-mm&m=120114835421602&w=2

Changelog
-------------------------------------------------
 v5 -> v6 (by KOSAKI Motohiro)
   o rebase to 2.6.24-mm1
   o fixed thundering herd guard formula.

 v4 -> v5 (by KOSAKI Motohiro)
   o rebase to 2.6.24-rc8-mm1
   o change display order of /proc/zoneinfo
   o ignore very small zone
   o support fcntl(F_SETFL, FASYNC)
   o fixed some trivial bugs.

 v3 -> v4 (by KOSAKI Motohiro)
   o rebase to 2.6.24-rc6-mm1
   o avoid waking up all waiters.
   o add judgement point to __free_one_page().
   o add zone awareness.

 v2 -> v3 (by Marcelo Tosatti)
   o changes the notification point to happen whenever
     the VM moves an anonymous page to the inactive list.
   o implement notification rate limit.

 v1(oom notify) -> v2 (by Marcelo Tosatti)
   o name change
   o change notification timing from during swap thrashing to
     just before thrashing.
   o also works on swapless devices.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 15:19 ` KOSAKI Motohiro
@ 2008-02-09 16:02   ` Jon Masters
  -1 siblings, 0 replies; 68+ messages in thread
From: Jon Masters @ 2008-02-09 16:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, kosaki.motohiro, Marcelo Tosatti,
	Daniel Spang, Rik van Riel, Andrew Morton, Alan Cox,
	linux-fsdevel, Pavel Machek, Al Boldi, Zan Lynx

Yo,

Interesting patch series (I am being yuppie and reading this thread  
from my iPhone on a treadmill at the gym - so further comments later).  
I think that this is broadly along the lines that I was thinking, but  
this should be an RFC only patch series for now.

Some initial questions:

Where is the netlink interface? Polling an FD is so last century :)

What testing have you done?

Still, it is good to start with some code - eventually we might just  
have a full reservation API created. Rik and I and others have bounced  
ideas around for a while and I hope we can pitch in. I will play with  
these patches later.

Jon.



On Feb 9, 2008, at 10:19, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com 
 > wrote:

> Hi
>
> The /dev/mem_notify is low memory notification device.
> it can avoid swappness and oom by cooperationg with the user process.
>
> the Linux Today article is very nice description. (great works by  
> Jake Edge)
> http://www.linuxworld.com/news/2008/020508-kernel.html
>
> <quoted>
> When memory gets tight, it is quite possible that applications have
> memory allocated—often caches for better performance—that they could
> free.
> After all, it is generally better to lose some performance than to  
> face the
> consequences of being chosen by the OOM killer.
> But, currently, there is no way for a process to know that the  
> kernel is
> feeling memory pressure.
> The patch provides a way for interested programs to monitor the
> /dev/mem_notify file to be notified if memory starts to run low.
> </quoted>
>
>
> You need not be annoyed by OOM any longer :)
> please any comments!
>
> patch list
>       [1/8] introduce poll_wait_exclusive() new API
>       [2/8] introduce wake_up_locked_nr() new API
>       [3/8] introduce /dev/mem_notify new device (the core of this
> patch series)
>       [4/8] memory_pressure_notify() caller
>       [5/8] add new mem_notify field to /proc/zoneinfo
>       [6/8] (optional) fixed incorrect shrink_zone
>       [7/8] ignore very small zone for prevent incorrect low mem  
> notify.
>       [8/8] support fasync feature
>
>
> related discussion:
> --------------------------------------------------------------
> LKML OOM notifications requirement discussion
>    http://www.gossamer-threads.com/lists/linux/kernel/832802?nohighlight=1#832802
> OOM notifications patch [Marcelo Tosatti]
>    http://marc.info/?l=linux-kernel&m=119273914027743&w=2
> mem notifications v3 [Marcelo Tosatti]
>    http://marc.info/?l=linux-mm&m=119852828327044&w=2
> Thrashing notification patch  [Daniel Spang]
>    http://marc.info/?l=linux-mm&m=119427416315676&w=2
> mem notification v4
>    http://marc.info/?l=linux-mm&m=120035840523718&w=2
> mem notification v5
>    http://marc.info/?l=linux-mm&m=120114835421602&w=2
>
> Changelog
> -------------------------------------------------
> v5 -> v6 (by KOSAKI Motohiro)
>   o rebase to 2.6.24-mm1
>   o fixed thundering herd guard formula.
>
> v4 -> v5 (by KOSAKI Motohiro)
>   o rebase to 2.6.24-rc8-mm1
>   o change display order of /proc/zoneinfo
>   o ignore very small zone
>   o support fcntl(F_SETFL, FASYNC)
>   o fixed some trivial bugs.
>
> v3 -> v4 (by KOSAKI Motohiro)
>   o rebase to 2.6.24-rc6-mm1
>   o avoid wake up all.
>   o add judgement point to __free_one_page().
>   o add zone awareness.
>
> v2 -> v3 (by Marcelo Tosatti)
>   o changes the notification point to happen whenever
>     the VM moves an anonymous page to the inactive list.
>   o implement notification rate limit.
>
> v1(oom notify) -> v2 (by Marcelo Tosatti)
>   o name change
>   o notify timing change from just swap thrashing to
>     just before thrashing.
>   o also works with swapless device.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 16:02   ` Jon Masters
@ 2008-02-09 16:33     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-09 16:33 UTC (permalink / raw)
  To: Jon Masters
  Cc: linux-mm, linux-kernel, Marcelo Tosatti, Daniel Spang,
	Rik van Riel, Andrew Morton, Alan Cox, linux-fsdevel,
	Pavel Machek, Al Boldi, Zan Lynx

Hi

> Interesting patch series (I am being yuppie and reading this thread
> from my iPhone on a treadmill at the gym - so further comments later).
> I think that this is broadly along the lines that I was thinking, but
> this should be an RFC only patch series for now.

Sorry, I will fix that in the next post.


> Some initial questions:

Thank you.
Any discussion is welcome.

> Where is the netlink interface? Polling an FD is so last century :)

To be honest, I don't know of anyone who uses netlink for this, or why
they would want to receive low memory notifications via netlink.

poll() is the old way, but it works well enough.

Also, netlink has a weak point: in the end, the netlink philosophy is a
read/write model.

I am afraid that many low-memory messages could queue up in the netlink
buffer under heavy pressure, and that would itself make the memory
pressure worse.


> Still, it is good to start with some code - eventually we might just
> have a full reservation API created. Rik and I and others have bounced
> ideas around for a while and I hope we can pitch in. I will play with
> these patches later.

Great.
Any ideas and discussion are welcome.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 16:33     ` KOSAKI Motohiro
@ 2008-02-09 16:43       ` Rik van Riel
  -1 siblings, 0 replies; 68+ messages in thread
From: Rik van Riel @ 2008-02-09 16:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jon Masters, linux-mm, linux-kernel, Marcelo Tosatti,
	Daniel Spang, Andrew Morton, Alan Cox, linux-fsdevel,
	Pavel Machek, Al Boldi, Zan Lynx

On Sun, 10 Feb 2008 01:33:49 +0900
"KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> wrote:

> > Where is the netlink interface? Polling an FD is so last century :)
> 
> to be honest, I don't know anyone use netlink and why hope receive
> low memory notify by netlink.
> 
> poll() is old way, but it works good enough.

More importantly, all gtk+ programs, as well as most databases and other
system daemons have a poll() loop as their main loop.

A file descriptor fits that main loop perfectly.
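
A rough sketch of that point (illustration only, not part of the patch;
it assumes just that the mem_notify fd becomes readable under memory
pressure), hooking the fd into a glib/gtk+ main loop:

    #include <fcntl.h>
    #include <glib.h>

    static gboolean on_low_memory(GIOChannel *src, GIOCondition cond,
                                  gpointer data)
    {
            /* free application caches here; depending on the device
             * semantics a read() may be needed to clear the event */
            return TRUE;            /* keep the watch installed */
    }

    void watch_mem_notify(void)
    {
            int fd = open("/dev/mem_notify", O_RDONLY);
            GIOChannel *ch = g_io_channel_unix_new(fd);

            g_io_add_watch(ch, G_IO_IN, on_low_memory, NULL);
            /* the existing gtk_main()/g_main_loop_run() loop now
             * also polls the notification fd */
    }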

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 16:43       ` Rik van Riel
@ 2008-02-09 16:49         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-09 16:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jon Masters, linux-mm, linux-kernel, Marcelo Tosatti,
	Daniel Spang, Andrew Morton, Alan Cox, linux-fsdevel,
	Pavel Machek, Al Boldi, Zan Lynx

Hi Rik

> More importantly, all gtk+ programs, as well as most databases and other
> system daemons have a poll() loop as their main loop.

Not only gtk+; maybe all modern GUI programs :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 0/8][for -mm] mem_notify v6, Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 15:19 ` KOSAKI Motohiro
@ 2008-02-11 15:36   ` Jonathan Corbet
  -1 siblings, 0 replies; 68+ messages in thread
From: Jonathan Corbet @ 2008-02-11 15:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Marcelo Tosatti, Daniel Spang, Rik van Riel, Andrew Morton,
	Alan Cox, linux-fsdevel, Pavel Machek, Al Boldi, Jon Masters,
	Zan Lynx, linux-mm, linux-kernel

> the Linux Today article is very nice description. (great works by Jake Edge)
> http://www.linuxworld.com/news/2008/020508-kernel.html

Just for future reference...the above-mentioned article is from LWN,
syndicated onto LinuxWorld.  It has, so far as I know, never been near
Linux Today.

Glad you liked it, though :)

Thanks,

jon

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-11 15:36   ` Jonathan Corbet
@ 2008-02-11 15:46     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-11 15:46 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Marcelo Tosatti, Daniel Spang, Rik van Riel, Andrew Morton,
	Alan Cox, linux-fsdevel, Pavel Machek, Al Boldi, Jon Masters,
	Zan Lynx, linux-mm, linux-kernel

> > the Linux Today article is very nice description. (great works by Jake Edge)
> > http://www.linuxworld.com/news/2008/020508-kernel.html
>
> Just for future reference...the above-mentioned article is from LWN,
> syndicated onto LinuxWorld.  It has, so far as I know, never been near
> Linux Today.
>
> Glad you liked it, though :)

Oops, sorry.
I had a serious misunderstanding ;-)

Sorry again,
and thank you for your helpful message.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 15:19 ` KOSAKI Motohiro
@ 2008-02-17 14:49   ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-17 14:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, kosaki.motohiro, marcelo, daniel.spang,
	riel, akpm, alan, linux-fsdevel, pavel, a1426z, jonathan, zlynx

I just noticed this patchset, kosaki-san.  It looks quite interesting;
my apologies for not commenting earlier.

I see mention somewhere that mem_notify is of particular interest to
embedded systems.

I have what seems, intuitively, a similar problem at the opposite
end of the world, on big-honkin NUMA boxes (hundreds or thousands of
CPUs, terabytes of main memory.)  The problem there is often best
resolved if we can kill the offending task, rather than shrink its
memory footprint.  The situation is that several compute intensive
multi-threaded jobs are running, each in their own dedicated cpuset.

If one of these jobs tries to use more memory than is available in
its cpuset, then

  (1) we quickly lose any hope of that job continuing at the excellent
      performance needed of it, and

  (2) we rapidly get increased risk of that job starting to swap and
      unintentionally impact shared resources (kernel locks, disk
      channels, disk heads).

So we like to identify such jobs as soon as they begin to swap,
and kill them very very quickly (before the direct reclaim code
in mm/vmscan.c can push more than a few pages to the swap device.)

For a much earlier, unsuccessful, attempt to accomplish this, see:

	[Patch] cpusets policy kill no swap
	http://lkml.org/lkml/2005/3/19/148

Now, it may well be that we are too far apart to share any part of
a solution; one seldom uses the same technology to build a Tour de
France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
cargo transport.

One clear difference is the policy of what action we desire to take
when under memory pressure: do we invite user space to free memory so
as to avoid the wrath of the oom killer, or do we go to the opposite
extreme, seeking a nearly instant killing, faster than the oom
killer can even begin its search for a victim.

Another clear difference is the use of cpusets, which are a major and
vital part of administering the big NUMA boxes, and I presume are not
even compiled into embedded kernels (correct?).  This difference maybe
unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
whereas embedded may require builds without cpusets.

However ... there might be some useful cross pollination of ideas.

I see in the latest posts to your mem_notify patchset v6, responding
to comments by Andrew and Andi on Feb 12 and 13, that you decided to
think more about the design of this, so perhaps this is a good time
for some random ideas from myself, even though I'm clearly coming from
a quite different problem space in some ways.

1) You have a little bit of code in the kernel to throttle the
   thundering herd problem.  Perhaps this could be moved to user space
   ... one user daemon that is always notified of such memory pressure
   alarms, and in turn notifies interested applications.  This might
   avoid the need to add poll_wait_exclusive() to the kernel.  And it
   moves any fussy details of how to tame the thundering herd out of
   the kernel.

2) Another possible mechanism for communicating events from
   the kernel to user space is inotify.  For example, I added
   the line:

   	fsnotify_modify(dentry);   # dentry is the current task's cpuset

   at an interesting spot in vmscan.c, and using inotify-tools
   <inotify-tools.sourceforge.net/> could easily watch all cpusets
   for these events from one user space daemon.

   At this point, I have no idea whether this odd use of inotify
   is better or worse than what your patchset has.  However using
   inotify did require less new kernel code, and with such user space
   mechanisms as inotify-tools already well developed, it made the
   problem I had, of watching an entire hierarchy of special files
   (beneath /dev/cpuset) very easy to implement.  At least inotify
   also presents events on a file descriptor that can be consumed
   using a poll() loop.  (A user-space sketch of this appears below,
   after idea 5.)

3) Perhaps, instead of sending simple events, one could update
   a meter of the rate of recent such events, such as the per-cpuset
   'memory_pressure' mechanism does.  This might lead to addressing
   Andrew Morton's comment:

	If this feature is useful then I'd expect that some
	applications would want notification at different times, or at
	different levels of VM distress.  So this semi-randomly-chosen
	notification point just won't be strong enough in real-world
	use.

4) A place that I found well suited for my purposes (watching for
   swapping from direct reclaim) was just before the lines in the
   pageout() routine in mm/vmscan.c:

   	if (clear_page_dirty_for_io(page)) {
		...
		res = mapping->a_ops->writepage(page, &wbc);

   It seemed that testing "PageAnon(page)" here allowed me to easily
   distinguish between dirty pages going back to the file system, and
   pages going to swap (this detail is from work on a 2.6.16 kernel;
   things might have changed.)

   One possible advantage of the above hook in the direct reclaim
   code path in vmscan.c is that pressure in one cpuset did not cause
   any false alarms in other cpusets.  However even this hook does
   not take into account the constraints of mm/mempolicy (the NUMA
   memory policy that Andi mentioned) nor of cgroup memory controllers.
   (A sketch of this hook also appears below, after idea 5.)

5) I'd be keen to find an agreeable way that you could have the
   system-wide, no cpuset, mechanism you need, while at the same
   time, I have a cpuset interface that is similar and depends on the
   same set of hooks.  This might involve a single set of hooks in
   the key places in the memory and swapping code, that (1) updated
   the system wide state you need, and (2) if cpusets were present,
   updated similar state for the task's current cpuset.  The user
   visible API would present both the system-wide connector you need
   (the special file or whatever) and if cpusets are present, similar
   per-cpuset connectors.
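
As an illustration of idea (2), a rough user-space sketch of the single
watching daemon.  The cpuset path and the reaction are hypothetical; only
the inotify calls themselves are standard:

    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            int fd = inotify_init();

            /* hypothetical path: assumes cpusets are mounted at
             * /dev/cpuset and the kernel emits fsnotify_modify()
             * on the job's cpuset, as in idea (2) above */
            inotify_add_watch(fd, "/dev/cpuset/jobs/job1", IN_MODIFY);

            while (read(fd, buf, sizeof(buf)) > 0) {
                    /* memory pressure event for that cpuset: notify
                     * interested applications, or kill the job */
            }
            return 0;
    }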

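And a rough sketch of the idea (4) hook, against a 2.6-era mm/vmscan.c
pageout().  cpuset_swap_event() is a hypothetical helper named only for
illustration; it is not an existing kernel function:

    if (clear_page_dirty_for_io(page)) {
            /* a dirty anonymous page reaching this point is headed
             * for the swap device, not for a filesystem */
            if (PageAnon(page))
                    cpuset_swap_event(page);        /* hypothetical hook */

            res = mapping->a_ops->writepage(page, &wbc);
            /* ... the rest of pageout() continues unchanged ... */
    }
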
Anyhow ... just some thoughts.  Perhaps one of them will be useful.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-17 14:49   ` Paul Jackson
@ 2008-02-19  7:36     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-19  7:36 UTC (permalink / raw)
  To: Paul Jackson
  Cc: kosaki.motohiro, linux-mm, linux-kernel, marcelo, daniel.spang,
	riel, akpm, alan, linux-fsdevel, pavel, a1426z, jonathan, zlynx

Hi Paul,

Thank you for the wonderfully interesting comments.
Your comments are really helpful.

I was an HPC guy with large NUMA boxes in the past.
I promise I won't ignore HPC users.
Unfortunately, I had no experience with CPUSET back then,
because at that point it was still under development.

I hope to discuss CPUSET use cases and mem_notify requirements with you.
To be honest, I thought HPC users wouldn't use mem_notify, sorry.


> I have what seems, intuitively, a similar problem at the opposite
> end of the world, on big-honkin NUMA boxes (hundreds or thousands of
> CPUs, terabytes of main memory.)  The problem there is often best
> resolved if we can kill the offending task, rather than shrink its
> memory footprint.  The situation is that several compute intensive
> multi-threaded jobs are running, each in their own dedicated cpuset.

agreed.

> So we like to identify such jobs as soon as they begin to swap,
> and kill them very very quickly (before the direct reclaim code
> in mm/vmscan.c can push more than a few pages to the swap device.)

You want to kill the process just after it starts to swap, right?
But unfortunately, most users hope to receive the notification before
swapping ;-) so that they can avoid swap.

I think we need to discuss this point more.


> For a much earlier, unsuccessful, attempt to accomplish this, see:
> 
> 	[Patch] cpusets policy kill no swap
> 	http://lkml.org/lkml/2005/3/19/148
> 
> Now, it may well be that we are too far apart to share any part of
> a solution; one seldom uses the same technology to build a Tour de
> France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
> cargo transport.
> 
> One clear difference is the policy of what action we desire to take
> when under memory pressure: do we invite user space to free memory so
> as to avoid the wrath of the oom killer, or do we go to the opposite
> extreme, seeking a nearly instantant killing, faster than the oom
> killer can even begin its search for a victim.

Hmm, sorry,
I don't understand your patch yet, because I don't know CPUSET very well.

I will study CPUSET more this week and reply again next week ;-)


> Another clear difference is the use of cpusets, which are a major and
> vital part of administering the big NUMA boxes, and I presume are not
> even compiled into embedded kernels (correct?).  This difference maybe
> unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
> whereas embedded may require builds without cpusets.

Yes, some embedded distributions (e.g. MontaVista) are distributed as source,
but embedded people strongly dislike code size bloat.
I think they would never turn on CPUSET.

I hope mem_notify works fine without CPUSET.


> 1) You have a little bit of code in the kernel to throttle the
>    thundering herd problem.  Perhaps this could be moved to user space
>    ... one user daemon that is always notified of such memory pressure
>    alarms, and in turn notifies interested applications.  This might
>    avoid the need to add poll_wait_exclusive() to the kernel.  And it
>    moves any fussy details of how to tame the thundering herd out of
>    the kernel.

I think you are talking about a user space OOM manager.
That and notifying many user processes directly are obviously different.

I doubt the memory manager daemon model works on desktops and
typical servers.
Thus, the current implementation is optimized for the no-manager environment.

Of course, that doesn't mean I refuse to add code for an OOM manager.
It is a very interesting idea.

I hope we can discuss it more.


> 2) Another possible mechanism for communicating events from
>    the kernel to user space is inotify.  For example, I added
>    the line:
> 
>    	fsnotify_modify(dentry);   # dentry is current tasks cpuset

Excellent!
That is a really good idea.

Thanks.


> 3) Perhaps, instead of sending simple events, one could update
>    a meter of the rate of recent such events, such as the per-cpuset
>    'memory_pressure' mechanism does.  This might lead to addressing
>    Andrew Morton's comment:
> 
> 	If this feature is useful then I'd expect that some
> 	applications would want notification at different times, or at
> 	different levels of VM distress.  So this semi-randomly-chosen
> 	notification point just won't be strong enough in real-world
> 	use.

Hmmm, I don't think so.
I think the timing of memory_pressure_notify(1) is already the best.

A page moving from the active list to the inactive list indicates that
swap I/O will happen a bit later.

But memory_pressure_notify(0) is a bit messy.
I'll try to simplify it further.


> 4) A place that I found well suited for my purposes (watching for
>    swapping from direct reclaim) was just before the lines in the
>    pageout() routine in mm/vmscan.c:
> 
>    	if (clear_page_dirty_for_io(page)) {
> 		...
> 		res = mapping->a_ops->writepage(page, &wbc);
> 
>    It seemed that testing "PageAnon(page)" here allowed me to easily
>    distinguish between dirty pages going back to the file system, and
>    pages going to swap (this detail is from work on a 2.6.16 kernel;
>    things might have changed.)
> 
>    One possible advantage of the above hook in the direct reclaim
>    code path in vmscan.c is that pressure in one cpuset did not cause
>    any false alarms in other cpusets.  However even this hook does
>    not take into account the constraints of mm/mempolicy (the NUMA
>    memory policy that Andi mentioned) nor of cgroup memory controllers.

Disagreed.
That is too late.

Notifying after writepage means the swap I/O can no longer be avoided.


> 5) I'd be keen to find an agreeable way that you could have the
>    system-wide, no cpuset, mechanism you need, while at the same
>    time, I have a cpuset interface that is similar and depends on the
>    same set of hooks.  This might involve a single set of hooks in
>    the key places in the memory and swapping code, that (1) updated
>    the system wide state you need, and (2) if cpusets were present,
>    updated similar state for the tasks current cpuset.  The user
>    visible API would present both the system-wide connector you need
>    (the special file or whatever) and if cpusets are present, similar
>    per-cpuset connectors.

That makes sense.
I will study cpuset and think about integrating mem_notify and cpuset.


And,

please don't think I am rejecting your ideas.
Your proposal is quite different from our past discussion, and
I don't know cpuset well.

I think we probably can't drop the whole current design and accept your
ideas wholesale, but we may be able to accept them partially, until the
HPC guys are content enough.

I will study CPUSET more over the next few days.
After that, we can discuss more.

Please wait for a while.

Thanks!




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19  7:36     ` KOSAKI Motohiro
@ 2008-02-19 15:00       ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-19 15:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, marcelo, daniel.spang, riel, akpm, alan,
	linux-fsdevel, pavel, a1426z, jonathan, zlynx

Kosaki-san wrote:
> Thank you for wonderful interestings comment.

You're most welcome.  The pleasure is all mine.

> you think kill the process just after swap, right?
> but unfortunately, almost user hope receive notification before swap ;-)
> because avoid swap.

There is not much my customers' HPC jobs can do with notification before
swap.  Their jobs either have the main memory they need to perform the
requested calculations with the desired performance, or their job is
useless and should be killed.  Unlike the applications you describe,
my customers' jobs have no way, once running, to adapt to less memory.
They can only adapt to less memory by being restarted with a different
set of resource requests to the job scheduler (the application that
manages job requests, assigns them CPU, memory and other resources,
and monitors, starts, stops and pauses jobs.)

The primary difficulty my HPC customers have is killing such jobs fast
enough, before a bad job (one that attempts to use more memory than it
signed up for) can harm the performance of other users and the rest of
the system.

I don't mind if pages are slowly or occasionally written to swap;
but as soon as the task wants to reclaim big chunks of memory by
writing thousands of pages at once to swap, it must die, and die
before it can queue more than a handful of those pages to the swapper.

> but embedded people strongly dislike bloat code size.
> I think they never turn on CPUSET.
> 
> I hope mem_notify works fine without CPUSET.

Yes - understood and agreed - as I guessed, cpusets are not configured
in embedded systems.

> Please don't think I reject your idea.
> your proposal is large different of past our discussion

Yes - I agree that my ideas were quite different.  Please don't
hesitate to reject every one of them, like a Samurai slicing through
air with his favorite sword <grin>.

> Disagreed. that [my direct reclaim hook at mapping->a_ops->writepage()]
> is too late.

For your work, yes that hook is too late.  Agreed.

Depending on what we're trying to do:
 1) warn applications of swap coming soon (your case),
 2) show how close we are to swapping,
 3) show how much swap has happened already,
 4) kill instantly if try to swap (my hpc case),
 5) measure file i/o caused by memory pressure, or
 6) perhaps other goals,
we will need to hook different places in the kernel.

It may well be that your hooks for embedded are simply in different
places than my hooks for HPC.  If so, that's fine.

I look forward to your further thoughts.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 15:00       ` Paul Jackson
@ 2008-02-19 19:02         ` Rik van Riel
  -1 siblings, 0 replies; 68+ messages in thread
From: Rik van Riel @ 2008-02-19 19:02 UTC (permalink / raw)
  To: Paul Jackson
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, marcelo, daniel.spang,
	akpm, alan, linux-fsdevel, pavel, a1426z, jonathan, zlynx

On Tue, 19 Feb 2008 09:00:08 -0600
Paul Jackson <pj@sgi.com> wrote:

> Depending on what we're trying to do:
>  1) warn applications of swap coming soon (your case),
>  2) show how close we are to swapping,
>  3) show how much swap has happened already,
>  4) kill instantly if try to swap (my hpc case),
>  5) measure file i/o caused by memory pressure, or
>  6) perhaps other goals,
> we will need to hook different places in the kernel.
> 
> It may well be that your hooks for embedded are simply in different
> places than my hooks for HPC.  If so, that's fine.

Don't forget the "hooks for desktop" :)

Basically in all situations, the kernel needs to warn at the same point
in time: when the system is about to run out of RAM for anonymous pages.

In the desktop case, that leads to swapping (and programs can free memory).

In the embedded case, it leads to OOM (and a management program can kill or
restart something else, or a program can restart itself).

In the HPC case, it leads to swapping (and a management program can kill or
restart something else).

I do not see the kernel side being different between these situations, only
userspace reacts differently in the different scenarios.

Am I overlooking something?

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 19:02         ` Rik van Riel
@ 2008-02-19 20:18           ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-19 20:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, marcelo, daniel.spang,
	akpm, alan, linux-fsdevel, pavel, a1426z, jonathan, zlynx

Rik wrote:
> Basically in all situations, the kernel needs to warn at the same point
> in time: when the system is about to run out of RAM for anonymous pages.
>
> ...
> 
> In the HPC case, it leads to swapping (and a management program can kill or
> restart something else).

Thanks for stopping by ...

Perhaps with the cgroup based memory controller in progress, or with
other work I'm overlooking, this is or will no longer be a problem,
but on 2.6.16 kernels (the latest ones I have in major production HPC
use) this is not sufficient.

As of at least that point, we don't (didn't ?) have sufficiently
accurate numbers of when we were "about to run out".  We can only
detect when "we just did run out", as evidenced by entering the direct
reclaim code, or by slightly later events such as starting to push
Anon pages to the swap device from direct reclaim.

Actually, even the point that we enter direct reclaim, near the bottom
of __alloc_pages(), isn't adequate either, as we could be there because
some thread in that cpuset is trying to write out a results file that
is larger than that cpuset's memory.  In that case, we really don't want
to kill the job ... it just needs to be (and routinely is) throttled
back to disk speeds as it completes the write out of dirty file system
pages.

So the first clear spot that we -know- serious swapping is commencing
is where the direct reclaim code calls a writepage op with an Anon
page. At that point, having a management program intervene is entirely
too late.  Even having the task at that instant, inline, tag itself
with a SIGKILL, as it queues that first Anon page to a swap device, is
too late.  The direct reclaim code can loop, pushing hundreds or
thousands of pages, on big memory systems, to the swapper, in the
current reclaim loop, before it pops the stack far enough back to even
notice that it has a SIGKILL pending on it.  The suppression of pushing
pages to the swapper has to happen right there, inline in some
mm/vmscan.c code, as part of the direct reclaim loops.

(Hopefully I said something stupid in that last paragraph, and you will
be able to correct it ... it sure would be useful ;).

A year or two ago, I added the 'memory_pressure' per-cpuset meter to
Linux, in an effort to realize just what you suggest, Rik.  My colleagues
at SGI (mostly) and myself (a little) have proven to ourselves that this
doesn't work, for our HPC needs, for two reasons:

 1) once swapping begins, issuing a SIGKILL, no matter how instantly,
    is too late, as explained above, and

 2) that memory_pressure combines and confuses memory pressure due to
    dirty file system buffers filling memory, with memory pressure due
    to anonymous swappable pages filling memory, also as explained above.
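
(For readers who haven't used that meter: each cpuset directory exposes
a memory_pressure file holding a single, decaying rate value, so
sampling it is a one-line read.  A trivial sketch, assuming cpusets are
mounted at /dev/cpuset, the cpuset is named "myset", and
memory_pressure_enabled has been turned on in the top cpuset:)

#include <stdio.h>

int main(void)
{
    int rate;
    FILE *f = fopen("/dev/cpuset/myset/memory_pressure", "r");

    if (!f) {
        perror("open memory_pressure");
        return 1;
    }
    if (fscanf(f, "%d", &rate) == 1)
        /* recent rate of entries into direct reclaim by tasks
           in this cpuset */
        printf("memory_pressure = %d\n", rate);
    fclose(f);
    return 0;
}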

I do have a patch in my playarea that adds two more of the
memory_pressure meters, one for swapouts, and one for flushing dirty
file system buffers, both hooking into the spot in the vmscan reclaim
code where the writepage op is called.  This patch ~might~ address the
desktop need here.  It nicely generates two clean, sharp indicators
that we're getting throttled by direct reclaim of dirty file buffers,
and that we're starting to reclaim anon pages to the swappers.  Of course
for embedded use, I'd have to adapt it to a non-cpuset based mechanism
(not difficult), as embedded definitely doesn't do cpusets.

> Don't forget the "hooks for desktop" :)

I'll agree (perhaps out of ignorance) that the desktop (and normal sized
server) cases are like the embedded case ... in that they need to
distribute some event to user space tasks that want to know that memory
is short, so that that user space code can do what it will (reclaim
some user space memory or restart or kill or throttle something?)

However, I'm still stuck carrying a patch out of the community kernel,
to get my HPC customers the "instant kill on direct reclaim swap" they
need, as this still seems to be the special case.  Which is rather
unfortunate, from the business perspective of my good employer, as it is
the -only- out-of-mainline patch, so far as I know, that we have been
having to carry, continuously, for several years now.  But for that
single long standing issue (and now and then various more short term
issues), a vanilla distribution kernel, using the vanilla distribution
config and build, runs production on our one or two thousand CPU,
several terabyte big honkin NUMA boxes.

Part of my motivation for engaging Kosaki-san in this discussion was
to reinvigorate this discussion, as it's getting to be time I took
another shot at getting something in the community kernel that addresses
this.  The more overlap the HPC fix here is with the other 99.978% of
the worlds Linux systems that are desktop, laptop, (ordinary sized)
server or embedded, the better my chances (he said hopefully.)

(Now if I could just get us to consider systems in proportion to
how much power & cooling they need, rather than in proportion to
unit sales ... ;)

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 20:18           ` Paul Jackson
@ 2008-02-19 20:43             ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-19 20:43 UTC (permalink / raw)
  To: Paul Jackson
  Cc: riel, kosaki.motohiro, linux-mm, linux-kernel, marcelo,
	daniel.spang, akpm, alan, linux-fsdevel, pavel, a1426z, jonathan,
	zlynx

pj, talking to himself:
> Of course
> for embedded use, I'd have to adapt it to a non-cpuset based mechanism
> (not difficult), as embedded definitely doesn't do cpusets.

I'm forgetting an important detail here.  Kosaki-san has clearly stated
that this hook, at vmscan's writepage, is too late for his embedded needs,
and that they need the feedback a bit earlier, when the page moves from
the active list to the inactive list.

However, except for the placement of such hooks in three or four
places, rather than just one, it may well be (if cpusets could be
factored out) that one mechanism would meet all needs ... except for
that pesky HPC need for throttling to more or less zero the swapping
from select cpusets.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 15:00       ` Paul Jackson
@ 2008-02-19 22:28         ` Pavel Machek
  -1 siblings, 0 replies; 68+ messages in thread
From: Pavel Machek @ 2008-02-19 22:28 UTC (permalink / raw)
  To: Paul Jackson
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, marcelo, daniel.spang,
	riel, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

On Tue 2008-02-19 09:00:08, Paul Jackson wrote:
> Kosaki-san wrote:
> > Thank you for wonderful interestings comment.
> 
> You're most welcome.  The pleasure is all mine.
> 
> > you think kill the process just after swap, right?
> > but unfortunately, almost user hope receive notification before swap ;-)
> > because avoid swap.
> 
> There is not much my customers HPC jobs can do with notification before
> swap.  Their jobs either have the main memory they need to perform the
> requested calculations with the desired performance, or their job is
> useless and should be killed.  Unlike the applications you describe,
> my customers jobs have no way, once running, to adapt to less
> memory.

Sounds like a job for memory limits (ulimit?), not for OOM
notification, right?
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 22:28         ` Pavel Machek
@ 2008-02-20  1:54           ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-20  1:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kosaki.motohiro, linux-mm, linux-kernel, marcelo, daniel.spang,
	riel, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

Pavel, responding to pj:
> > There is not much my customers HPC jobs can do with notification before
> > swap.  Their jobs either have the main memory they need to perform the
> > requested calculations with the desired performance, or their job is
> > useless and should be killed.  Unlike the applications you describe,
> > my customers jobs have no way, once running, to adapt to less
> > memory.
> 
> Sounds like a job for memory limits (ulimit?), not for OOM
> notification, right?

Er eh -- which one?

The only one I see that might help keep a multi-threaded job
using various kinds of memory on multiple nodes confined could
be the resident set size (RLIMIT_RSS; ulimit -m).  So far as
I can tell, that one is a pure no-op in Linux.

Here's the bash list of all available ulimit (setrlimit) options:

              -a     All current limits are reported
              -c     The maximum size of core files created
              -d     The maximum size of a process's data segment
              -e     The maximum scheduling priority ("nice")
              -f     The maximum size of files written by the shell and its children
              -i     The maximum number of pending signals
              -l     The maximum size that may be locked into memory
              -m     The maximum resident set size
              -n     The maximum number of open file descriptors (most systems do not allow this value to be set)
              -p     The pipe size in 512-byte blocks (this may not be set)
              -q     The maximum number of bytes in POSIX message queues
              -r     The maximum real-time scheduling priority
              -s     The maximum stack size
              -t     The maximum amount of cpu time in seconds
              -u     The maximum number of processes available to a single user
              -v     The maximum amount of virtual memory available to the shell
              -x     The maximum number of file locks

Did I miss seeing one that would be useful?

Actually, given the chronic problem we've had over the years accounting
for how much memory a job uses in total (including text, data, stack,
mapped files, locked pages, kernel memory structures that an application
is using many of, ...), I'd be surprised if any such ulimit existed that
actually worked for this purpose (confining an HPC job to using almost
exactly all the memory available to it, but no more.)
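
(For comparison only: the address space limit -- ulimit -v, RLIMIT_AS --
is the one in that list the kernel does enforce per process.  A minimal
sketch of setting it programmatically; the 512 MB figure is just an
example, and as the comment notes it doesn't solve the job-wide
accounting problem above:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* 512 MB address space cap; the number is only an example */
    struct rlimit rl = { 512UL << 20, 512UL << 20 };

    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    /* From here on, allocations that would push this process past
       ~512 MB of virtual address space fail with ENOMEM instead of
       driving the machine toward swap.  Note this is per process and
       counts address space, not the job-wide resident memory an HPC
       scheduler needs to account for. */
    return 0;
}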

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-19 22:28         ` Pavel Machek
@ 2008-02-20  2:07           ` Rik van Riel
  -1 siblings, 0 replies; 68+ messages in thread
From: Rik van Riel @ 2008-02-20  2:07 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Paul Jackson, KOSAKI Motohiro, linux-mm, linux-kernel, marcelo,
	daniel.spang, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

On Tue, 19 Feb 2008 23:28:28 +0100
Pavel Machek <pavel@ucw.cz> wrote:

> Sounds like a job for memory limits (ulimit?), not for OOM
> notification, right?

I suspect one problem could be that an HPC job scheduling program
does not know exactly how much memory each job can take, so it can
sometimes end up making a mistake and overcommitting the memory on
one HPC node.

In that case the user is better off having that job killed and
restarted elsewhere, than having all of the jobs on that node
crawl to a halt due to swapping.

Paul, is this guess correct? :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-20  2:07           ` Rik van Riel
@ 2008-02-20  2:48             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-20  2:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Pavel Machek, Paul Jackson, linux-mm,
	linux-kernel, marcelo, daniel.spang, akpm, alan, linux-fsdevel,
	a1426z, jonathan, zlynx

Hi Rik

> > Sounds like a job for memory limits (ulimit?), not for OOM
> > notification, right?
> 
> I suspect one problem could be that an HPC job scheduling program
> does not know exactly how much memory each job can take, so it can
> sometimes end up making a mistake and overcommitting the memory on
> one HPC node.
> 
> In that case the user is better off having that job killed and
> restarted elsewhere, than having all of the jobs on that node
> crawl to a halt due to swapping.
> 
> Paul, is this guess correct? :)

Yes.
The Fujitsu HPC middleware watches the total memory consumption of the job
and, if over-consumption happens, kills the process and removes the job
from the schedule.

I think that is a common HPC requirement,
but we watch a user-defined memory limit, not swap.

Thanks.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-20  2:07           ` Rik van Riel
@ 2008-02-20  4:36             ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-20  4:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: pavel, kosaki.motohiro, linux-mm, linux-kernel, marcelo,
	daniel.spang, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

Rik wrote:
> In that case the user is better off having that job killed and
> restarted elsewhere, than having all of the jobs on that node
> crawl to a halt due to swapping.
> 
> Paul, is this guess correct? :)

Not for the loads I focus on.  Each job gets exclusive use of its own
dedicated set of nodes, for the duration of the job.  With that comes a
quite specific upper limit on how much memory, in total, including node
local kernel data, that job is allowed to use.

One problem with swapping is that nodes aren't entirely isolated.
They share buses, i/o channels, disk arms, kernel data cache lines and
kernel locks with other nodes, running other jobs.   A job thrashing
its swap is a drag on the rest of the system.

Another problem with swapping is that it's a waste of resources.  Once
a pure compute bound job goes into swapping when it shouldn't, that job
has near zero hope of continuing with the intended performance, as it
has just slowed from main memory speeds to disk speeds, which are
thousands of times slower.  Best to get it out of there, immediately.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-20  2:48             ` KOSAKI Motohiro
@ 2008-02-20  4:57               ` Paul Jackson
  -1 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2008-02-20  4:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: riel, kosaki.motohiro, pavel, linux-mm, linux-kernel, marcelo,
	daniel.spang, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

Kosaki-san wrote:
> Yes.
> Fujitsu HPC middleware watching sum of memory consumption of the job
> and, if over-consumption happened, kill process and remove job schedule.

Did those jobs share nodes -- sometimes two or more jobs using the same
nodes?  I am sure SGI has such users too, though such job mixes make
the runtimes of specific jobs less obvious, so customers are more
tolerant of variations and some inefficiencies, as they get hidden in
the mix.

In other words, Rik, both yes and no ;).  Both sorts of HPC loads
exist, sharing nodes and a dedicated set of nodes for each job.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-20  4:57               ` Paul Jackson
@ 2008-02-20  5:21                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-02-20  5:21 UTC (permalink / raw)
  To: Paul Jackson
  Cc: kosaki.motohiro, riel, pavel, linux-mm, linux-kernel, marcelo,
	daniel.spang, akpm, alan, linux-fsdevel, a1426z, jonathan, zlynx

> Did those jobs share nodes -- sometimes two or more jobs using the same
> nodes?  I am sure SGI has such users too, though such job mixes make
> the runtimes of specific jobs less obvious, so customers are more
> tolerant of variations and some inefficiencies, as they get hidden in
> the mix.

Hm,
our dedicated-node users set the memory limit to the machine's physical
memory size (minus a bit).

So I don't think the shared vs. dedicated distinction, or watching a
user-defined limit vs. watching swap, makes much difference here.
Am I misunderstanding?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-02-09 15:19 ` KOSAKI Motohiro
@ 2008-04-01 23:35   ` Tom May
  -1 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-04-01 23:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: KOSAKI Motohiro, linux-fsdevel

On Sat, Feb 9, 2008 at 8:19 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>  The /dev/mem_notify is low memory notification device.
>  it can avoid swappness and oom by cooperationg with the user process.
>
>  the Linux Today article is very nice description. (great works by Jake Edge)
>  http://www.linuxworld.com/news/2008/020508-kernel.html
>
>  <quoted>
>  When memory gets tight, it is quite possible that applications have memory
>  allocated—often caches for better performance—that they could free.
>  After all, it is generally better to lose some performance than to face the
>  consequences of being chosen by the OOM killer.
>  But, currently, there is no way for a process to know that the kernel is
>  feeling memory pressure.
>  The patch provides a way for interested programs to monitor the /dev/mem_notify
>   file to be notified if memory starts to run low.
>  </quoted>
>
>
>  You need not be annoyed by OOM any longer :)
>  please any comments!

Thanks for this patch set!  I ported it to 2.6.23.9 and tried it, on a
system with no swap since I'm evaluating this for an embedded system.
In practice, the criterion it uses for notifications wasn't sufficient to avoid
memory problems, including OOM, in a cyclic allocate/notify/free
sequence which is probably typical.

I tried it with a real-world program that, among other things, mmaps
anonymous pages and touches them at a reasonable speed until it gets
notified via /dev/mem_notify, releases most of them with
madvise(MADV_DONTNEED), then loops to start the cycle again.

What tends to happen is that I do indeed get notifications via
/dev/mem_notify when the kernel would like to be swapping, at which
point I free memory.  But the notifications come at a time when the
kernel needs memory, and it gets the memory by discarding some Cached
or Mapped memory (I can see these decreasing in /proc/meminfo with
each notification).  With each mmap/notify/madvise cycle the Cached
and Mapped memory gets smaller, until eventually while I'm touching
pages the kernel can't find enough memory and will either invoke the
OOM killer or return ENOMEM from syscalls.  This is precisely the
situation I'm trying to avoid by using /dev/mem_notify.

The criterion of "notify when the kernel would like to swap" feels
correct, but in addition I seem to need something like "notify when
cached+mapped+free memory is getting low".

I'll need to be looking into doing this, so any comments or ideas are
welcome.
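
A rough user space approximation of that second criterion, for
illustration: sample /proc/meminfo and treat MemFree + Cached + Mapped
dropping below some threshold as "low".  Field names are as in 2.6
kernels; the 16 MB threshold is made up:

#include <stdio.h>

/* returns MemFree + Cached + Mapped in kB, or -1 on error */
static long reclaimable_kb(void)
{
    char line[128];
    long val, total = 0;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "MemFree: %ld kB", &val) == 1 ||
            sscanf(line, "Cached: %ld kB", &val) == 1 ||
            sscanf(line, "Mapped: %ld kB", &val) == 1)
            total += val;
    }
    fclose(f);
    return total;
}

int main(void)
{
    long kb = reclaimable_kb();

    /* 16384 kB is an arbitrary example threshold */
    if (kb >= 0 && kb < 16384)
        printf("low: only %ld kB free+cached+mapped\n", kb);
    return 0;
}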

Thanks,
.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-01 23:35   ` Tom May
  (?)
@ 2008-04-02  7:31   ` KOSAKI Motohiro
  2008-04-02 17:45     ` Tom May
  2008-04-15  0:16       ` Tom May
  -1 siblings, 2 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-04-02  7:31 UTC (permalink / raw)
  To: Tom May; +Cc: kosaki.motohiro, linux-mm, linux-kernel

Hi Tom,

Thank you for the very useful comment.
That is very interesting.

> I tried it with a real-world program that, among other things, mmaps
> anonymous pages and touches them at a reasonable speed until it gets
> notified via /dev/mem_notify, releases most of them with
> madvise(MADV_DONTNEED), then loops to start the cycle again.
>
> What tends to happen is that I do indeed get notifications via
> /dev/mem_notify when the kernel would like to be swapping, at which
> point I free memory.  But the notifications come at a time when the
> kernel needs memory, and it gets the memory by discarding some Cached
> or Mapped memory (I can see these decreasing in /proc/meminfo with
> each notification).  With each mmap/notify/madvise cycle the Cached
> and Mapped memory gets smaller, until eventually while I'm touching
> pages the kernel can't find enough memory and will either invoke the
> OOM killer or return ENOMEM from syscalls.  This is precisely the
> situation I'm trying to avoid by using /dev/mem_notify.

Could you send your test program?
I can't reproduce that now, sorry.


> The criterion of "notify when the kernel would like to swap" feels
> correct, but in addition I seem to need something like "notify when
> cached+mapped+free memory is getting low".

Hmmm,
I think this idea is only useful when the userland process calls
madvise(MADV_DONTNEED) periodically.

But I hope to improve my patch and solve your problem.
If you don't mind, please help with my testing ;)




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-02  7:31   ` KOSAKI Motohiro
@ 2008-04-02 17:45     ` Tom May
  2008-04-15  0:16       ` Tom May
  1 sibling, 0 replies; 68+ messages in thread
From: Tom May @ 2008-04-02 17:45 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel

On Wed, Apr 2, 2008 at 12:31 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Tom,
>
>  Thank you very useful comment.
>  that is very interesting.
>
>
>  > I tried it with a real-world program that, among other things, mmaps
>  > anonymous pages and touches them at a reasonable speed until it gets
>  > notified via /dev/mem_notify, releases most of them with
>  > madvise(MADV_DONTNEED), then loops to start the cycle again.
>  >
>  > What tends to happen is that I do indeed get notifications via
>  > /dev/mem_notify when the kernel would like to be swapping, at which
>  > point I free memory.  But the notifications come at a time when the
>  > kernel needs memory, and it gets the memory by discarding some Cached
>  > or Mapped memory (I can see these decreasing in /proc/meminfo with
>  > each notification).  With each mmap/notify/madvise cycle the Cached
>  > and Mapped memory gets smaller, until eventually while I'm touching
>  > pages the kernel can't find enough memory and will either invoke the
>  > OOM killer or return ENOMEM from syscalls.  This is precisely the
>  > situation I'm trying to avoid by using /dev/mem_notify.
>
>  Could you send your test program?

Unfortunately, no, it's a Java Virtual Machine (which is a perfect
user of /dev/mem_notify since it can garbage collect on notification,
among other times).

But it should be possible to make a small program with the same
behavior; I'll do that.

>  I can't reproduce that now, sorry.
>
>
>
>  > The criterion of "notify when the kernel would like to swap" feels
>  > correct, but in addition I seem to need something like "notify when
>  > cached+mapped+free memory is getting low".
>
>  Hmmm,
>  I think this idea is only useful when userland process call
>  madvise(MADV_DONTNEED) periodically.

Do you have a recommendation for freeing memory?  I could maybe use
munmap/mmap, but that's not atomic and may be "worse" (more overhead,
etc.) than madvise(MADV_DONTNEED).
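
(For reference, madvise(MADV_DONTNEED) on an anonymous mapping hands the
pages straight back to the kernel while keeping the address range
valid, which is why it is attractive here.  A minimal sketch; the sizes
are arbitrary:)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (64 * 4096)

int main(void)
{
    char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 1, LEN);                  /* touch: pages become resident */
    if (madvise(p, LEN, MADV_DONTNEED)) /* hand them back to the kernel */
        perror("madvise");
    /* the mapping stays valid; touching p again faults in fresh
       zero-filled pages rather than the old contents */
    return 0;
}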

>  but I hope improve my patch and solve your problem.
>  if you don' mind, please help my testing ;)

It's my pleasure to help in any way I can.

.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-02  7:31   ` KOSAKI Motohiro
@ 2008-04-15  0:16       ` Tom May
  2008-04-15  0:16       ` Tom May
  1 sibling, 0 replies; 68+ messages in thread
From: Tom May @ 2008-04-15  0:16 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel

On Wed, Apr 2, 2008 at 12:31 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Tom,
>
>  Thank you very useful comment.
>  that is very interesting.
>
>
>  > I tried it with a real-world program that, among other things, mmaps
>  > anonymous pages and touches them at a reasonable speed until it gets
>  > notified via /dev/mem_notify, releases most of them with
>  > madvise(MADV_DONTNEED), then loops to start the cycle again.
>  >
>  > What tends to happen is that I do indeed get notifications via
>  > /dev/mem_notify when the kernel would like to be swapping, at which
>  > point I free memory.  But the notifications come at a time when the
>  > kernel needs memory, and it gets the memory by discarding some Cached
>  > or Mapped memory (I can see these decreasing in /proc/meminfo with
>  > each notification).  With each mmap/notify/madvise cycle the Cached
>  > and Mapped memory gets smaller, until eventually while I'm touching
>  > pages the kernel can't find enough memory and will either invoke the
>  > OOM killer or return ENOMEM from syscalls.  This is precisely the
>  > situation I'm trying to avoid by using /dev/mem_notify.
>
>  Could you send your test program?
>  I can't reproduce that now, sorry.
>
>
>
>  > The criterion of "notify when the kernel would like to swap" feels
>  > correct, but in addition I seem to need something like "notify when
>  > cached+mapped+free memory is getting low".
>
>  Hmmm,
>  I think this idea is only useful when userland process call
>  madvise(MADV_DONTNEED) periodically.
>
>  but I hope improve my patch and solve your problem.
>  if you don' mind, please help my testing ;)

Here's a test program that allocates memory and frees on notification.
 It takes an argument which is the number of pages to use; use a
number considerably higher than the amount of memory in the system.
I'm running this on a system without swap.  Each time it gets a
notification, it frees memory and writes out the /proc/meminfo
contents.  What I see is that Cached gradually decreases, then Mapped
decreases, and eventually the kernel invokes the oom killer.  It may
be necessary to tune some of the constants that control the allocation
and free rates and latency; these values work for my system.

#define _GNU_SOURCE         /* for clone() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>         /* read, write, close */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <poll.h>
#include <sched.h>
#include <time.h>

#define PAGESIZE 4096

/* How many pages we've mmap'd. */
static int pages;

/* Pointer to mmap'd memory used as a circular buffer.  One thread
   touches pages, another thread releases them on notification. */
static char *p;

/* How many pages to touch each 5ms.  This makes at most 2000
   pages/sec. */
#define TOUCH_CHUNK 10

/* How many pages to free when we're notified.  With a 100ms FREE_DELAY,
   we can free ~9110 pages/sec, or perhaps only 5*911 = 4555 pages/sec if we're
   notified only 5 times/sec. */
#define FREE_CHUNK 911

/* Delay in milliseconds before freeing pages, to simulate latency while finding
   pages to free. */
#define FREE_DELAY 100

static void touch(void);
static int release(void *arg);
static void release_pages(void);
static void show_meminfo(void);

/* Stack for the release thread. */
static char stack[8192];

int
main (int argc, char **argv)
{
    pages = atoi(argv[1]);
    p = mmap(NULL, pages * PAGESIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, 0, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    if (clone(release, stack + sizeof(stack) - 4,
              CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD,
              NULL) == -1) {
        perror("clone failed");
        exit(1);
    }

    touch();
}

static void
touch (void)
{
    int page = 0;

    while (1) {
        int i;
        struct timespec t;
        for (i = 0; i < TOUCH_CHUNK; i++) {
            p[page * PAGESIZE] = 1;
            if (++page >= pages) {
                page = 0;
            }
        }

        t.tv_sec = 0;
        t.tv_nsec = 5000000;
        if (nanosleep(&t, NULL) == -1) {
            perror("nanosleep");
        }
    }
}

static int
release (void *arg)
{
    int fd = open("/dev/mem_notify", O_RDONLY);
    if (fd == -1) {
        perror("open(/dev/mem_notify)");
        exit(1);
    }

    while (1) {
        struct pollfd pfd;
        int nfds;

        pfd.fd = fd;
        pfd.events = POLLIN;

        nfds = poll(&pfd, 1, -1);
        if (nfds == -1) {
            perror("poll");
            exit(1);
        }
        if (nfds == 1) {
            struct timespec t;
            t.tv_sec = 0;
            t.tv_nsec = FREE_DELAY * 1000000;
            if (nanosleep(&t, NULL) == -1) {
                perror("nanosleep");
            }
            release_pages();
            printf("time: %d\n", time(NULL));
            show_meminfo();
        }
    }
}

static void
release_pages (void)
{
    /* Index of the next page to free. */
    static int page = 0;
    int i;

    /* Release FREE_CHUNK pages. */

    for (i = 0; i < FREE_CHUNK; i++) {
        int r = madvise(p + page*PAGESIZE, PAGESIZE, MADV_DONTNEED);
        if (r == -1) {
            perror("madvise");
            exit(1);
        }
        if (++page >= pages) {
            page = 0;
        }
    }
}

static void
show_meminfo (void)
{
    char buffer[2000];
    int fd;
    ssize_t n;

    fd = open("/proc/meminfo", O_RDONLY);
    if (fd == -1) {
        perror("open(/proc/meminfo)");
        exit(1);
    }

    n = read(fd, buffer, sizeof(buffer));
    if (n == -1) {
        perror("read(/proc/meminfo)");
        exit(1);
    }

    n = write(1, buffer, n);
    if (n == -1) {
        perror("write(stdout)");
        exit(1);
    }

    if (close(fd) == -1) {
        perror("close(/proc/meminfo)");
        exit(1);
    }
}

.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-15  0:16       ` Tom May
@ 2008-04-16  2:30         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-04-16  2:30 UTC (permalink / raw)
  To: Tom May; +Cc: kosaki.motohiro, linux-mm, linux-kernel

> Here's a test program that allocates memory and frees on notification.
>  It takes an argument which is the number of pages to use; use a
> number considerably higher than the amount of memory in the system.
> I'm running this on a system without swap.  Each time it gets a
> notification, it frees memory and writes out the /proc/meminfo
> contents.  What I see is that Cached gradually decreases, then Mapped
> decreases, and eventually the kernel invokes the oom killer.  It may
> be necessary to tune some of the constants that control the allocation
> and free rates and latency; these values work for my system.

Thanks a lot!
I'll test it soon :)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-15  0:16       ` Tom May
@ 2008-04-17  9:30         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-04-17  9:30 UTC (permalink / raw)
  To: Tom May; +Cc: kosaki.motohiro, linux-mm, linux-kernel

Hi Tom

> Here's a test program that allocates memory and frees on notification.
>  It takes an argument which is the number of pages to use; use a
> number considerably higher than the amount of memory in the system.
> I'm running this on a system without swap.  Each time it gets a
> notification, it frees memory and writes out the /proc/meminfo
> contents.  What I see is that Cached gradually decreases, then Mapped
> decreases, and eventually the kernel invokes the oom killer.  It may
> be necessary to tune some of the constants that control the allocation
> and free rates and latency; these values work for my system.

Maybe...

I think you misunderstand madvise(MADV_DONTNEED).
madvise(MADV_DONTNEED) tells the kernel to drop the range from the
process page table; it mainly means those pages become easier to swap out.

When run on a system without swap, madvise(MADV_DONTNEED) mostly doesn't
work as you expected.

I am sorry for not being able to help you. ;)




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-17  9:30         ` KOSAKI Motohiro
@ 2008-04-17 19:23           ` Tom May
  -1 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-04-17 19:23 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel

On Thu, Apr 17, 2008 at 2:30 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Tom
>
>
>  > Here's a test program that allocates memory and frees on notification.
>  >  It takes an argument which is the number of pages to use; use a
>  > number considerably higher than the amount of memory in the system.
>  > I'm running this on a system without swap.  Each time it gets a
>  > notification, it frees memory and writes out the /proc/meminfo
>  > contents.  What I see is that Cached gradually decreases, then Mapped
>  > decreases, and eventually the kernel invokes the oom killer.  It may
>  > be necessary to tune some of the constants that control the allocation
>  > and free rates and latency; these values work for my system.
>
>  may be...
>
>  I think you misunderstand madvise(MADV_DONTNEED).
>  madvise(DONTNEED) indicate drop process page table.
>  it mean become easily swap.
>
>  when run on system without swap, madvise(DONTNEED) almost doesn't work
>  as your expected.

madvise can be replaced with munmap and the same behavior occurs.

--- test.c.orig 2008-04-17 11:41:47.000000000 -0700
+++ test.c 2008-04-17 11:44:04.000000000 -0700
@@ -127,7 +127,7 @@
    /* Release FREE_CHUNK pages. */

    for (i = 0; i < FREE_CHUNK; i++) {
-       int r = madvise(p + page*PAGESIZE, PAGESIZE, MADV_DONTNEED);
+       int r = munmap(p + page*PAGESIZE, PAGESIZE);
        if (r == -1) {
            perror("madvise");
            exit(1);

Here's what I'm seeing on my system.  This is with munmap, but I see
the same thing with madvise.  First, /proc/meminfo on my system before
running the test:

# cat /proc/meminfo
MemTotal:       127612 kB
MemFree:         71348 kB
Buffers:          1404 kB
Cached:          52324 kB
SwapCached:          0 kB
Active:           2336 kB
Inactive:        51656 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:              80 kB
Writeback:           0 kB
AnonPages:         276 kB
Mapped:            376 kB
Slab:             1680 kB
SReclaimable:      824 kB
SUnreclaim:        856 kB
PageTables:         52 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      908 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

Here is the start and end of the output from the test program.  At
each /dev/mem_notify notification Cached decreases, then eventually
Mapped decreases as well, which means the amount of time the program
has to free memory gets smaller and smaller.  Finally the oom killer
is invoked because the program can't react quickly enough to free
memory, even though it can free at a faster rate than it can use
memory.  My test is slow to free because it calls nanosleep, but this
is just a simulation of my actual program that has to perform garbage
collection before it can free memory.

# ./test_unmap 250000
time: 1208458019
MemTotal:       127612 kB
MemFree:          5524 kB
Buffers:           872 kB
Cached:          18388 kB
SwapCached:          0 kB
Active:         101468 kB
Inactive:        18220 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      100436 kB
Mapped:            504 kB
Slab:             1608 kB
SReclaimable:      816 kB
SUnreclaim:        792 kB
PageTables:        152 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

time: 1208458020
MemTotal:       127612 kB
MemFree:          5732 kB
Buffers:           820 kB
Cached:          17928 kB
SwapCached:          0 kB
Active:         101708 kB
Inactive:        17752 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      100712 kB
Mapped:            504 kB
Slab:             1608 kB
SReclaimable:      816 kB
SUnreclaim:        792 kB
PageTables:        156 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

time: 1208458021
MemTotal:       127612 kB
MemFree:          5660 kB
Buffers:           820 kB
Cached:          17416 kB
SwapCached:          0 kB
Active:         102228 kB
Inactive:        17316 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      101308 kB
Mapped:            504 kB
Slab:             1608 kB
SReclaimable:      816 kB
SUnreclaim:        792 kB
PageTables:        156 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

--- snip --- now Mapped is decreasing: ---

time: 1208458049
MemTotal:       127612 kB
MemFree:          5568 kB
Buffers:            40 kB
Cached:            868 kB
SwapCached:          0 kB
Active:         119036 kB
Inactive:          720 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      118848 kB
Mapped:            488 kB
Slab:             1456 kB
SReclaimable:      724 kB
SUnreclaim:        732 kB
PageTables:        172 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

time: 1208458050
MemTotal:       127612 kB
MemFree:          5608 kB
Buffers:            40 kB
Cached:            356 kB
SwapCached:          0 kB
Active:         119392 kB
Inactive:          328 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      119324 kB
Mapped:            268 kB
Slab:             1456 kB
SReclaimable:      724 kB
SUnreclaim:        732 kB
PageTables:        172 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB

time: 1208458051
MemTotal:       127612 kB
MemFree:          5428 kB
Buffers:            40 kB
Cached:            116 kB
SwapCached:          0 kB
Active:         119832 kB
Inactive:           84 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      119760 kB
Mapped:             60 kB
Slab:             1440 kB
SReclaimable:      720 kB
SUnreclaim:        720 kB
PageTables:        172 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:     63804 kB
Committed_AS:      944 kB
VmallocTotal:   909280 kB
VmallocUsed:       304 kB
VmallocChunk:   908976 kB
test_unmap invoked oom-killer: gfp_mask=0xa80d2, order=0, oomkilladj=0
 [<c012f1db>] out_of_memory+0x16f/0x1a0
 [<c01308dd>] __alloc_pages+0x2c1/0x300
 [<c013757a>] handle_mm_fault+0x262/0x3e4
 [<c010906b>] do_page_fault+0x407/0x638
 [<c011fba0>] hrtimer_wakeup+0x0/0x18
 [<c0108c64>] do_page_fault+0x0/0x638
 [<c024d822>] error_code+0x6a/0x70

If it's possible to get a notification when MemFree + Cached + Mapped
(I'm not sure whether this is the right formula) falls below some
threshold, so that the program has time to find memory to discard
before the system runs out, that would prevent the oom -- as long as
the application(s) can ensure that there is not too much memory
allocated while it is looking for memory to free.   But at least the
threshold would give it a reasonable amount of time to handle the
notification.
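
For illustration, a rough userspace sketch of that check; the field
choice and the 16 MB threshold here are only placeholders:

#include <stdio.h>
#include <string.h>

#define THRESHOLD_KB (16 * 1024)    /* placeholder threshold */

/* Return the value (in kB) of one /proc/meminfo field, e.g. "MemFree:". */
static long
meminfo_kb (const char *name)
{
    char line[128];
    long kb = 0;
    FILE *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, name, strlen(name)) == 0) {
            sscanf(line + strlen(name), " %ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

int
main (void)
{
    long reclaimable = meminfo_kb("MemFree:")
                     + meminfo_kb("Cached:")
                     + meminfo_kb("Mapped:");

    if (reclaimable < THRESHOLD_KB)
        printf("low memory: %ld kB left, start freeing\n", reclaimable);
    else
        printf("ok: %ld kB left\n", reclaimable);
    return 0;
}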

Thanks,
.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-17 19:23           ` Tom May
  (?)
@ 2008-04-18 10:07           ` KOSAKI Motohiro
  2008-04-21 20:32               ` Tom May
  -1 siblings, 1 reply; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-04-18 10:07 UTC (permalink / raw)
  To: Tom May; +Cc: kosaki.motohiro, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

> madvise can be replaced with munmap and the same behavior occurs.
> 
> --- test.c.orig 2008-04-17 11:41:47.000000000 -0700
> +++ test.c 2008-04-17 11:44:04.000000000 -0700
> @@ -127,7 +127,7 @@
>     /* Release FREE_CHUNK pages. */
> 
>     for (i = 0; i < FREE_CHUNK; i++) {
> -       int r = madvise(p + page*PAGESIZE, PAGESIZE, MADV_DONTNEED);
> +       int r = munmap(p + page*PAGESIZE, PAGESIZE);
>         if (r == -1) {
>             perror("madvise");
>             exit(1);
> 
> Here's what I'm seeing on my system.  This is with munmap, but I see
> the same thing with madvise.  First, /proc/meminfo on my system before
> running the test:

Oh, sorry, my bad!
I investigated again and found 2 problems in your test program.

1. The text segment isn't locked.

   If strong memory pressure happens, the kernel may drop the program's
   text pages, and then your test program suddenly slows down.

   Please use mlockall(MCL_CURRENT) before allocating the large buffer.

2. Repeated open/close of /proc/meminfo.

   In fact, the open(2) system call uses a bit of memory.
   If open(2) is called under strong memory pressure, it doesn't return
   until enough memory has been freed.
   Thus, it can sometimes slow your program down.

I've attached a changed test program :)
It works well in my test environment.


> If it's possible to get a notification when MemFree + Cached + Mapped
> (I'm not sure whether this is the right formula) falls below some
> threshold, so that the program has time to find memory to discard
> before the system runs out, that would prevent the oom -- as long as
> the application(s) can ensure that there is not too much memory
> allocated while it is looking for memory to free.   But at least the
> threshold would give it a reasonable amount of time to handle the
> notification.

Your proposal is interesting,
but I hope you will try my attached test program first.

[-- Attachment #2: tom.c --]
[-- Type: application/octet-stream, Size: 3604 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <poll.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

#define PAGESIZE (64*1024)

/* How many pages we've mmap'd. */
static long pages;

/* Pointer to mmap'd memory used as a circular buffer.  One thread
   touches pages, another thread releases them on notification. */
static char *p;

/* How many pages to touch each 5ms.  This makes at most 2000
   pages/sec. */
#define TOUCH_CHUNK 10

/* How many pages to free when we're notified.  With a 100ms FREE_DELAY,
   we can free ~9110 pages/sec, or perhaps only 5*911 = 4555 pages/sec if we're
   notified only 5 times/sec. */
#define FREE_CHUNK 911

/* Delay in milliseconds before freeing pages, to simulate latency while finding
   pages to free. */
#define FREE_DELAY 100

static void touch(void);
static int release(void *arg);
static void release_pages(void);
static void show_meminfo(void);
static void* _release (void *arg);

int
main (int argc, char **argv)
{
	pthread_t thr;

	mlockall(MCL_CURRENT); /* lock text*/
	setvbuf(stdout, (char *)NULL, _IOLBF, 0);

	pages = atol(argv[1]) * 1024 * 1024 / PAGESIZE;

	p = mmap(NULL, pages * PAGESIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, 0, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	if(pthread_create(&thr, NULL, _release, NULL)) {
		perror("pthread_create");
		exit(1);
	}


	touch();
	return 0;
}

static void
touch (void)
{
	long page = 0;

	while (1) {
		int i;
		struct timespec t;
		for (i = 0; i < TOUCH_CHUNK; i++) {
			p[page * PAGESIZE] = 1;
			if (++page >= pages) {
				page = 0;
			}
		}

#if 1
		t.tv_sec = 0;
		t.tv_nsec = 5 * 1000 * 1000; /* 5ms */
		if (nanosleep(&t, NULL) == -1) {
			perror("nanosleep");
		}
#endif
	}
}

static int
release (void *arg)
{
	int fd = open("/dev/mem_notify", O_RDONLY);
	if (fd == -1) {
		perror("open(/dev/mem_notify)");
		exit(1);
	}

	while (1) {
		struct pollfd pfd;
		int nfds;

		pfd.fd = fd;
		pfd.events = POLLIN;

		printf("poll\n");
		nfds = poll(&pfd, 1, -1);
		if (nfds == -1) {
			perror("poll");
			exit(1);
		}
		printf("notify\n");
		if (nfds == 1) {
			struct timespec t;
			t.tv_sec = 0;
			t.tv_nsec = FREE_DELAY * 1000 * 1000;
			if (nanosleep(&t, NULL) == -1) {
				perror("nanosleep");
			}
			printf("wakeup\n");
			release_pages();
			printf("time: %ld\n", time(NULL));
//			show_meminfo();
		}
	}
}

static void*
_release (void *arg)
{
	release(arg);
	return NULL;
}

static void
release_pages (void)
{
	/* Index of the next page to free. */
	static long page = 0;
	int i;

	/* Release FREE_CHUNK pages. */

	for (i = 0; i < FREE_CHUNK; i++) {
		int r = madvise(p + page*PAGESIZE, PAGESIZE, MADV_DONTNEED);
		if (r == -1) {
			perror("madvise");
			exit(1);
		}
//		printf("free %p\n", p + page*PAGESIZE);
		if (++page >= pages) {
			page = 0;
		}
	}
}

static void
show_meminfo (void)
{
	char buffer[2000];
	int fd;
	ssize_t n;

	fd = open("/proc/meminfo", O_RDONLY);
	if (fd == -1) {
		perror("open(/proc/meminfo)");
		exit(1);
	}

	n = read(fd, buffer, sizeof(buffer));
	if (n == -1) {
		perror("read(/proc/meminfo)");
		exit(1);
	}

	n = write(1, buffer, n);
	if (n == -1) {
		perror("write(stdout)");
		exit(1);
	}

	if (close(fd) == -1) {
		perror("close(/proc/meminfo)");
		exit(1);
	}
}

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-18 10:07           ` KOSAKI Motohiro
@ 2008-04-21 20:32               ` Tom May
  0 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-04-21 20:32 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel

On Fri, Apr 18, 2008 at 3:07 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:

>  I investigated again and found 2 problem in your test program.
>
>  1. text segment isn't locked.
>
>    if strong memory pressure happned, kernel may drop program text region.
>    then your test program suddenly slow down.
>
>    please use mlockall(MCL_CURRENT) before large buffer allocation.

Using mlock does enable the program to respond faster (and/or the
kernel doesn't have to find memory to fault the page in) and solves
the problem for this simple test program.  I think we're thinking of
the solution in two different ways: you want the program to react more
quickly or be "nicer", and I want the kernel to give notification
early enough to allow time for things that can (and do) happen when
things aren't so nice. I realize that in extreme circumstances oom may
be unavoidable, but a threshold-based notification, in addition to the
current /dev/mem_notify mechanism, would help avoid extreme
circumstances.  I'm going to look into doing this.

>  2. repeat open/close to /proc/meminfo.
>
>    in the fact, open(2) system call use a bit memory.
>    if call open(2) in strong memory pressure, doesn't return until
>    memory freed enough.
>    thus, it cause slow down your program sometimes.

This should be fine; I intentionally do the open/read/write/close
after freeing memory.

>  attached changed test program :)
>  it works well on my test environment.

I made your changes to my program (I'm using clone since I don't have
a pthreads library on my device, and I left PAGESIZE at 4K instead of
64K), and having memory locked does avoid oom in this case, but
unfortunately I don't think it's a general solution that will work
everywhere in my system.  (Although I'm going to try it.)

Thanks,
.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-17 19:23           ` Tom May
@ 2008-04-23  8:27             ` Daniel Spång
  -1 siblings, 0 replies; 68+ messages in thread
From: Daniel Spång @ 2008-04-23  8:27 UTC (permalink / raw)
  To: Tom May; +Cc: KOSAKI Motohiro, linux-mm, linux-kernel

Hi Tom

On 4/17/08, Tom May <tom@tommay.com> wrote:
>
>  Here is the start and end of the output from the test program.  At
>  each /dev/mem_notify notification Cached decreases, then eventually
>  Mapped decreases as well, which means the amount of time the program
>  has to free memory gets smaller and smaller.  Finally the oom killer
>  is invoked because the program can't react quickly enough to free
>  memory, even though it can free at a faster rate than it can use
>  memory.  My test is slow to free because it calls nanosleep, but this
>  is just a simulation of my actual program that has to perform garbage
>  collection before it can free memory.

I have also seen this behaviour in my static tests with low mem
notification on swapless systems. It is a problem with small programs
(typically static test programs) where the text segment is only a few
pages. I have not seen this behaviour in larger programs which use a
larger working set. As long as the system working set is bigger than
the amount of memory that needs to be allocated between every
notification reaction opportunity, it seems to be ok.

/Daniel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-04-23  8:27             ` Daniel Spång
@ 2008-05-01  2:07               ` Tom May
  -1 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-05-01  2:07 UTC (permalink / raw)
  To: Daniel Spång; +Cc: KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, Apr 23, 2008 at 1:27 AM, Daniel Spång <daniel.spang@gmail.com> wrote:
> Hi Tom
>
>
>  On 4/17/08, Tom May <tom@tommay.com> wrote:
>  >
>  >  Here is the start and end of the output from the test program.  At
>  >  each /dev/mem_notify notification Cached decreases, then eventually
>  >  Mapped decreases as well, which means the amount of time the program
>  >  has to free memory gets smaller and smaller.  Finally the oom killer
>  >  is invoked because the program can't react quickly enough to free
>  >  memory, even though it can free at a faster rate than it can use
>  >  memory.  My test is slow to free because it calls nanosleep, but this
>  >  is just a simulation of my actual program that has to perform garbage
>  >  collection before it can free memory.
>
>  I have also seen this behaviour in my static tests with low mem
>  notification on swapless systems. It is a problem with small programs
>  (typically static test programs) where the text segment is only a few
>  pages. I have not seen this behaviour in larger programs which use a
>  larger working set. As long as the system working set is bigger than
>  the amount of memory that needs to be allocated, between every
>  notification reaction opportunity, it seems to be ok.

Hi Daniel,

You're saying the program's in-core text pages serve as a reserve that
the kernel can discard when it needs some memory, correct?  And that
even if the kernel discards them, it will page them back in as a
matter of course as the program runs, to maintain the reserve?  That
certainly makes sense.

In my case of a Java virtual machine, where I originally saw the
problem, most of the code is interpreted byte codes or jit-compiled
native code, all of which resides not in the text segment but in
anonymous pages that aren't backed by a file, and there is no swap
space.  The actual text segment working set can be very small (memory
allocation, garbage collection, synchronization, other random native
code).  And, as KOSAKI Motohiro pointed out, it may be wise to mlock
these areas.  So the text working set doesn't make an adequate
reserve.

However, I can maintain a reserve of cached and/or mapped memory by
touching pages in the text segment (or any mapped file) as the final
step of low memory notification handling, if the cached page count is
getting low.  For my purposes, this is nearly the same as having an
additional threshold-based notification, since it forces notifications
to occur while the kernel still has some memory to satisfy allocations
while userspace code works to free memory.  And it's simple.
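
For illustration, a rough sketch of that last step; the file path and
reserve size below are only placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define RESERVE_BYTES (4 * 1024 * 1024)    /* placeholder reserve size */
#define PAGE 4096

static volatile char sink;

/* Fault pages of a read-only file mapping into the page cache so the
   kernel has clean pages it can discard under pressure.  Called after
   freeing memory on a /dev/mem_notify notification, if the cached page
   count looks low. */
static void
refill_cache_reserve (void)
{
    struct stat st;
    size_t len, off;
    char *map;
    int fd = open("/lib/libc.so.6", O_RDONLY);    /* placeholder file */

    if (fd == -1) {
        perror("open");
        return;
    }
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        close(fd);
        return;
    }
    len = st.st_size < RESERVE_BYTES ? st.st_size : RESERVE_BYTES;

    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return;
    }

    /* Reading one byte per page pulls the file pages into the cache;
       they stay clean, so reclaiming them later costs no writeback, and
       they remain cached even after the mapping is gone. */
    for (off = 0; off < len; off += PAGE)
        sink = map[off];

    munmap(map, len);
    close(fd);
}

int
main (void)
{
    refill_cache_reserve();
    return 0;
}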

Unfortunately, this is more expensive than it could be since the pages
need to be read in from some device (mapping /dev/zero doesn't cause
pages to be allocated). What I'm looking for now is a cheap way to
populate the cache with pages that the kernel can throw away when it
needs to reclaim memory.

Thanks,
.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-05-01  2:07               ` Tom May
@ 2008-05-01 15:06                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-05-01 15:06 UTC (permalink / raw)
  To: Tom May
  Cc: kosaki.motohiro, "Daniel Spång", linux-mm, linux-kernel

Hi Tom,

> In my case of a Java virtual machine, where I originally saw the
> problem, most of the code is interpreted byte codes or jit-compiled
> native code, all of which resides not in the text segment but in
> anonymous pages that aren't backed by a file, and there is no swap
> space.  The actual text segment working set can be very small (memory
> allocation, garbage collection, synchronization, other random native
> code).  And, as KOSAKI Motohiro pointed out, it may be wise to mlock
> these areas.  So the text working set doesn't make an adequate
> reserve.

Is your mem_notify check routine written in native code or in Java?
If native, my suggestion is right.
But if Java, it is wrong.

My point is: "on a swapless system, the routine that checks /dev/mem_notify should be mlocked".


> However, I can maintain a reserve of cached and/or mapped memory by
> touching pages in the text segment (or any mapped file) as the final
> step of low memory notification handling, if the cached page count is
> getting low.  For my purposes, this is nearly the same as having an
> additional threshold-based notification, since it forces notifications
> to occur while the kernel still has some memory to satisfy allocations
> while userspace code works to free memory.  And it's simple.
> 
> Unfortunately, this is more expensive than it could be since the pages
> need to be read in from some device (mapping /dev/zero doesn't cause
> pages to be allocated). What I'm looking for now is a cheap way to
> populate the cache with pages that the kernel can throw away when it
> needs to reclaim memory.

I would like to understand your requirement better.
Can I ask more about your system?

I think all the Java text and data is mapped.
When does the cached+mapped+free shortage happen?
And at that time, what is the memory being used for?

Please don't think I object to your proposal;
I merely don't understand your system yet.

If I write new code before understanding your requirement exactly,
it will introduce many bugs.


IMHO threshold-based notification has a problem.
If low memory happens and the application has no freeable memory,
the notifications don't stop and CPU usage increases dramatically, but it is perfectly useless.

I don't think embedded Java is unimportant, but I don't want a
desktop regression...




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-05-01 15:06                 ` KOSAKI Motohiro
@ 2008-05-02 22:21                   ` Tom May
  -1 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-05-02 22:21 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: "Daniel Spång", linux-mm, linux-kernel

On Thu, May 1, 2008 at 8:06 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Tom,
>
>
>  > In my case of a Java virtual machine, where I originally saw the
>  > problem, most of the code is interpreted byte codes or jit-compiled
>  > native code, all of which resides not in the text segment but in
>  > anonymous pages that aren't backed by a file, and there is no swap
>  > space.  The actual text segment working set can be very small (memory
>  > allocation, garbage collection, synchronization, other random native
>  > code).  And, as KOSAKI Motohiro pointed out, it may be wise to mlock
>  > these areas.  So the text working set doesn't make an adequate
>  > reserve.
>
>  your memnotify check routine is written by native or java?

Some of each.

>  if native, my suggestion is right.
>  but if java, it is wrong.
>
>  my point is "on swapless system, /dev/mem_notify checked routine should be mlocked".

mlocking didn't fix things, it just made the oom happen at a different
time (see graphs below), both in the small test program where I used
mlockall, and in the jvm where during initialization I read
/proc/self/maps and mlocked each region of memory that was mapped to a
file.  Note that without swap, all of the anonymous pages containing
the java code are effectively locked in memory, too, so everything
runs without page faults.
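
(Roughly like the following hypothetical sketch, not the exact code;
it walks /proc/self/maps and mlocks every file-backed mapping:)

    #include <stdio.h>
    #include <sys/mman.h>

    static void mlock_file_backed_mappings(void)
    {
        char line[512], perms[8], path[256];
        unsigned long start, end;
        FILE *maps = fopen("/proc/self/maps", "r");

        if (!maps)
            return;
        while (fgets(line, sizeof(line), maps)) {
            /* format: start-end perms offset dev inode [path] */
            if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255s",
                       &start, &end, perms, path) == 4 && path[0] == '/')
                mlock((void *)start, end - start);
        }
        fclose(maps);
    }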

>  > However, I can maintain a reserve of cached and/or mapped memory by
>  > touching pages in the text segment (or any mapped file) as the final
>  > step of low memory notification handling, if the cached page count is
>  > getting low.  For my purposes, this is nearly the same as having an
>  > additional threshold-based notification, since it forces notifications
>  > to occur while the kernel still has some memory to satisfy allocations
>  > while userspace code works to free memory.  And it's simple.
>  >
>  > Unfortunately, this is more expensive than it could be since the pages
>  > need to be read in from some device (mapping /dev/zero doesn't cause
>  > pages to be allocated). What I'm looking for now is a cheap way to
>  > populate the cache with pages that the kernel can throw away when it
>  > needs to reclaim memory.
>
>  I hope understand your requirement more.

Most simply, I need to get low memory notifications while there is
still enough memory to handle them before oom.

>  Can I ask your system more?

x86, Linux 2.6.23.9 (with your patches trivially backported), 128MB,
no swap.  Is there something else I can tell you?

>  I think all java text and data is mapped.

It's not what /proc/meminfo calls "Mapped".  It's in anonymous pages
with no backing store, i.e., mmap with MAP_ANONYMOUS.

>  When cached+mapped+free memory is happend?
>  and at the time, What is used memory?

Here's a graph of MemFree, Cached, and Mapped over time (I believe
Mapped is mostly or entirely a subset of Cached here, so it's not
actually important):

http://www.tommay.net/memory.png

The amount of MemFree fluctuates as java allocates and garbage
collects, but the Cached memory decreases (since the kernel has to use
it for __alloc_pages when memory is low) until at some point there is
no memory to satisfy __alloc_pages and there is an oom.

The same thing happens if I use mlock, only it happens sooner because
the kernel can't discard any of the 15MB of mlock'd memory so it
actually runs out of memory faster:

http://www.tommay.net/memory-mlock.png

I'm not sure how to explain it differently than I have before.  Maybe
someone else could explain it better.  So, at the risk of merely
repeating myself: The jvm allocates memory.  MemFree decreases.  In
the kernel, __alloc_pages is called.  It finds that memory is low,
memory_pressure_notify is called, and some cached pages are moved to
the inactive list.  These pages may then be used to satisfy
__alloc_pages requests.  The jvm gets the notification, collects
garbage, and returns memory to the kernel which appears as MemFree in
/proc/meminfo.  The cycle continues: the jvm allocates memory until
memory_pressure_notify is called, more cached pages are moved to the
inactive list, etc.  Eventually there are no more pages to move to the
inactive list, and __alloc_pages will invoke the oom killer.

>  Please don't think I have objection your proposal.
>  merely, I don't understand your system yet.
>
>  if I make new code before understand your requirement exactly,
>  It makes many bug.

Of course.

>  IMHO threshold based notification has a problems.
>  if low memory happend and application has no freeable memory,
>  mem notification don't stop and increase CPU usage dramatically, but it is perfectly useless.

My thought was to notify only when the threshold is crossed, i.e.,
edge-triggered not level-triggered.  But I now think a threshold
mechanism may be too complicated, and artificially putting pages in
the cache works just as well.  As a proof-of-concept, I do this, and
it works well, but is inefficient:

   extern char _text;
   /* touch one byte per page of the file-backed text segment so the kernel
    * pulls those pages into the page cache ('bytes' must stay within it) */
   for (int i = 0; i < bytes; i += 4096) {
       *((volatile char *)&_text + i);
   }

>  I don't thin embedded java is not important, but I don't hope
>  desktop regression...

I think embedded Java is a perfect user of /dev/mem_notify :-) I was
happy to see your patches and the criteria you used for notification.
But I'm having a problem in practice :-(

.tom

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-05-02 22:21                   ` Tom May
@ 2008-05-03 12:26                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 68+ messages in thread
From: KOSAKI Motohiro @ 2008-05-03 12:26 UTC (permalink / raw)
  To: Tom May
  Cc: kosaki.motohiro, "Daniel Spång", linux-mm, linux-kernel

> >  your memnotify check routine is written by native or java?
> 
> Some of each.

Wow!
You have two /dev/mem_notify checking routines?
The Java routine frees Java memory and the native routine frees native memory, right?


> >  my point is "on swapless system, /dev/mem_notify checked routine should be mlocked".
> 
> mlocking didn't fix things, it just made the oom happen at a different
> time (see graphs below), both in the small test program where I used
> mlockall, and in the jvm where during initialization I read
> /proc/self/maps and mlocked each region of memory that was mapped to a
> file.  Note that without swap, all of the anonymous pages containing
> the java code are effectively locked in memory, too, so everything
> runs without page faults.

Okay.


> >  I hope understand your requirement more.
> 
> Most simply, I need to get low memory notifications while there is
> still enough memory to handle them before oom.

Ah, that's your idea for the implementation.
First, I'd like to understand why my current implementation doesn't work well for you.


> >  Can I ask your system more?
> 
> x86, Linux 2.6.23.9 (with your patches trivially backported), 128MB,
> no swap.  Is there something else I can tell you?
> 
> >  I think all java text and data is mapped.
> 
> It's not what /proc/meminfo calls "Mapped".  It's in anonymous pages
> with no backing store, i.e., mmap with MAP_ANONYMOUS.

Okay.
Mapped in /proc/meminfo means mapped pages that have a file backing store,
so it doesn't include anonymous memory (e.g. the Java heap).

> >  When cached+mapped+free memory is happend?
> >  and at the time, What is used memory?
> 
> Here's a graph of MemFree, Cached, and Mapped over time (I believe
> Mapped is mostly or entirely subset of Cached here, so it's not
> actually important):
> 
> http://www.tommay.net/memory.png

I'd like to know your system's memory usage in more detail.
Your system has 128MB, but your graph's vertical axis only covers 0M - 35M.
Who is using the remaining 93MB (128-35)?
We should know who is using that memory instead of only watching cached memory decrease.

So, can you run the following before measuring?

# echo 100 > /proc/sys/vm/swappiness
# echo 3 > /proc/sys/vm/drop_caches

And can you measure AnonPages from /proc/meminfo too?
(Can your memory-shrinking routine reduce anonymous memory at all?)

If the JVM uses its memory as anonymous memory and your memory-shrinking routine
can't shrink anonymous memory, that isn't a mem_notify problem;
it is just a poor JVM garbage collection problem.

Why do I think that?
The mapped pages in your graph decrease linearly;
if the notifications weren't happening, they wouldn't decrease.

Thus, on your system the memory notifications are being delivered correctly,
but your JVM doesn't have enough freeable memory.

If my assumption is right, increasing the number of memory notifications
won't solve your problem.

Should we find a way for the JVM GC and the mem_notify shrinker to cooperate?
Should the mem_notify shrinker kick the JVM GC to shrink anonymous memory?
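
(Just to illustrate that second question, a hypothetical sketch of a native
mem_notify watcher kicking the JVM GC through the JNI invocation interface;
the surrounding integration is an assumption, only the JNI calls themselves
are standard:)

    #include <jni.h>

    /* assumed to be called by the native watcher after poll() on
     * /dev/mem_notify reports low memory; 'jvm' is the JavaVM handle
     * obtained at startup */
    static void kick_jvm_gc(JavaVM *jvm)
    {
        JNIEnv *env;

        if ((*jvm)->AttachCurrentThread(jvm, (void **)&env, NULL) != JNI_OK)
            return;

        jclass cls = (*env)->FindClass(env, "java/lang/System");
        if (cls) {
            jmethodID gc = (*env)->GetStaticMethodID(env, cls, "gc", "()V");
            if (gc)
                (*env)->CallStaticVoidMethod(env, cls, gc);
        }

        (*jvm)->DetachCurrentThread(jvm);
    }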



> My thought was to notify only when the threshold is crossed, i.e.,
> edge-triggered not level-triggered.  

Hm, interesting..


> But I now think a threshold
> mechanism may be too complicated, and artificially putting pages in
> the cache works just as well.  As a proof-of-concept, I do this, and
> it works well, but is inefficient:
> 
>    extern char _text;
>    for (int i = 0; i < bytes; i += 4096) {
>        *((volatile char *)&_text + i);
>    }

You intend to populate the .text segment?
If so, you can use mmap(MAP_POPULATE), IMHO.
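
(For example, a minimal sketch assuming you map your executable, or any other
file, read-only; MAP_POPULATE asks the kernel to read the pages in up front so
they sit in the page cache:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* hypothetical helper: prefault a file's pages into the page cache */
    static void *populate_file(const char *path)
    {
        struct stat st;
        void *p = MAP_FAILED;
        int fd = open(path, O_RDONLY);

        if (fd >= 0 && fstat(fd, &st) == 0)
            p = mmap(NULL, st.st_size, PROT_READ,
                     MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (fd >= 0)
            close(fd);
        return p == MAP_FAILED ? NULL : p;
    }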


> I think embedded Java is a perfect user of /dev/mem_notify :-) I was
> happy to see your patches and the criteria you used for notification.
> But I'm having a problem in practice :-(

Yeah, absolutely.
I'll try to set up a JVM in my test environment tomorrow.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/8][for -mm] mem_notify v6
  2008-05-03 12:26                     ` KOSAKI Motohiro
@ 2008-05-06  5:22                       ` Tom May
  -1 siblings, 0 replies; 68+ messages in thread
From: Tom May @ 2008-05-06  5:22 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Daniel Spång, linux-mm, linux-kernel

All the talk about what is using memory, whether things are in
Java or native, etc., is missing the point.  I'm not sure I'm making
my point, so I'll try again: regardless of memory size, mlock, etc.,
the interaction between client and /dev/mem_notify can be unstable,
and is unstable in my case:

- User-space code page faults in MAP_ANONYMOUS regions until there is
no free memory.
- The kernel gives a notification.
- The kernel frees some cache to satisfy the memory request.
- The user-space code gets the notification and frees anonymous pages.
Concurrently with this, some thread(s) in the system may continue to
page fault.
- The cycle repeats.
- This works well, perhaps for hundreds or thousands of cycles, until all
or most of the cache has been freed and we get an oom handling a page
fault.

My requirement is to have a stable system, with memory allocated on
demand to whatever process(es) want it (jvm, web browser, ...) until a
low memory notification occurs, which causes them to free whatever
memory they no longer need, then continue, without arbitrary static
limits on Java heap size, web browser cache size, etc.

My workaround to make things stable is to put pages in the cache
(after releasing anonymous pages and increasing free memory) by
accessing pages in _text, but that seems silly and expensive.
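
In other words the handler ends up shaped roughly like this (a hypothetical
sketch; free_unneeded_memory() and reserve_bytes are just placeholders for the
application-specific parts):

    /* placeholder: GC, drop application caches, ... */
    extern void free_unneeded_memory(void);

    /* assumed: called when poll() on /dev/mem_notify reports low memory */
    static void handle_low_memory(long reserve_bytes)
    {
        extern char _text;

        free_unneeded_memory();

        /* rebuild a reserve of discardable page cache by touching text
         * pages; reserve_bytes must stay within the mapped text region */
        for (long i = 0; i < reserve_bytes; i += 4096)
            *((volatile char *)&_text + i);
    }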

.tom

On Sat, May 3, 2008 at 5:26 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> > >  your memnotify check routine is written by native or java?
>  >
>  > Some of each.
>
>  Wow!
>  you have 2 /dev/mem_notify checking routine?
>  java routine free java memory, native routine free native memory, right?
>
>
>
>  > >  my point is "on swapless system, /dev/mem_notify checked routine should be mlocked".
>  >
>  > mlocking didn't fix things, it just made the oom happen at a different
>  > time (see graphs below), both in the small test program where I used
>  > mlockall, and in the jvm where during initialization I read
>  > /proc/self/maps and mlocked each region of memory that was mapped to a
>  > file.  Note that without swap, all of the anonymous pages containing
>  > the java code are effectively locked in memory, too, so everything
>  > runs without page faults.
>
>  okey.
>
>
>
>  > >  I hope understand your requirement more.
>  >
>  > Most simply, I need to get low memory notifications while there is
>  > still enough memory to handle them before oom.
>
>  Ah, That's your implementation idea.
>  I hope know why don't works well my current implementation at first.
>
>
>
>  > >  Can I ask your system more?
>  >
>  > x86, Linux 2.6.23.9 (with your patches trivially backported), 128MB,
>  > no swap.  Is there something else I can tell you?
>  >
>  > >  I think all java text and data is mapped.
>  >
>  > It's not what /proc/meminfo calls "Mapped".  It's in anonymous pages
>  > with no backing store, i.e., mmap with MAP_ANONYMOUS.
>
>  okey.
>  Mapped of /proc/meminfo mean mapped pages with file backing store.
>
>  therefore, that isn't contain anonymous memory(e.g. java).
>
>
>  > >  When cached+mapped+free memory is happend?
>  > >  and at the time, What is used memory?
>  >
>  > Here's a graph of MemFree, Cached, and Mapped over time (I believe
>  > Mapped is mostly or entirely subset of Cached here, so it's not
>  > actually important):
>  >
>  > http://www.tommay.net/memory.png
>
>  I hope know your system memory usage detail.
>  your system have 128MB, but your graph vertical line represent 0M - 35M.
>  Who use remain 93MB(128-35)?
>  We should know who use memory intead decrease cached memory.
>
>  So, Can you below operation before mesurement?
>
>  # echo 100 > /proc/sys/vm/swappiness
>  # echo 3 >/proc/sys/vm/drop_caches
>
>  and, Can you mesure AnonPages of /proc/meminfo too?
>  (Can your memory shrinking routine reduce anonymous memory?)
>
>  if JVM use memory as anonymous memory and your memory shrinking routine can't
>  anonymous memory, that isn't mem_notify proble,
>  that is just poor JVM garbege collection problem.
>
>  Why I think that?
>  mapped page of your graph decrease linearly.
>  if notification doesn't happened, it doesn't decrease.
>
>  thus,
>  in your system, memory notification is happend rightly.
>  but your JVM doesn't have enough freeable memory.
>
>  if my assumption is right, increase number of memory notification
>  doesn't solve your problem.
>
>  Sould we find way of good interaction to JVM GC and mem_notify shrinker?
>  Sould mem_notify shrinker kick JVM GC for shrink anonymous memory?
>
>
>
>
>  > My thought was to notify only when the threshold is crossed, i.e.,
>  > edge-triggered not level-triggered.
>
>  Hm, interesting..
>
>
>
>  > But I now think a threshold
>  > mechanism may be too complicated, and artificially putting pages in
>  > the cache works just as well.  As a proof-of-concept, I do this, and
>  > it works well, but is inefficient:
>  >
>  >    extern char _text;
>  >    for (int i = 0; i < bytes; i += 4096) {
>  >        *((volatile char *)&_text + i);
>  >    }
>
>  you intent populate to .text segment?
>  if so, you can mamp(MAP_POPULATE), IMHO.
>
>
>
>  > I think embedded Java is a perfect user of /dev/mem_notify :-) I was
>  > happy to see your patches and the criteria you used for notification.
>  > But I'm having a problem in practice :-(
>
>  Yeah, absolutely.
>  I'll try to set up JVM to my test environment tomorrow.
>
>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2008-05-06  5:22 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-09 15:19 [PATCH 0/8][for -mm] mem_notify v6 KOSAKI Motohiro
2008-02-09 15:19 ` KOSAKI Motohiro
2008-02-09 15:19 ` KOSAKI Motohiro
2008-02-09 16:02 ` Jon Masters
2008-02-09 16:02   ` Jon Masters
2008-02-09 16:33   ` KOSAKI Motohiro
2008-02-09 16:33     ` KOSAKI Motohiro
2008-02-09 16:43     ` Rik van Riel
2008-02-09 16:43       ` Rik van Riel
2008-02-09 16:49       ` KOSAKI Motohiro
2008-02-09 16:49         ` KOSAKI Motohiro
2008-02-11 15:36 ` [PATCH 0/8][for -mm] mem_notify v6, " Jonathan Corbet
2008-02-11 15:36   ` Jonathan Corbet
2008-02-11 15:46   ` KOSAKI Motohiro
2008-02-11 15:46     ` KOSAKI Motohiro
2008-02-17 14:49 ` Paul Jackson
2008-02-17 14:49   ` Paul Jackson
2008-02-19  7:36   ` KOSAKI Motohiro
2008-02-19  7:36     ` KOSAKI Motohiro
2008-02-19 15:00     ` Paul Jackson
2008-02-19 15:00       ` Paul Jackson
2008-02-19 19:02       ` Rik van Riel
2008-02-19 19:02         ` Rik van Riel
2008-02-19 20:18         ` Paul Jackson
2008-02-19 20:18           ` Paul Jackson
2008-02-19 20:43           ` Paul Jackson
2008-02-19 20:43             ` Paul Jackson
2008-02-19 22:28       ` Pavel Machek
2008-02-19 22:28         ` Pavel Machek
2008-02-20  1:54         ` Paul Jackson
2008-02-20  1:54           ` Paul Jackson
2008-02-20  2:07         ` Rik van Riel
2008-02-20  2:07           ` Rik van Riel
2008-02-20  2:48           ` KOSAKI Motohiro
2008-02-20  2:48             ` KOSAKI Motohiro
2008-02-20  4:57             ` Paul Jackson
2008-02-20  4:57               ` Paul Jackson
2008-02-20  5:21               ` KOSAKI Motohiro
2008-02-20  5:21                 ` KOSAKI Motohiro
2008-02-20  4:36           ` Paul Jackson
2008-02-20  4:36             ` Paul Jackson
2008-04-01 23:35 ` Tom May
2008-04-01 23:35   ` Tom May
2008-04-02  7:31   ` KOSAKI Motohiro
2008-04-02 17:45     ` Tom May
2008-04-15  0:16     ` Tom May
2008-04-15  0:16       ` Tom May
2008-04-16  2:30       ` KOSAKI Motohiro
2008-04-16  2:30         ` KOSAKI Motohiro
2008-04-17  9:30       ` KOSAKI Motohiro
2008-04-17  9:30         ` KOSAKI Motohiro
2008-04-17 19:23         ` Tom May
2008-04-17 19:23           ` Tom May
2008-04-18 10:07           ` KOSAKI Motohiro
2008-04-21 20:32             ` Tom May
2008-04-21 20:32               ` Tom May
2008-04-23  8:27           ` Daniel Spång
2008-04-23  8:27             ` Daniel Spång
2008-05-01  2:07             ` Tom May
2008-05-01  2:07               ` Tom May
2008-05-01 15:06               ` KOSAKI Motohiro
2008-05-01 15:06                 ` KOSAKI Motohiro
2008-05-02 22:21                 ` Tom May
2008-05-02 22:21                   ` Tom May
2008-05-03 12:26                   ` KOSAKI Motohiro
2008-05-03 12:26                     ` KOSAKI Motohiro
2008-05-06  5:22                     ` Tom May
2008-05-06  5:22                       ` Tom May
