All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression from 2.6.36
@ 2011-03-15 13:25 azurIt
  2011-03-17  0:15 ` Greg KH
  0 siblings, 1 reply; 98+ messages in thread
From: azurIt @ 2011-03-15 13:25 UTC (permalink / raw)
  To: linux-kernel


Hi,

we have been successfully running several very busy web servers on 2.6.32.*, and
a few days ago I decided to upgrade to 2.6.37 (mainly because of the blkio cgroup).
I installed 2.6.37.2 on one of the servers and very strange things started to
happen with the Apache web server.

We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ), so it does
lots of fork() and lots of setuid(). I have also noticed that the problem
happens only on very busy servers.

Everything is OK when Apache is started, but as time passes, its 'root'
processes (Apache processes running under root) consume more and more CPU.
Finally, the whole server becomes very unstable and Apache must be restarted.
This repeats until the load on the web sites drops (usually around 22:00).
Sometimes it takes 3 hours before a restart is needed, sometimes only 1 hour
(again, depending on the load on the web sites). Here is a graph of CPU
utilization showing the problem (red color); Apache was restarted at 8:11 and 9:35:
http://watchdog.sk/lkml/cpu-problem.png

Here is how it looks on htop:
http://watchdog.sk/lkml/htop.jpg

And finally, here is how it looks with older kernels (yes, when I install an
older kernel, the problem is gone); notice also that the I/O wait is much lower
and smoother (blue color):
http://watchdog.sk/lkml/cpu-ok.png

I was also strace-ing the Apache processes which were causing problems; here it is:
http://watchdog.sk/lkml/strace.txt

I'm not 100% sure, but I think the CPU time was being consumed on the 'futex' lines.

I tried several kernel versions and found that everything BEFORE 2.6.36 is
NOT affected and everything FROM 2.6.36 onward (2.6.36 included) IS affected.

Versions which I tried and which were NOT affected by this problem:
2.6.32.*
2.6.35.11

Versions which I tried and which were affected by this problem:
2.6.36
2.6.36.4
2.6.37.2
2.6.37.3
2.6.38-rc8 (the final version had not been released yet)

All tests were made on vanilla kernels on Debian Lenny with this config:
http://watchdog.sk/lkml/config

Do you need any other information from me? I'm able to try other versions or
patches, but please take into account that I have to do this on a _production_
server (I failed to reproduce the problem in a testing environment). Also, I'm
able to try only one kernel per day.

Thank you!

azurit

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-03-15 13:25 Regression from 2.6.36 azurIt
@ 2011-03-17  0:15 ` Greg KH
  2011-03-17  0:53   ` Dave Jones
  2011-04-07 10:01   ` azurIt
  0 siblings, 2 replies; 98+ messages in thread
From: Greg KH @ 2011-03-17  0:15 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel

On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote:
> 
> Hi,
> 
> we are successfully running several very busy web servers on 2.6.32.* and
> few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup).
> I installed 2.6.37.2 on one of the servers and very strange things started to
> happen with Apache web server.
> 
> We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing
> lots of 'fork' and lots of 'setuid'. I have also noticed that problem is
> happening only on very busy servers.
> 
> Everything is ok when Apache is started but as time is passing by, its 'root'
> processes (Apache processes running under root) are consuming more and more CPU.
> Finally, the whole server becames very unstable and Apache must be restarted.
> This is repeating until the load on web sites is much lower (usually on 22:00).
> Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again,
> depends on load on web sites). Here is the graph of CPU utilization showing the
> problem (red color), Apache was REstarted at 8:11 and 9:35:
> http://watchdog.sk/lkml/cpu-problem.png
> 
> Here is how it looks on htop:
> http://watchdog.sk/lkml/htop.jpg
> 
> And finally here is how it looks with older kernels (yes, when i install older
> kernel, problem is gone), notice also that I/O wait is much lower and nicer
> (blue color):
> http://watchdog.sk/lkml/cpu-ok.png
> 
> I was also strace-ing Apache processes which were doing problems, here it is:
> http://watchdog.sk/lkml/strace.txt
> 
> I'm not 100% sure but I think that CPU was consumed on 'futex' lines.
> 
> I tried several kernel versions and find out that everything BEFORE 2.6.36 is
> NOT affected and everything AFTER 2.6.36 (included) is affected.
> 
> Versions which I tried and were NOT affected by this problem:
> 2.6.32.*
> 2.6.35.11
> 
> Versions which I tried and were affected by this problem:
> 2.6.36
> 2.6.36.4
> 2.6.37.2
> 2.6.37.3
> 2.6.38-rc8 (final version was not released yet)
> 
> All tests were made on vanilla kernels on Debian Lenny with this config:
> http://watchdog.sk/lkml/config
> 
> Do you need any other information from me ? I'm able to try other versions or
> patches but, please, take into account that I have to do this on _production_
> server (I failed to reproduce it in testing environment). Also, I'm able to try
> only one kernel per day.

Ick, one kernel per day might make this a bit difficult, but if there
was any way you could use 'git bisect' to try to narrow this down to the
patch that caused this problem, it would be great.

You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
there and try to offer you different chances to find the problem.
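
In concrete terms, the workflow described above would look something like this
(the tag names assume a clone of Linus' tree; each round requires building and
booting the commit git checks out):

```shell
git bisect start
git bisect bad v2.6.36       # first release known to show the problem
git bisect good v2.6.35      # last release known to be fine
# Build and boot the commit git checks out, observe the server under load,
# then run 'git bisect good' or 'git bisect bad' and repeat until git
# prints the first bad commit.
```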

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-03-17  0:15 ` Greg KH
@ 2011-03-17  0:53   ` Dave Jones
  2011-03-17 13:30     ` azurIt
  2011-04-07 10:01   ` azurIt
  1 sibling, 1 reply; 98+ messages in thread
From: Dave Jones @ 2011-03-17  0:53 UTC (permalink / raw)
  To: Greg KH; +Cc: azurIt, linux-kernel

On Wed, Mar 16, 2011 at 05:15:19PM -0700, Greg Kroah-Hartman wrote:

 > > Do you need any other information from me ? I'm able to try other versions or
 > > patches but, please, take into account that I have to do this on _production_
 > > server (I failed to reproduce it in testing environment). Also, I'm able to try
 > > only one kernel per day.
 > 
 > Ick, one kernel per day might make this a bit difficult, but if there
 > was any way you could use 'git bisect' to try to narrow this down to the
 > patch that caused this problem, it would be great.
 > 
 > You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
 > there and try to offer you different chances to find the problem.

Comparing the output of a perf profile between the good/bad kernels might
narrow it down faster than a bisect if something obvious sticks out.

	Dave


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-03-17  0:53   ` Dave Jones
@ 2011-03-17 13:30     ` azurIt
  0 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-03-17 13:30 UTC (permalink / raw)
  To: Dave Jones, Greg KH; +Cc: linux-kernel


Bisecting: 5103 revisions left to test after this (roughly 12 steps)

If I'm right, it will take 12 reboots. I'm really able to reboot only once per day and NOT during the weekend, so this will take about 2.5 weeks.
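
The step count git prints is essentially the depth of a binary search over the
remaining revisions; a quick sketch of the estimate (the `bisect_steps` helper
is illustrative, not part of git, whose internal estimator differs slightly):

```python
import math

def bisect_steps(revisions: int) -> int:
    """Rough number of build/boot rounds a bisection needs:
    each round halves the remaining revision range."""
    return int(math.log2(revisions)) if revisions > 1 else 0

print(bisect_steps(5103))  # → 12, matching git's "roughly 12 steps"
```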

What about that 'perf' tool? Can anyone please tell me how exactly I should run it to gather useful data?
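
For the record, a typical system-wide profiling session of the kind Dave
suggests might look like this (the duration is illustrative, and the perf
userspace tool must match the running kernel):

```shell
# sample all CPUs with call graphs for 60 seconds
perf record -a -g -- sleep 60
# interactive summary of the hottest kernel/user symbols
perf report
# or a plain-text dump, handy for diffing the good vs. bad kernels
perf report --stdio > perf-$(uname -r).txt
```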

Thank you.

 
 ______________________________________________________________
 > From: "Dave Jones" 
 > To: Greg KH 
 > Date: 17.03.2011 01:53
 > Subject: Re: Regression from 2.6.36
 >
 > CC: linux-kernel@vger.kernel.org On Wed, Mar 16, 2011 at 05:15:19PM -0700, Greg Kroah-Hartman wrote: 
 
 > > Do you need any other information from me ? I'm able to try other versions or 
 > > patches but, please, take into account that I have to do this on _production_ 
 > > server (I failed to reproduce it in testing environment). Also, I'm able to try 
 > > only one kernel per day. 
 >  
 > Ick, one kernel per day might make this a bit difficult, but if there 
 > was any way you could use 'git bisect' to try to narrow this down to the 
 > patch that caused this problem, it would be great. 
 >  
 > You can mark 2.6.35 as working and 2.6.36 as bad and git will go from 
 > there and try to offer you different chances to find the problem. 
 
 Comparing the output of a perf profile between the good/bad kernels might 
 narrow it down faster than a bisect if something obvious sticks out. 
 
 Dave

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-03-17  0:15 ` Greg KH
  2011-03-17  0:53   ` Dave Jones
@ 2011-04-07 10:01   ` azurIt
  2011-04-07 10:19       ` Jiri Slaby
  1 sibling, 1 reply; 98+ messages in thread
From: azurIt @ 2011-04-07 10:01 UTC (permalink / raw)
  To: linux-kernel


I have finally completed bisection, here are the results:



a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6 is first bad commit
commit a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6
Author: Changli Gao <xiaosuo@gmail.com>
Date:   Tue Aug 10 18:01:35 2010 -0700

    vfs: use kmalloc() to allocate fdmem if possible
   
    Use kmalloc() to allocate fdmem if possible.
   
    vmalloc() is used as a fallback solution for fdmem allocation.  A new
    helper function __free_fdtable() is introduced to reduce the lines of
    code.
   
    A potential bug, vfree() a memory allocated by kmalloc(), is fixed.
   
    [akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
    Signed-off-by: Changli Gao <xiaosuo@gmail.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Jiri Slaby <jslaby@suse.cz>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Avi Kivity <avi@redhat.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 a7b3997bc754f573b4a309cda1a0774ea95c235e 4241a4f2115c60e5c1dc1879c85c9911fa077807 M      fs
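
The behavior change at issue can be summarized by this sketch of the
post-commit allocation path (paraphrased from the commit message, not the
exact kernel source):

```c
/* Sketch of fs/file.c after a892e2d7: try kmalloc() first and fall back
 * to vmalloc() only when the contiguous allocation fails. Before this
 * commit, fdtable allocations larger than a page went straight to vmalloc().
 */
static void *alloc_fdmem(unsigned int size)
{
	void *data = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (data != NULL)
		return data;
	return vmalloc(size);
}

static void free_fdmem(void *ptr)
{
	/* the free path must match whichever allocator succeeded */
	is_vmalloc_addr(ptr) ? vfree(ptr) : kfree(ptr);
}
```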





 
 ______________________________________________________________
 > From: "Greg KH" <greg@kroah.com>
 > To: azurIt <azurit@pobox.sk>
 > Date: 17.03.2011 01:15
 > Subject: Re: Regression from 2.6.36
 >
 > CC: linux-kernel@vger.kernel.org On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote: 
 >  
 > Hi, 
 >  
 > we are successfully running several very busy web servers on 2.6.32.* and 
 > few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup). 
 > I installed 2.6.37.2 on one of the servers and very strange things started to 
 > happen with Apache web server. 
 >  
 > We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing 
 > lots of 'fork' and lots of 'setuid'. I have also noticed that problem is 
 > happening only on very busy servers. 
 >  
 > Everything is ok when Apache is started but as time is passing by, its 'root' 
 > processes (Apache processes running under root) are consuming more and more CPU. 
 > Finally, the whole server becames very unstable and Apache must be restarted. 
 > This is repeating until the load on web sites is much lower (usually on 22:00). 
 > Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again, 
 > depends on load on web sites). Here is the graph of CPU utilization showing the 
 > problem (red color), Apache was REstarted at 8:11 and 9:35: 
 > http://watchdog.sk/lkml/cpu-problem.png 
 >  
 > Here is how it looks on htop: 
 > http://watchdog.sk/lkml/htop.jpg 
 >  
 > And finally here is how it looks with older kernels (yes, when i install older 
 > kernel, problem is gone), notice also that I/O wait is much lower and nicer 
 > (blue color): 
 > http://watchdog.sk/lkml/cpu-ok.png 
 >  
 > I was also strace-ing Apache processes which were doing problems, here it is: 
 > http://watchdog.sk/lkml/strace.txt 
 >  
 > I'm not 100% sure but I think that CPU was consumed on 'futex' lines. 
 >  
 > I tried several kernel versions and find out that everything BEFORE 2.6.36 is 
 > NOT affected and everything AFTER 2.6.36 (included) is affected. 
 >  
 > Versions which I tried and were NOT affected by this problem: 
 > 2.6.32.* 
 > 2.6.35.11 
 >  
 > Versions which I tried and were affected by this problem: 
 > 2.6.36 
 > 2.6.36.4 
 > 2.6.37.2 
 > 2.6.37.3 
 > 2.6.38-rc8 (final version was not released yet) 
 >  
 > All tests were made on vanilla kernels on Debian Lenny with this config: 
 > http://watchdog.sk/lkml/config 
 >  
 > Do you need any other information from me ? I'm able to try other versions or 
 > patches but, please, take into account that I have to do this on _production_ 
 > server (I failed to reproduce it in testing environment). Also, I'm able to try 
 > only one kernel per day. 
 
 Ick, one kernel per day might make this a bit difficult, but if there 
 was any way you could use 'git bisect' to try to narrow this down to the 
 patch that caused this problem, it would be great. 
 
 You can mark 2.6.35 as working and 2.6.36 as bad and git will go from 
 there and try to offer you different chances to find the problem. 
 
 thanks, 
 
 greg k-h

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 10:01   ` azurIt
  2011-04-07 10:19       ` Jiri Slaby
@ 2011-04-07 10:19       ` Jiri Slaby
  0 siblings, 0 replies; 98+ messages in thread
From: Jiri Slaby @ 2011-04-07 10:19 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, Changli Gao, Andrew Morton, linux-mm, Eric Dumazet,
	linux-fsdevel, Jiri Slaby

Cc'ed a few people.

Also, the series which introduced this was discussed at:
http://lkml.org/lkml/2010/5/3/53

On 04/07/2011 12:01 PM, azurIt wrote:
> 
> I have finally completed bisection, here are the results:
> 
> 
> 
> a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6 is first bad commit
> commit a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6
> Author: Changli Gao <xiaosuo@gmail.com>
> Date:   Tue Aug 10 18:01:35 2010 -0700
> 
>     vfs: use kmalloc() to allocate fdmem if possible
>    
>     Use kmalloc() to allocate fdmem if possible.
>    
>     vmalloc() is used as a fallback solution for fdmem allocation.  A new
>     helper function __free_fdtable() is introduced to reduce the lines of
>     code.
>    
>     A potential bug, vfree() a memory allocated by kmalloc(), is fixed.
>    
>     [akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
>     Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>     Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>     Cc: Jiri Slaby <jslaby@suse.cz>
>     Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>     Cc: Alexey Dobriyan <adobriyan@gmail.com>
>     Cc: Ingo Molnar <mingo@elte.hu>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Avi Kivity <avi@redhat.com>
>     Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> :040000 040000 a7b3997bc754f573b4a309cda1a0774ea95c235e 4241a4f2115c60e5c1dc1879c85c9911fa077807 M      fs
> 
> 
> 
> 
> 
>  
>  ______________________________________________________________
>  > From: "Greg KH" <greg@kroah.com>
>  > To: azurIt <azurit@pobox.sk>
>  > Date: 17.03.2011 01:15
>  > Subject: Re: Regression from 2.6.36
>  >
>  > CC: linux-kernel@vger.kernel.org On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote: 
>  >  
>  > Hi, 
>  >  
>  > we are successfully running several very busy web servers on 2.6.32.* and 
>  > few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup). 
>  > I installed 2.6.37.2 on one of the servers and very strange things started to 
>  > happen with Apache web server. 
>  >  
>  > We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing 
>  > lots of 'fork' and lots of 'setuid'. I have also noticed that problem is 
>  > happening only on very busy servers. 
>  >  
>  > Everything is ok when Apache is started but as time is passing by, its 'root' 
>  > processes (Apache processes running under root) are consuming more and more CPU. 
>  > Finally, the whole server becames very unstable and Apache must be restarted. 
>  > This is repeating until the load on web sites is much lower (usually on 22:00). 
>  > Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again, 
>  > depends on load on web sites). Here is the graph of CPU utilization showing the 
>  > problem (red color), Apache was REstarted at 8:11 and 9:35: 
>  > http://watchdog.sk/lkml/cpu-problem.png 
>  >  
>  > Here is how it looks on htop: 
>  > http://watchdog.sk/lkml/htop.jpg 
>  >  
>  > And finally here is how it looks with older kernels (yes, when i install older 
>  > kernel, problem is gone), notice also that I/O wait is much lower and nicer 
>  > (blue color): 
>  > http://watchdog.sk/lkml/cpu-ok.png 
>  >  
>  > I was also strace-ing Apache processes which were doing problems, here it is: 
>  > http://watchdog.sk/lkml/strace.txt 
>  >  
>  > I'm not 100% sure but I think that CPU was consumed on 'futex' lines. 
>  >  
>  > I tried several kernel versions and find out that everything BEFORE 2.6.36 is 
>  > NOT affected and everything AFTER 2.6.36 (included) is affected. 
>  >  
>  > Versions which I tried and were NOT affected by this problem: 
>  > 2.6.32.* 
>  > 2.6.35.11 
>  >  
>  > Versions which I tried and were affected by this problem: 
>  > 2.6.36 
>  > 2.6.36.4 
>  > 2.6.37.2 
>  > 2.6.37.3 
>  > 2.6.38-rc8 (final version was not released yet) 
>  >  
>  > All tests were made on vanilla kernels on Debian Lenny with this config: 
>  > http://watchdog.sk/lkml/config 
>  >  
>  > Do you need any other information from me ? I'm able to try other versions or 
>  > patches but, please, take into account that I have to do this on _production_ 
>  > server (I failed to reproduce it in testing environment). Also, I'm able to try 
>  > only one kernel per day. 
>  
>  Ick, one kernel per day might make this a bit difficult, but if there 
>  was any way you could use 'git bisect' to try to narrow this down to the 
>  patch that caused this problem, it would be great. 
>  
>  You can mark 2.6.35 as working and 2.6.36 as bad and git will go from 
>  there and try to offer you different chances to find the problem. 
>  
>  thanks, 
>  
>  greg k-h

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 10:19       ` Jiri Slaby
@ 2011-04-07 11:21         ` Américo Wang
  0 siblings, 0 replies; 98+ messages in thread
From: Américo Wang @ 2011-04-07 11:21 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: azurIt, linux-kernel, Changli Gao, Andrew Morton, linux-mm,
	Eric Dumazet, linux-fsdevel, Jiri Slaby

On Thu, Apr 7, 2011 at 6:19 PM, Jiri Slaby <jslaby@suse.cz> wrote:
> Cced few people.
>
> Also the series which introduced this were discussed at:
> http://lkml.org/lkml/2010/5/3/53
>

I guess this is because lots of fdtables are now allocated with kmalloc(),
not vmalloc(), and we kfree() them in the RCU callback.

How about deferring all of the freeing to the workqueue? Though I think
this may hurt performance.

Anyway, something like the patch below... does it make sense?

Not-yet-signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>

---
diff --git a/fs/file.c b/fs/file.c
index 0be3447..34dc355 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -96,20 +96,14 @@ void free_fdtable_rcu(struct rcu_head *rcu)
                                container_of(fdt, struct files_struct, fdtab));
                return;
        }
-       if (!is_vmalloc_addr(fdt->fd) && !is_vmalloc_addr(fdt->open_fds)) {
-               kfree(fdt->fd);
-               kfree(fdt->open_fds);
-               kfree(fdt);
-       } else {
-               fddef = &get_cpu_var(fdtable_defer_list);
-               spin_lock(&fddef->lock);
-               fdt->next = fddef->next;
-               fddef->next = fdt;
-               /* vmallocs are handled from the workqueue context */
-               schedule_work(&fddef->wq);
-               spin_unlock(&fddef->lock);
-               put_cpu_var(fdtable_defer_list);
-       }
+
+       fddef = &get_cpu_var(fdtable_defer_list);
+       spin_lock(&fddef->lock);
+       fdt->next = fddef->next;
+       fddef->next = fdt;
+       schedule_work(&fddef->wq);
+       spin_unlock(&fddef->lock);
+       put_cpu_var(fdtable_defer_list);
 }

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 11:21         ` Américo Wang
  (?)
@ 2011-04-07 11:57           ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-07 11:57 UTC (permalink / raw)
  To: Américo Wang
  Cc: Jiri Slaby, azurIt, linux-kernel, Changli Gao, Andrew Morton,
	linux-mm, linux-fsdevel, Jiri Slaby

On Thursday 7 April 2011 at 19:21 +0800, Américo Wang wrote:
> On Thu, Apr 7, 2011 at 6:19 PM, Jiri Slaby <jslaby@suse.cz> wrote:
> > Cced few people.
> >
> > Also the series which introduced this were discussed at:
> > http://lkml.org/lkml/2010/5/3/53


> >
> 
> I guess this is due to that lots of fdt are allocated by kmalloc(),
> not vmalloc(), and we kfree() them in rcu callback.
> 
> How about deferring all of the removal to workqueue? This may
> hurt performance I think.
> 
> Anyway, like the patch below... makes sense?
> 
> Not-yet-signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
> 
> ---
> diff --git a/fs/file.c b/fs/file.c
> index 0be3447..34dc355 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -96,20 +96,14 @@ void free_fdtable_rcu(struct rcu_head *rcu)
>                                 container_of(fdt, struct files_struct, fdtab));
>                 return;
>         }
> -       if (!is_vmalloc_addr(fdt->fd) && !is_vmalloc_addr(fdt->open_fds)) {
> -               kfree(fdt->fd);
> -               kfree(fdt->open_fds);
> -               kfree(fdt);
> -       } else {
> -               fddef = &get_cpu_var(fdtable_defer_list);
> -               spin_lock(&fddef->lock);
> -               fdt->next = fddef->next;
> -               fddef->next = fdt;
> -               /* vmallocs are handled from the workqueue context */
> -               schedule_work(&fddef->wq);
> -               spin_unlock(&fddef->lock);
> -               put_cpu_var(fdtable_defer_list);
> -       }
> +
> +       fddef = &get_cpu_var(fdtable_defer_list);
> +       spin_lock(&fddef->lock);
> +       fdt->next = fddef->next;
> +       fddef->next = fdt;
> +       schedule_work(&fddef->wq);
> +       spin_unlock(&fddef->lock);
> +       put_cpu_var(fdtable_defer_list);
>  }


Nope, this makes no sense at all.

It's probably the other way around: we want to free those blocks ASAP.

A fix would be to make alloc_fdmem() use vmalloc() if size is more than
4 pages, or whatever limit is reached.

We had a similar memory problem in fib_trie in the past: we force a
synchronize_rcu() every XXX Mbytes allocated to make sure we don't have
too much RAM waiting to be freed in RCU queues.








^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 11:57           ` Eric Dumazet
  (?)
@ 2011-04-07 12:13             ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-07 12:13 UTC (permalink / raw)
  To: Américo Wang
  Cc: Jiri Slaby, azurIt, linux-kernel, Changli Gao, Andrew Morton,
	linux-mm, linux-fsdevel, Jiri Slaby

On Thursday 7 April 2011 at 13:57 +0200, Eric Dumazet wrote:

> We had a similar memory problem in fib_trie in the past  : We force a
> synchronize_rcu() every XXX Mbytes allocated to make sure we dont have
> too much ram waiting to be freed in rcu queues.

This was done in commit c3059477fce2d956
(ipv4: Use synchronize_rcu() during trie_rebalance())

It was possible in fib_trie because we hold the RTNL lock, so managing
a counter was free.

In the fs case, we might use a percpu_counter if we really want to
limit the amount of space.

Now, I am not even sure we should care that much; we could just forget
about using high-order pages here.


diff --git a/fs/file.c b/fs/file.c
index 0be3447..7ba26fe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -41,12 +41,6 @@ static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
 
 static inline void *alloc_fdmem(unsigned int size)
 {
-	void *data;
-
-	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
-	if (data != NULL)
-		return data;
-
 	return vmalloc(size);
 }



^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 12:13             ` Eric Dumazet
  (?)
  (?)
@ 2011-04-07 15:27             ` Changli Gao
  2011-04-07 15:36                 ` Eric Dumazet
  2011-04-08 12:25                 ` azurIt
  -1 siblings, 2 replies; 98+ messages in thread
From: Changli Gao @ 2011-04-07 15:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	Andrew Morton, linux-mm, linux-fsdevel, Jiri Slaby

[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]

On Thu, Apr 7, 2011 at 8:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday 7 April 2011 at 13:57 +0200, Eric Dumazet wrote:
>
>> We had a similar memory problem in fib_trie in the past  : We force a
>> synchronize_rcu() every XXX Mbytes allocated to make sure we dont have
>> too much ram waiting to be freed in rcu queues.

I don't think there is too much memory allocated by vmalloc to free.
My patch should reduce the size of the memory allocated by vmalloc().
I think the real problem is that kmalloc always returns memory whose
size is rounded up to 2^n pages, so more memory is used than before.

>
> This was done in commit c3059477fce2d956
> (ipv4: Use synchronize_rcu() during trie_rebalance())
>
> It was possible in fib_trie because we hold RTNL lock, so managing
> a counter was free.
>
> In fs case, we might use a percpu_counter if we really want to limit the
> amount of space.
>
> Now, I am not even sure we should care that much and could just forget
> about this high order pages use.

In normal cases only a few fds are used, so the fdtable isn't larger
than one page, and we should use kmalloc() to reduce the memory cost.
Maybe we should set an upper limit for kmalloc() here. One page?

azurIt, would you please test the attached patch? Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

[-- Attachment #2: x.diff --]
[-- Type: application/octet-stream, Size: 419 bytes --]

diff --git a/fs/file.c b/fs/file.c
index 0be3447..966bf0c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -43,9 +43,11 @@ static inline void *alloc_fdmem(unsigned int size)
 {
 	void *data;
 
-	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
-	if (data != NULL)
-		return data;
+	if (size <= PAGE_SIZE) {
+		data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (data != NULL)
+			return data;
+	}
 
 	return vmalloc(size);
 }

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 15:27             ` Changli Gao
  2011-04-07 15:36                 ` Eric Dumazet
@ 2011-04-07 15:36                 ` Eric Dumazet
  1 sibling, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-07 15:36 UTC (permalink / raw)
  To: Changli Gao
  Cc: Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	Andrew Morton, linux-mm, linux-fsdevel, Jiri Slaby

On Thursday 7 April 2011 at 23:27 +0800, Changli Gao wrote:

> azurlt, would you please test the patch attached? Thanks.
> 

Yes of course, I meant to reverse the patch

(use kmalloc() under PAGE_SIZE, vmalloc() for 'big' allocs)


Don't fall back to vmalloc() if kmalloc() fails.


if (size <= PAGE_SIZE)
	return kmalloc(size, GFP_KERNEL);
else
	return vmalloc(size);




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 15:27             ` Changli Gao
  2011-04-07 15:36                 ` Eric Dumazet
@ 2011-04-08 12:25                 ` azurIt
  1 sibling, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-08 12:25 UTC (permalink / raw)
  To: Changli Gao, Eric Dumazet
  Cc: Américo Wang, Jiri Slaby, linux-kernel, Andrew Morton,
	linux-mm, linux-fsdevel, Jiri Slaby


>azurlt, would you please test the patch attached? Thanks.

This patch fixed the problem; I used 2.6.36.4 for testing. Do you need me to test any other kernel versions or patches?

Thank you very much!


______________________________________________________________
> From: "Changli Gao" <xiaosuo@gmail.com>
> To: Eric Dumazet <eric.dumazet@gmail.com>
> Date: 07.04.2011 17:27
> Subject: Re: Regression from 2.6.36
>
> CC: "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, "Andrew Morton" <akpm@linux-foundation.org>, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Thu, Apr 7, 2011 at 8:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thursday 7 April 2011 at 13:57 +0200, Eric Dumazet wrote:
>>
>>> We had a similar memory problem in fib_trie in the past  : We force a
>>> synchronize_rcu() every XXX Mbytes allocated to make sure we dont have
>>> too much ram waiting to be freed in rcu queues.
>
>I don't think there is too much memory allocated by vmalloc to free.
>My patch should reduce the size of the memory allocated by vmalloc().
>I think the real problem is kfree always returns the memory, whose
>size is aligned to 2^n pages, and more memory are used than before.
>
>>
>> This was done in commit c3059477fce2d956
>> (ipv4: Use synchronize_rcu() during trie_rebalance())
>>
>> It was possible in fib_trie because we hold RTNL lock, so managing
>> a counter was free.
>>
>> In fs case, we might use a percpu_counter if we really want to limit the
>> amount of space.
>>
>> Now, I am not even sure we should care that much and could just forget
>> about this high order pages use.
>
>In normal cases, only a few fds are used, the ftable isn't larger than
>one page, so we should use kmalloc to reduce the memory cost. Maybe we
>should set a upper limit for kmalloc() here. One page?
>
>
>-- 
>Regards,
>Changli Gao(xiaosuo@gmail.com)
>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-07 15:36                 ` Eric Dumazet
@ 2011-04-12 22:49                   ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-12 22:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby

On Thu, 07 Apr 2011 17:36:26 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thursday 7 April 2011 at 23:27 +0800, Changli Gao wrote:
> 
> > azurlt, would you please test the patch attached? Thanks.
> > 
> 
> Yes of course, I meant to reverse the patch
> 
> (use kmalloc() under PAGE_SIZE, vmalloc() for 'big' allocs)
> 
> 
> Dont fallback to vmalloc if kmalloc() fails.
> 
> 
> if (size <= PAGE_SIZE)
> 	return kmalloc(size, GFP_KERNEL);
> else
> 	return vmalloc(size);
> 

It's somewhat unclear (to me) what caused this regression.

Is it because the kernel is now doing large kmalloc()s for the fdtable,
and this makes the page allocator go nuts trying to satisfy high-order
page allocation requests?

Is it because the kernel now will usually free the fdtable
synchronously within the rcu callback, rather than deferring this to a
workqueue?

The latter seems unlikely, so I'm thinking this was a case of
high-order-allocations-considered-harmful?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-12 22:49                   ` Andrew Morton
@ 2011-04-13  1:23                     ` Changli Gao
  -1 siblings, 0 replies; 98+ messages in thread
From: Changli Gao @ 2011-04-13  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> It's somewhat unclear (to me) what caused this regression.
>
> Is it because the kernel is now doing large kmalloc()s for the fdtable,
> and this makes the page allocator go nuts trying to satisfy high-order
> page allocation requests?
>
> Is it because the kernel now will usually free the fdtable
> synchronously within the rcu callback, rather than deferring this to a
> workqueue?
>
> The latter seems unlikely, so I'm thinking this was a case of
> high-order-allocations-considered-harmful?
>

Maybe, but I am not sure. Perhaps my patch causes too much internal
fragmentation. For example, when asking for 5 pages we get 8 pages, so 3
pages are wasted, and eventually memory thrashing sets in.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-13  1:23                     ` Changli Gao
@ 2011-04-13  1:31                       ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-13  1:31 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:

> On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > It's somewhat unclear (to me) what caused this regression.
> >
> > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > and this makes the page allocator go nuts trying to satisfy high-order
> > page allocation requests?
> >
> > Is it because the kernel now will usually free the fdtable
> > synchronously within the rcu callback, rather than deferring this to a
> > workqueue?
> >
> > The latter seems unlikely, so I'm thinking this was a case of
> > high-order-allocations-considered-harmful?
> >
> 
> Maybe, but I am not sure. Perhaps my patch causes too much internal
> fragmentation. For example, when asking for 5 pages we get 8 pages, so 3
> pages are wasted, and eventually memory thrashing sets in.

That theory sounds less likely, but could be tested by using
alloc_pages_exact().


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-13  1:31                       ` Andrew Morton
  (?)
@ 2011-04-13  2:37                         ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-13  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby

On Tuesday 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:
> 
> > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > It's somewhat unclear (to me) what caused this regression.
> > >
> > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > and this makes the page allocator go nuts trying to satisfy high-order
> > > page allocation requests?
> > >
> > > Is it because the kernel now will usually free the fdtable
> > > synchronously within the rcu callback, rather than deferring this to a
> > > workqueue?
> > >
> > > The latter seems unlikely, so I'm thinking this was a case of
> > > high-order-allocations-considered-harmful?
> > >
> > 
> > Maybe, but I am not sure. Perhaps my patch causes too much internal
> > fragmentation. For example, when asking for 5 pages we get 8 pages, so 3
> > pages are wasted, and eventually memory thrashing sets in.
> 
> That theory sounds less likely, but could be tested by using
> alloc_pages_exact().
> 

Very unlikely, since fdtable sizes are powers of two, unless you hit
sysctl_nr_open and it was changed (default value being 2^20)





^ permalink raw reply	[flat|nested] 98+ messages in thread

* Regarding memory fragmentation using malloc....
  2011-04-13  2:37                         ` Eric Dumazet
@ 2011-04-13  6:54                           ` Pintu Agarwal
  -1 siblings, 0 replies; 98+ messages in thread
From: Pintu Agarwal @ 2011-04-13  6:54 UTC (permalink / raw)
  To: Andrew Morton, Eric Dumazet
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby

Dear All,

I am trying to understand how memory fragmentation occurs in Linux when making many malloc() calls.
I am trying to reproduce the page fragmentation problem in Linux 2.6.29.x on a Linux mobile device (without swap) using a small malloc() (in a loop) test program with a BLOCK_SIZE of (64*(4*K)),
and then monitoring the page changes in /proc/buddyinfo after each operation.
From the output I can see that the page counts under buddyinfo keep changing, but I am not able to relate these changes to my malloc() BLOCK_SIZE.
I mean, with my BLOCK_SIZE of (2^6 x 4K ==> 2^6 pages), the 2^6 column under /proc/buddyinfo should change. But this is not the actual behaviour:
whatever the block size is, buddyinfo changes only for 2^0, 2^1, 2^2 or 2^3.

I am trying to measure the level of fragmentation after each page allocation.
Can somebody explain to me in detail how /proc/buddyinfo actually changes after each allocation and deallocation?


Thanks,
Pintu



      

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-13  6:54                           ` Pintu Agarwal
@ 2011-04-13 11:44                             ` Américo Wang
  -1 siblings, 0 replies; 98+ messages in thread
From: Américo Wang @ 2011-04-13 11:44 UTC (permalink / raw)
  To: Pintu Agarwal
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Wed, Apr 13, 2011 at 2:54 PM, Pintu Agarwal <pintu_agarwal@yahoo.com> wrote:
> Dear All,
>
> I am trying to understand how memory fragmentation occurs in Linux when making many malloc() calls.
> I am trying to reproduce the page fragmentation problem in Linux 2.6.29.x on a Linux mobile device (without swap) using a small malloc() (in a loop) test program with a BLOCK_SIZE of (64*(4*K)),
> and then monitoring the page changes in /proc/buddyinfo after each operation.
> From the output I can see that the page counts under buddyinfo keep changing, but I am not able to relate these changes to my malloc() BLOCK_SIZE.
> I mean, with my BLOCK_SIZE of (2^6 x 4K ==> 2^6 pages), the 2^6 column under /proc/buddyinfo should change. But this is not the actual behaviour:
> whatever the block size is, buddyinfo changes only for 2^0, 2^1, 2^2 or 2^3.
>
> I am trying to measure the level of fragmentation after each page allocation.
> Can somebody explain to me in detail how /proc/buddyinfo actually changes after each allocation and deallocation?
>

What malloc() sees is the virtual memory of the process, while buddyinfo
shows physical memory pages.

When you malloc() 64K of memory, the kernel may not allocate 64K of
physical memory for you all at once.

Thanks.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-13 11:44                             ` Américo Wang
@ 2011-04-13 13:56                               ` Pintu Agarwal
  -1 siblings, 0 replies; 98+ messages in thread
From: Pintu Agarwal @ 2011-04-13 13:56 UTC (permalink / raw)
  To: Américo Wang
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

Hi,

My requirement is to measure the memory fragmentation level in Linux kernel 2.6.29 (ARM Cortex-A8, without swap).
How can I measure the fragmentation level (percentage) from /proc/buddyinfo?

Example: after each page allocation operation, I need to measure the fragmentation level. If the level is above 80%, I will trigger an OOM or report something to the user.
How can I reproduce this memory fragmentation scenario using a sample program?

Here is my sample program: (to check page allocation using malloc)
----------------------------------------------
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<errno.h>
#include<unistd.h>


#define PAGE_SIZE       (4*1024)

#define MEM_BLOCK       (64*PAGE_SIZE)


#define MAX_LIMIT       (16)

int main()
{
        char *ptr[MAX_LIMIT+1] = {NULL,};
        int i = 0;

        printf("Requesting <%d> blocks of memory of block size <%d>........\n",MAX_LIMIT,MEM_BLOCK);
        system("cat /proc/buddyinfo");
        system("cat /proc/zoneinfo | grep free_pages");
        printf("*****************************************\n\n\n");
        for(i=0; i<MAX_LIMIT; i++)
        {
                ptr[i] = (char *)malloc(sizeof(char)*MEM_BLOCK);
                if(ptr[i] == NULL)
                {
                        printf("ERROR : malloc failed(counter %d) <%s>\n",i,strerror(errno));
                        system("cat /proc/buddyinfo");
                        system("cat /proc/zoneinfo | grep free_pages");
                        printf("press any key to terminate......");
                        getchar();
                        exit(0);
                }
                memset(ptr[i],1,MEM_BLOCK);
                sleep(1);
                //system("cat /proc/buddyinfo");
                //system("cat /proc/zoneinfo | grep free_pages");
                //printf("-----------------------------------------\n");
        }

        sleep(1);
        system("cat /proc/buddyinfo");
        system("cat /proc/zoneinfo | grep free_pages");
        printf("-----------------------------------------\n");

        printf("press any key to end......");
        getchar();

        for(i=0; i<MAX_LIMIT; i++)
        {
                if(ptr[i] != NULL)
                {
                        free(ptr[i]);
                }
        }

        printf("DONE !!!\n");

        return 0;
}
EACH BLOCK SIZE = 64 Pages ==> (64 * 4 * 1024)
TOTAL BLOCKS = 16
----------------------------------------------
In my linux2.6.29 ARM machine, the initial /proc/buddyinfo shows the following:
Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
Node 1, zone      DMA     15    320    423    225     97     26      1      0      0      0      0      0

After running my sample program (with 16 iterations) the buddyinfo output is as follows:
Requesting <16> blocks of memory of block size <262144>........
Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
Node 1, zone      DMA     15    301    419    224     96     27      1      0      0      0      0      0
    nr_free_pages 169
    nr_free_pages 6545
*****************************************


Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
Node 1, zone      DMA     18      2    305    226     96     27      1      0      0      0      0      0
    nr_free_pages 169
    nr_free_pages 5514
-----------------------------------------

The requested block size is 64 pages (2^6) for each block.
But if we look at the output after 16 iterations, buddyinfo shows pages taken only from Node 1, orders (2^0, 2^1, 2^2, 2^3).
But the actual allocation should happen from the (2^6) column in buddyinfo.

Questions:
1) How to analyse buddyinfo based on each page block size?
2) How, and in what scenarios, does buddyinfo change?
3) Can we rely completely on buddyinfo for measuring the level of fragmentation?

Can somebody throw some more light on this?

Thanks,
Pintu

--- On Wed, 4/13/11, Américo Wang <xiyou.wangcong@gmail.com> wrote:

> From: Américo Wang <xiyou.wangcong@gmail.com>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Pintu Agarwal" <pintu_agarwal@yahoo.com>
> Cc: "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, "azurIt" <azurit@pobox.sk>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
> Date: Wednesday, April 13, 2011, 6:44 AM
> On Wed, Apr 13, 2011 at 2:54 PM,
> Pintu Agarwal <pintu_agarwal@yahoo.com>
> wrote:
> > Dear All,
> >
> > I am trying to understand how memory fragmentation
> occurs in linux using many malloc calls.
> > I am trying to reproduce the page fragmentation
> problem in linux 2.6.29.x on a linux mobile(without Swap)
> using a small malloc(in loop) test program of BLOCK_SIZE
> (64*(4*K)).
> > And then monitoring the page changes in
> /proc/buddyinfo after each operation.
> > From the output I can see that the page values under
> buddyinfo keeps changing. But I am not able to relate these
> changes with my malloc BLOCK_SIZE.
> > I mean with my BLOCK_SIZE of (2^6 x 4K ==> 2^6
> PAGES) the 2^6 th block under /proc/buddyinfo should change.
> But this is not the actual behaviour.
> > Whatever is the blocksize, the buddyinfo changes only
> for 2^0 or 2^1 or 2^2 or 2^3.
> >
> > I am trying to measure the level of fragmentation
> after each page allocation.
> > Can somebody explain me in detail, how actually
> /proc/buddyinfo changes after each allocation and
> deallocation.
> >
> 
> What malloc() sees is virtual memory of the process, while
> buddyinfo
> shows physical memory pages.
> 
> When you malloc() 64K memory, the kernel may not allocate a
> 64K
> physical memory at one time
> for you.
> 
> Thanks.
> 


      

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-13 13:56                               ` Pintu Agarwal
@ 2011-04-13 15:25                                 ` Michal Nazarewicz
  -1 siblings, 0 replies; 98+ messages in thread
From: Michal Nazarewicz @ 2011-04-13 15:25 UTC (permalink / raw)
  To: Américo Wang, Pintu Agarwal
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Wed, 13 Apr 2011 15:56:00 +0200, Pintu Agarwal  
<pintu_agarwal@yahoo.com> wrote:
> My requirement is, I wanted to measure memory fragmentation level in  
> linux kernel2.6.29 (ARM cortex A8 without swap).
> How can I measure fragmentation level(percentage) from /proc/buddyinfo ?

[...]

> In my linux2.6.29 ARM machine, the initial /proc/buddyinfo shows the  
> following:
> Node 0, zone      DMA     17     22      1      1      0      1       
> 1      0      0      0      0      0
> Node 1, zone      DMA     15    320    423    225     97     26       
> 1      0      0      0      0      0
>
> After running my sample program (with 16 iterations) the buddyinfo  
> output is as follows:
> Requesting <16> blocks of memory of block size <262144>........
> Node 0, zone      DMA     17     22      1      1      0      1       
> 1      0      0      0      0      0
> Node 1, zone      DMA     15    301    419    224     96     27       
> 1      0      0      0      0      0
>     nr_free_pages 169
>     nr_free_pages 6545
> *****************************************
>
>
> Node 0, zone      DMA     17     22      1      1      0      1       
> 1      0      0      0      0      0
> Node 1, zone      DMA     18      2    305    226     96     27       
> 1      0      0      0      0      0
>     nr_free_pages 169
>     nr_free_pages 5514
> -----------------------------------------
>
> The requested block size is 64 pages (2^6) for each block.
> But if we see the output after 16 iterations the buddyinfo allocates  
> pages only from Node 1 , (2^0, 2^1, 2^2, 2^3).
> But the actual allocation should happen from (2^6) block in buddyinfo.

No.  When you call malloc(), only virtual address space is allocated.  The
actual allocation of physical memory occurs when user space accesses the
memory (either reads or writes), and it happens one page at a time.

In fact, if the system has only a limited number of 0-order pages and user
space allocates a 64-page block and then accesses it, what really happens is
that the kernel hands out 0-order pages and, once it runs out of those,
splits a 1-order page into two 0-order pages and takes one of them.

Because of the MMU, fragmentation of physical memory is not an issue for
normal user space programs.

It becomes an issue only when you deal with hardware that has neither an MMU
nor support for scatter-gather DMA, or with some large kernel structures.

/proc/buddyinfo tells you how many free blocks of each order there are in
the system.  You may interpret it such that the larger the number of
low-order blocks, the greater the fragmentation of physical memory.  If
there were no fragmentation (for some definition of the term), you'd see
only highest-order blocks and at most one block of each lower order.

Again though, this fragmentation is not an issue for user space programs.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: Regression from 2.6.36
  2011-04-13  2:37                         ` Eric Dumazet
@ 2011-04-13 21:16                           ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-13 21:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

On Wed, 13 Apr 2011 04:37:36 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tuesday 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
> > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:
> > 
> > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > > <akpm@linux-foundation.org> wrote:
> > > >
> > > > It's somewhat unclear (to me) what caused this regression.
> > > >
> > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > > and this makes the page allocator go nuts trying to satisfy high-order
> > > > page allocation requests?
> > > >
> > > > Is it because the kernel now will usually free the fdtable
> > > > synchronously within the rcu callback, rather than deferring this to a
> > > > workqueue?
> > > >
> > > > The latter seems unlikely, so I'm thinking this was a case of
> > > > high-order-allocations-considered-harmful?
> > > >
> > > 
> > > Maybe, but I am not sure. Maybe my patch causes too many inner
> > > fragments. For example, when asking for 5 pages, get 8 pages, and 3
> > > pages are wasted, then memory thrash happens finally.
> > 
> > That theory sounds less likely, but could be tested by using
> > alloc_pages_exact().
> > 
> 
> Very unlikely, since fdtable sizes are powers of two, unless you hit
> sysctl_nr_open and it was changed (default value being 2^20)
> 

So am I correct in believing that this regression is due to the
high-order allocations putting excess stress onto page reclaim?

If so, then how large _are_ these allocations?  This perhaps can be
determined from /proc/slabinfo.  They must be pretty huge, because slub
likes to do excessively-large allocations and the system handles that
reasonably well.

I suppose that a suitable fix would be


From: Andrew Morton <akpm@linux-foundation.org>

Azurit reports large increases in system time after 2.6.36 when running
Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
to allocate fdmem if possible").

That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.

Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.

Reported-by: azurIt <azurit@pobox.sk>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/file.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff -puN fs/file.c~a fs/file.c
--- a/fs/file.c~a
+++ a/fs/file.c
@@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
  */
 static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
 
-static inline void *alloc_fdmem(unsigned int size)
+static void *alloc_fdmem(unsigned int size)
 {
-	void *data;
-
-	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
-	if (data != NULL)
-		return data;
-
+	/*
+	 * Very large allocations can stress page reclaim, so fall back to
+	 * vmalloc() if the allocation size will be considered "large" by the VM.
+	 */
+	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
+		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (data != NULL)
+			return data;
+	}
 	return vmalloc(size);
 }
 
_


^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: Regression from 2.6.36
  2011-04-13 21:16                           ` Andrew Morton
@ 2011-04-13 21:24                             ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-13 21:24 UTC (permalink / raw)
  To: Eric Dumazet, Changli Gao, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

On Wed, 13 Apr 2011 14:16:00 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

>  fs/file.c |   17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)

bah, stupid compiler.


--- a/fs/file.c~vfs-avoid-large-kmallocs-for-the-fdtable
+++ a/fs/file.c
@@ -9,6 +9,7 @@
 #include <linux/module.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/mmzone.h>
 #include <linux/time.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -39,14 +40,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
  */
 static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
 
-static inline void *alloc_fdmem(unsigned int size)
+static void *alloc_fdmem(unsigned int size)
 {
-	void *data;
-
-	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
-	if (data != NULL)
-		return data;
-
+	/*
+	 * Very large allocations can stress page reclaim, so fall back to
+	 * vmalloc() if the allocation size will be considered "large" by the VM.
+	 */
+	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
+		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (data != NULL)
+			return data;
+	}
 	return vmalloc(size);
 }
 
_


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-13 21:16                           ` Andrew Morton
@ 2011-04-13 21:44                             ` David Rientjes
  -1 siblings, 0 replies; 98+ messages in thread
From: David Rientjes @ 2011-04-13 21:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Changli Gao, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman,
	Christoph Lameter

On Wed, 13 Apr 2011, Andrew Morton wrote:

> Azurit reports large increases in system time after 2.6.36 when running
> Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
> 
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
> 
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Cc: Changli Gao <xiaosuo@gmail.com>
> Cc: Americo Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/file.c |   17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>   */
>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>  
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
>  {
> -	void *data;
> -
> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> -	if (data != NULL)
> -		return data;
> -
> +	/*
> +	 * Very large allocations can stress page reclaim, so fall back to
> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> +	 */
> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> +		if (data != NULL)
> +			return data;
> +	}
>  	return vmalloc(size);
>  }
>  

It's a shame that we can't at least try kmalloc() with sufficiently large 
sizes by doing something like

	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN;

	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
		flags |= GFP_KERNEL;
	data = kmalloc(size, flags);
	if (data)
		return data;
	return vmalloc(size);

which would at least attempt to use the slab allocator.

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: Regression from 2.6.36
  2011-04-13 21:44                             ` David Rientjes
@ 2011-04-13 21:54                               ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-13 21:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Eric Dumazet, Changli Gao, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman,
	Christoph Lameter

On Wed, 13 Apr 2011 14:44:16 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > -static inline void *alloc_fdmem(unsigned int size)
> > +static void *alloc_fdmem(unsigned int size)
> >  {
> > -	void *data;
> > -
> > -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > -	if (data != NULL)
> > -		return data;
> > -
> > +	/*
> > +	 * Very large allocations can stress page reclaim, so fall back to
> > +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> > +	 */
> > +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> > +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > +		if (data != NULL)
> > +			return data;
> > +	}
> >  	return vmalloc(size);
> >  }
> >  
> 
> It's a shame that we can't at least try kmalloc() with sufficiently large 
> sizes by doing something like
> 
> 	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN;
> 
> 	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> 		flags |= GFP_KERNEL;
> 	data = kmalloc(size, flags);
> 	if (data)
> 		return data;
> 	return vmalloc(size);
> 
> which would at least attempt to use the slab allocator.

Maybe.  If the fdtable is that huge then the fork() is probably going
to be pretty slow anyway.  And the large allocation might cause
depletion of high-order free pages and might cause fragmentation of
even-higher-order pages by splitting them up. </handwaving>

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: Regression from 2.6.36
  2011-04-13 21:16                           ` Andrew Morton
  (?)
@ 2011-04-14  2:10                             ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-14  2:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

On Wednesday 13 April 2011 at 14:16 -0700, Andrew Morton wrote:

> So am I correct in believing that this regression is due to the
> high-order allocations putting excess stress onto page reclaim?
> 

I suppose so.

> If so, then how large _are_ these allocations?  This perhaps can be
> determined from /proc/slabinfo.  They must be pretty huge, because slub
> likes to do excessively-large allocations and the system handles that
> reasonably well.
> 
> I suppose that a suitable fix would be
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> 
> Azurit reports large increases in system time after 2.6.36 when running
> Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
> 
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
> 
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Cc: Changli Gao <xiaosuo@gmail.com>
> Cc: Americo Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/file.c |   17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>   */
>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>  
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
>  {
> -	void *data;
> -
> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> -	if (data != NULL)
> -		return data;
> -
> +	/*
> +	 * Very large allocations can stress page reclaim, so fall back to
> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> +	 */
> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> +		if (data != NULL)
> +			return data;
> +	}
>  	return vmalloc(size);
>  }
>  
> _
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

#define PAGE_ALLOC_COSTLY_ORDER 3

On x86_64, this means we try kmalloc() up to 4096 files in fdtable.

Thanks !



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
@ 2011-04-14  2:10                             ` Eric Dumazet
  0 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-14  2:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

Le mercredi 13 avril 2011 à 14:16 -0700, Andrew Morton a écrit :

> So am I correct in believing that this regression is due to the
> high-order allocations putting excess stress onto page reclaim?
> 

I suppose so.

> If so, then how large _are_ these allocations?  This perhaps can be
> determined from /proc/slabinfo.  They must be pretty huge, because slub
> likes to do excessively-large allocations and the system handles that
> reasonably well.
> 
> I suppose that a suitable fix would be
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> 
> Azurit reports large increases in system time after 2.6.36 when running
> Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
> 
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
> 
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Cc: Changli Gao <xiaosuo@gmail.com>
> Cc: Americo Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/file.c |   17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>   */
>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>  
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
>  {
> -	void *data;
> -
> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> -	if (data != NULL)
> -		return data;
> -
> +	/*
> +	 * Very large allocations can stress page reclaim, so fall back to
> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> +	 */
> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> +		if (data != NULL)
> +			return data;
> +	}
>  	return vmalloc(size);
>  }
>  
> _
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

#define PAGE_ALLOC_COSTLY_ORDER 3

On x86_64, this means we try kmalloc() up to 4096 files in fdtable.

Thanks !
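Eric's figure can be checked with simple arithmetic. The sketch below uses the x86_64 values quoted above (4 KiB pages, PAGE_ALLOC_COSTLY_ORDER of 3) plus an 8-byte `struct file *` slot per descriptor; treat it as an illustration of the quoted message, not as kernel code:

```python
# Back-of-the-envelope check of "kmalloc() up to 4096 files":
# sizes up to PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER still use kmalloc(),
# anything larger falls back to vmalloc() in the patched alloc_fdmem().
PAGE_SIZE = 4096                 # x86_64 base page size
PAGE_ALLOC_COSTLY_ORDER = 3      # the kernel constant quoted above
BYTES_PER_FD_SLOT = 8            # sizeof(struct file *) on x86_64

kmalloc_limit = PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER
max_kmalloc_fds = kmalloc_limit // BYTES_PER_FD_SLOT

print(kmalloc_limit)    # 32768 bytes, i.e. an order-3 allocation
print(max_kmalloc_fds)  # 4096 descriptors before vmalloc() kicks in
```

Anything past that 32 KiB boundary is exactly what the VM treats as a "costly" high-order allocation.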


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-14  2:10                             ` Eric Dumazet
@ 2011-04-14  5:28                               ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-14  5:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > --- a/fs/file.c~a
> > +++ a/fs/file.c
> > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> >   */
> >  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
> >  
> > -static inline void *alloc_fdmem(unsigned int size)
> > +static void *alloc_fdmem(unsigned int size)
> >  {
> > -	void *data;
> > -
> > -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > -	if (data != NULL)
> > -		return data;
> > -
> > +	/*
> > +	 * Very large allocations can stress page reclaim, so fall back to
> > +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> > +	 */
> > +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > +		if (data != NULL)
> > +			return data;
> > +	}
> >  	return vmalloc(size);
> >  }
> >  
> > _
> > 
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> #define PAGE_ALLOC_COSTLY_ORDER 3
> 
> On x86_64, this means we try kmalloc() up to 4096 files in fdtable.

Thanks.  I added the cc:stable to the changelog.

It'd be nice to get this tested if poss, to confirm that it actually
fixes things.

Also, Melpoke.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-14  5:28                               ` Andrew Morton
  (?)
@ 2011-04-14  6:31                                 ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-14  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, azurIt, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

On Wednesday 13 April 2011 at 22:28 -0700, Andrew Morton wrote:
> On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > --- a/fs/file.c~a
> > > +++ a/fs/file.c
> > > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> > >   */
> > >  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
> > >  
> > > -static inline void *alloc_fdmem(unsigned int size)
> > > +static void *alloc_fdmem(unsigned int size)
> > >  {
> > > -	void *data;
> > > -
> > > -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > > -	if (data != NULL)
> > > -		return data;
> > > -
> > > +	/*
> > > +	 * Very large allocations can stress page reclaim, so fall back to
> > > +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> > > +	 */
> > > +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > > +		if (data != NULL)
> > > +			return data;
> > > +	}
> > >  	return vmalloc(size);
> > >  }
> > >  
> > > _
> > > 
> > 
> > Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> > 
> > #define PAGE_ALLOC_COSTLY_ORDER 3
> > 
> > On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
> 
> Thanks.  I added the cc:stable to the changelog.
> 
> It'd be nice to get this tested if poss, to confrm that it actually
> fixes things.
> 
> Also, Melpoke.

Azurit, could you check how many fds are opened by your apache servers ?
(must be related to number of virtual hosts / access_log / error_log
files)

Pick one pid from ps list
ps aux | grep apache

ls /proc/{pid_of_one_apache}/fd | wc -l

or

lsof -p {pid_of_one_apache} | tail -n 2
apache2 8501 httpadm   13w   REG     104,7  2350407   3866638 /data/logs/httpd/rewrites.log
apache2 8501 httpadm   14r  0000      0,10        0 263148343 eventpoll

Here it's "14"

Thanks
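The same count can also be taken programmatically. A small sketch (assumes a Linux /proc filesystem; it mirrors Eric's `ls /proc/<pid>/fd | wc -l` shell command):

```python
import os

def count_fds(pid="self"):
    """Return the number of open file descriptors of a process,
    equivalent to `ls /proc/<pid>/fd | wc -l`."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# e.g. count_fds(8501) for the Apache child in the lsof output above;
# "self" inspects the current process.
print(count_fds())
```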



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-13 15:25                                 ` Michal Nazarewicz
@ 2011-04-14  6:44                                   ` Pintu Agarwal
  -1 siblings, 0 replies; 98+ messages in thread
From: Pintu Agarwal @ 2011-04-14  6:44 UTC (permalink / raw)
  To: Américo Wang, Michal Nazarewicz
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

Thanks Mr. Michal for all your comments :)

From your comments I understand that malloc from user space will not have much impact on memory fragmentation.
Will memory fragmentation become visible if I do kmalloc from a kernel module instead?

> No.  When you call malloc() only virtual address space is allocated.
> The actual allocation of physical space occurs when user space
> accesses the memory (either reads or writes) and it happens page at a
> time.

Here, if I do memset then I am accessing the memory, right? I am already doing that in my sample program.

> what really happens is that kernel allocates the 0-order pages and
> when it runs out of those, splits a 1-order page into two 0-order
> pages and takes one of those.

Actually, if I understand the buddy allocator, it allocates pages from top to bottom. That means it checks for the highest order block that can be allocated; if possible it allocates pages from that block, otherwise it splits the next highest block into two.
What will happen if the next higher blocks are all empty?
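The splitting behaviour under discussion, including the case where all higher orders are empty, can be sketched with a toy model (an illustration only, not kernel code; the free lists are reduced to per-order counters):

```python
def buddy_alloc(order, free_lists, max_order=10):
    """Toy buddy allocation: if no free block of the requested order
    exists, take the nearest larger block and split it, leaving one
    free buddy behind at every order on the way down."""
    o = order
    while o <= max_order and free_lists.get(o, 0) == 0:
        o += 1                      # climb until a non-empty order is found
    if o > max_order:
        return False                # all higher orders empty: allocation fails
    free_lists[o] -= 1
    while o > order:                # split downward, one buddy freed per level
        o -= 1
        free_lists[o] = free_lists.get(o, 0) + 1
    return True

# One order-0 request against a pool holding a single order-2 block:
pool = {0: 0, 1: 0, 2: 1}
buddy_alloc(0, pool)
print(pool)   # {0: 1, 1: 1, 2: 0}: the order-2 block was split twice
```

If even the highest order is exhausted, the function returns False, which corresponds to the allocation failing (in the real kernel, reclaim or compaction would be invoked at that point).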


Is memory fragmentation always caused by kernel-space code, and never by user space?


Can you provide me with some references for mitigating memory fragmentation in Linux?



Thanks,
Pintu


--- On Wed, 4/13/11, Michal Nazarewicz <mina86@mina86.com> wrote:

> From: Michal Nazarewicz <mina86@mina86.com>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Américo Wang" <xiyou.wangcong@gmail.com>, "Pintu Agarwal" <pintu_agarwal@yahoo.com>
> Cc: "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, "azurIt" <azurit@pobox.sk>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
> Date: Wednesday, April 13, 2011, 10:25 AM
> On Wed, 13 Apr 2011 15:56:00 +0200, Pintu Agarwal <pintu_agarwal@yahoo.com> wrote:
> > My requirement is, I wanted to measure the memory fragmentation
> > level in Linux kernel 2.6.29 (ARM Cortex-A8 without swap).
> > How can I measure the fragmentation level (percentage) from
> > /proc/buddyinfo?
> 
> [...]
> 
> > In my Linux 2.6.29 ARM machine, the initial /proc/buddyinfo shows
> > the following:
> > Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
> > Node 1, zone      DMA     15    320    423    225     97     26      1      0      0      0      0      0
> > 
> > After running my sample program (with 16 iterations) the buddyinfo
> > output is as follows:
> > Requesting <16> blocks of memory of block size <262144>........
> > Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
> > Node 1, zone      DMA     15    301    419    224     96     27      1      0      0      0      0      0
> >     nr_free_pages 169
> >     nr_free_pages 6545
> > *****************************************
> > 
> > 
> > Node 0, zone      DMA     17     22      1      1      0      1      1      0      0      0      0      0
> > Node 1, zone      DMA     18      2    305    226     96     27      1      0      0      0      0      0
> >     nr_free_pages 169
> >     nr_free_pages 5514
> > -----------------------------------------
> > 
> > The requested block size is 64 pages (2^6) for each block.
> > But if we see the output after 16 iterations, the buddyinfo shows
> > allocations only from Node 1, orders (2^0, 2^1, 2^2, 2^3).
> > But the actual allocation should happen from the (2^6) block in
> > buddyinfo.
> 
> No.  When you call malloc() only virtual address space is allocated.
> The actual allocation of physical space occurs when user space
> accesses the memory (either reads or writes) and it happens page at a
> time.
> 
> As a matter of fact, if you have a limited number of 0-order pages
> and allocate a block of 64 pages in user space, then access the
> memory, what really happens is that the kernel allocates the 0-order
> pages and, when it runs out of those, splits a 1-order page into two
> 0-order pages and takes one of those.
> 
> Because of the MMU, fragmentation of physical memory is not an issue
> for normal user space programs.
> 
> It becomes an issue once you deal with hardware that has no MMU and
> no support for scatter-gather DMA, or with some big kernel structures.
> 
> /proc/buddyinfo tells you how many free pages of a given order there
> are in the system.  You may interpret it in such a way that the
> bigger the number of low-order pages, the bigger the fragmentation of
> physical memory.  If there was no fragmentation (for some definition
> of the term) you'd get only the highest-order pages and at most one
> page for each lower order.
> 
> Again though, this fragmentation is not an issue for user space
> programs.
> 
> -- 
> Best regards,                                            _     _
> .o. | Liege of Serenely Enlightened Majesty of         o' \,=./ `o
> ..o | Computer Science,  Michal "mina86" Nazarewicz      (o o)
> ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--
> 
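That reading of /proc/buddyinfo can be turned into a number. The sketch below parses one of the Node 1 lines quoted earlier and reports how much of the free memory sits in blocks below a target order; the metric and the order-6 target (the 64-page block size from the sample program) are choices of this illustration, not something defined in the thread:

```python
def fragmentation_index(counts, target_order):
    """Fraction of free memory (in pages) that is unusable for a
    single allocation of `target_order`, given per-order free-block
    counts in /proc/buddyinfo order (order 0 first)."""
    total = sum(n << o for o, n in enumerate(counts))
    unusable = sum(n << o for o, n in enumerate(counts) if o < target_order)
    return unusable / total if total else 0.0

# Node 1 before the sample program ran, as quoted in the thread:
line = "Node 1, zone      DMA     15    320    423    225     97     26      1      0      0      0      0      0"
counts = [int(x) for x in line.split()[4:]]
print(f"{fragmentation_index(counts, 6):.0%} of free memory is below order 6")
```

By this measure almost all of the zone's free memory is already in blocks too small for a single order-6 allocation, which matches the observation that the 64-page requests were served by splitting low orders.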

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-14  6:31                                 ` Eric Dumazet
@ 2011-04-14  9:08                                   ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-14  9:08 UTC (permalink / raw)
  To: Eric Dumazet, Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman


Here it is:


# ls /proc/31416/fd | wc -l
5926


azur


______________________________________________________________
> From: "Eric Dumazet" <eric.dumazet@gmail.com>
> To: Andrew Morton <akpm@linux-foundation.org>
> Date: 14.04.2011 08:32
> Subject: Re: Regression from 2.6.36
>
> CC: "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>, "Mel Gorman" <mel@csn.ul.ie>
>On Wednesday 13 April 2011 at 22:28 -0700, Andrew Morton wrote:
>> On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> 
>> > > --- a/fs/file.c~a
>> > > +++ a/fs/file.c
>> > > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> > >   */
>> > >  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>> > >  
>> > > -static inline void *alloc_fdmem(unsigned int size)
>> > > +static void *alloc_fdmem(unsigned int size)
>> > >  {
>> > > -	void *data;
>> > > -
>> > > -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > -	if (data != NULL)
>> > > -		return data;
>> > > -
>> > > +	/*
>> > > +	 * Very large allocations can stress page reclaim, so fall back to
>> > > +	 * vmalloc() if the allocation size will be considered "large" by the VM.
>> > > +	 */
>> > > +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> > > +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > +		if (data != NULL)
>> > > +			return data;
>> > > +	}
>> > >  	return vmalloc(size);
>> > >  }
>> > >  
>> > > _
>> > > 
>> > 
>> > Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>> > 
>> > #define PAGE_ALLOC_COSTLY_ORDER 3
>> > 
>> > On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
>> 
>> Thanks.  I added the cc:stable to the changelog.
>> 
>> It'd be nice to get this tested if poss, to confirm that it actually
>> fixes things.
>> 
>> Also, Melpoke.
>
>Azurit, could you check how many fds are opened by your apache servers ?
>(must be related to number of virtual hosts / access_log / error_log
>files)
>
>Pick one pid from ps list
>ps aux | grep apache
>
>ls /proc/{pid_of_one_apache}/fd | wc -l
>
>or
>
>lsof -p {pid_of_one_apache} | tail -n 2
>apache2 8501 httpadm   13w   REG     104,7  2350407   3866638 /data/logs/httpd/rewrites.log
>apache2 8501 httpadm   14r  0000      0,10        0 263148343 eventpoll
>
>Here it's "14"
>
>Thanks
>
>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
@ 2011-04-14  9:08                                   ` azurIt
  0 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-14  9:08 UTC (permalink / raw)
  To: Eric Dumazet, Andrew Morton
  Cc: Changli Gao, Américo Wang, Jiri Slaby, linux-kernel,
	linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman


Here it is:


# ls /proc/31416/fd | wc -l
5926


azur


______________________________________________________________
> Od: "Eric Dumazet" <eric.dumazet@gmail.com>
> Komu: Andrew Morton <akpm@linux-foundation.org>
> DA!tum: 14.04.2011 08:32
> Predmet: Re: Regression from 2.6.36
>
> CC: "Changli Gao" <xiaosuo@gmail.com>, "AmA(C)rico Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>, "Mel Gorman" <mel@csn.ul.ie>
>Le mercredi 13 avril 2011 A  22:28 -0700, Andrew Morton a A(C)crit :
>> On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> 
>> > > --- a/fs/file.c~a
>> > > +++ a/fs/file.c
>> > > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> > >   */
>> > >  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>> > >  
>> > > -static inline void *alloc_fdmem(unsigned int size)
>> > > +static void *alloc_fdmem(unsigned int size)
>> > >  {
>> > > -	void *data;
>> > > -
>> > > -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > -	if (data != NULL)
>> > > -		return data;
>> > > -
>> > > +	/*
>> > > +	 * Very large allocations can stress page reclaim, so fall back to
>> > > +	 * vmalloc() if the allocation size will be considered "large" by the VM.
>> > > +	 */
>> > > +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> > > +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > +		if (data != NULL)
>> > > +			return data;
>> > > +	}
>> > >  	return vmalloc(size);
>> > >  }
>> > >  
>> > > _
>> > > 
>> > 
>> > Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>> > 
>> > #define PAGE_ALLOC_COSTLY_ORDER 3
>> > 
>> > On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
>> 
>> Thanks.  I added the cc:stable to the changelog.
>> 
>> It'd be nice to get this tested if poss, to confirm that it actually
>> fixes things.
>> 
>> Also, Melpoke.
>
>Azurit, could you check how many fds are opened by your apache servers ?
>(must be related to number of virtual hosts / access_log / error_log
>files)
>
>Pick one pid from ps list
>ps aux | grep apache
>
>ls /proc/{pid_of_one_apache}/fd | wc -l
>
>or
>
>lsof -p {pid_of_one_apache} | tail -n 2
>apache2 8501 httpadm   13w   REG     104,7  2350407   3866638 /data/logs/httpd/rewrites.log
>apache2 8501 httpadm   14r  0000      0,10        0 263148343 eventpoll
>
>Here it's "14"
>
>Thanks
>
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-13 21:16                           ` Andrew Morton
                                             ` (4 preceding siblings ...)
  (?)
@ 2011-04-14 10:25                           ` Mel Gorman
  2011-04-15  9:59                               ` azurIt
  -1 siblings, 1 reply; 98+ messages in thread
From: Mel Gorman @ 2011-04-14 10:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Changli Gao, Américo Wang, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

[-- Attachment #1: Type: text/plain, Size: 4995 bytes --]

On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2011 04:37:36 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Le mardi 12 avril 2011 à 18:31 -0700, Andrew Morton a écrit :
> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > > > <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > It's somewhat unclear (to me) what caused this regression.
> > > > >
> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > > > and this makes the page allocator go nuts trying to satisfy high-order
> > > > > page allocation requests?
> > > > >
> > > > > Is it because the kernel now will usually free the fdtable
> > > > > synchronously within the rcu callback, rather than deferring this to a
> > > > > workqueue?
> > > > >
> > > > > The latter seems unlikely, so I'm thinking this was a case of
> > > > > high-order-allocations-considered-harmful?
> > > > >
> > > > 
> > > > Maybe, but I am not sure. Maybe my patch causes too many inner
> > > > fragments. For example, when asking for 5 pages, get 8 pages, and 3
> > > > pages are wasted, then memory thrash happens finally.
> > > 
> > > That theory sounds less likely, but could be tested by using
> > > alloc_pages_exact().
> > > 
> > 
> > Very unlikely, since fdtable sizes are powers of two, unless you hit
> > sysctl_nr_open and it was changed (default value being 2^20)
> > 
> 
> So am I correct in believing that this regression is due to the
> high-order allocations putting excess stress onto page reclaim?
> 

This is very plausible but it would be nice to get confirmation on
what the size of the fdtable was to be sure. If it's big enough for
high-order allocations and it's a fork-heavy workload with memory
mostly in use, the fork() latencies could be getting very high. In
addition, each fork is potentially kicking kswapd awake (to rebalance
the zone for higher orders). I do not see CONFIG_COMPACTION enabled,
meaning that if I'm right that kswapd is awake and fork() is entering
direct reclaim, then we are lumpy reclaiming as well, which can stall
pretty severely.

> If so, then how large _are_ these allocations?  This perhaps can be
> determined from /proc/slabinfo.  They must be pretty huge, because slub
> likes to do excessively-large allocations and the system handles that
> reasonably well.
> 

I'd be interested in finding out the value of /proc/sys/fs/file-max and
the output of ulimit -n (max open files) for the main server. This
should help us determine what the size of the fdtable is.

> I suppose that a suitable fix would be
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> 
> Azurit reports large increases in system time after 2.6.36 when running
> Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
> 
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
> 
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Cc: Changli Gao <xiaosuo@gmail.com>
> Cc: Americo Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/file.c |   17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>   */
>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>  
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
>  {
> -	void *data;
> -
> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> -	if (data != NULL)
> -		return data;
> -
> +	/*
> +	 * Very large allocations can stress page reclaim, so fall back to
> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
> +	 */
> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {

The reporter will need to retest to confirm this is really ok. The patch
that was reported to help avoided high-order allocations entirely. If
fork-heavy workloads are really entering direct reclaim and increasing
fork latency enough to ruin performance, then this patch will also
suffer. How much it helps depends on how big the fdtable is.

> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> +		if (data != NULL)
> +			return data;
> +	}
>  	return vmalloc(size);
>  }
>  

I'm attaching a primitive perl script that reports high-order allocation
latencies. It would be interesting to see what its output looks like,
particularly when the server is in trouble, if the bug reporter has the
time.

-- 
Mel Gorman
SUSE Labs

[-- Attachment #2: watch-highorder-latency.pl --]
[-- Type: application/x-perl, Size: 2069 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-14  9:08                                   ` azurIt
  (?)
@ 2011-04-14 10:27                                     ` Eric Dumazet
  -1 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2011-04-14 10:27 UTC (permalink / raw)
  To: azurIt
  Cc: Andrew Morton, Changli Gao, Américo Wang, Jiri Slaby,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman

Le jeudi 14 avril 2011 à 11:08 +0200, azurIt a écrit :
> Here it is:
> 
> 
> # ls /proc/31416/fd | wc -l
> 5926

Hmm, if it's a 32-bit kernel, I am afraid Andrew's patch won't solve the
problem...

[On a 32-bit kernel, we still use kmalloc() for up to 8192 files]
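The 4096/8192 figures follow from the fd array being one `struct file *`
per slot. A back-of-the-envelope sketch (assumptions: 4 KiB pages, and
the usual 8-byte/4-byte pointer sizes on 64-bit/32-bit; not authoritative
for every architecture):

```python
# Sketch of the kmalloc/vmalloc cutoff in Andrew's patch (assumes 4 KiB
# pages; PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel).
PAGE_SIZE = 4096
PAGE_ALLOC_COSTLY_ORDER = 3

threshold = PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER   # 32 KiB "costly" cutoff

# The fd array is one 'struct file *' per slot: 8 bytes on 64-bit, 4 on 32-bit.
print(threshold // 8)   # 4096 -> max fds still kmalloc()ed on 64-bit
print(threshold // 4)   # 8192 -> max fds still kmalloc()ed on 32-bit

# azurIt reported 5926 open fds; fdtable sizes are powers of two, so the
# table has 8192 slots: 64 KiB on 64-bit (above the cutoff -> vmalloc),
# but exactly 32 KiB on 32-bit (still kmalloc, hence Eric's concern).
slots = 1 << (5926 - 1).bit_length()   # round up to the next power of two
print(slots, slots * 8, slots * 4)     # 8192 65536 32768
```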



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-14 10:27                                     ` Eric Dumazet
  (?)
@ 2011-04-14 10:31                                       ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-14 10:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Changli Gao, Américo Wang, Jiri Slaby,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby, Mel Gorman


It's a completely 64-bit system.



______________________________________________________________
> Od: "Eric Dumazet" <eric.dumazet@gmail.com>
> Komu: azurIt <azurit@pobox.sk>
> Dátum: 14.04.2011 12:28
> Predmet: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <akpm@linux-foundation.org>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>, "Mel Gorman" <mel@csn.ul.ie>
>Le jeudi 14 avril 2011 à 11:08 +0200, azurIt a écrit :
>> Here it is:
>> 
>> 
>> # ls /proc/31416/fd | wc -l
>> 5926
>
>Hmm, if its a 32bit kernel, I am afraid Andrew patch wont solve the
>problem...
>
>[On 32bit kernel, we still use kmalloc() up to 8192 files ]
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-14  6:44                                   ` Pintu Agarwal
@ 2011-04-14 10:47                                     ` Michal Nazarewicz
  -1 siblings, 0 replies; 98+ messages in thread
From: Michal Nazarewicz @ 2011-04-14 10:47 UTC (permalink / raw)
  To: Américo Wang, Pintu Agarwal
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Thu, 14 Apr 2011 08:44:50 +0200, Pintu Agarwal  
<pintu_agarwal@yahoo.com> wrote:
> As I can understand from your comments that, malloc from user space will  
> not have much impact on memory fragmentation.

It has an impact, just like any kind of allocation; it just doesn't care
about fragmentation of physical memory.  You can have only 0-order pages
and still successfully allocate megabytes of memory with malloc().

> Will the memory fragmentation be visible if I do kmalloc from
> the kernel module????

It will be more visible in the sense that if you allocate 8 KiB, the
kernel will have to find 8 KiB of contiguous physical memory (i.e. a
1-order page).

>> No.  When you call malloc() only virtual address space is allocated.
>> The actual allocation of physical space occurs when user space accesses
>> the memory (either reads or writes) and it happens page at a time.
>
> Here, if I do memset then I am accessing the memory...right? That I am  
> doing already in my sample program.

Yes.  But note that even though it's a single memset() call, you are
accessing a page at a time and the kernel is allocating a page at a time.

On some architectures (not ARM) you could access two pages with a single
instruction, but I think that would result in two page faults anyway.  I
might be wrong, though; the details are not important here.

>> what really happens is that kernel allocates the 0-order
>> pages and when
>> it runs out of those, splits a 1-order page into two
>> 0-order pages and
>> takes one of those.
>
> Actually, if I understand buddy allocator, it allocates pages from top  
> to bottom.

No.  If you want to allocate a single 0-order page, buddy looks for a
free 0-order page.  If one is not found, it will look for a 1-order page
and split it.  This goes up until buddy reaches a (MAX_ORDER-1)-order
page.
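
A toy model of that order-0 lookup (illustrative only; the real buddy
allocator tracks physical addresses, per-zone free lists, migrate types,
and coalescing on free):

```python
# Toy buddy allocator: to satisfy an order-0 request, scan free lists from
# order 0 upward; the first non-empty list supplies a block, which is split
# back down, leaving one free buddy at each intermediate order.
MAX_ORDER = 11  # kernel default

def alloc_order0(free_lists):
    """free_lists[k] holds the count of free order-k blocks."""
    for order in range(MAX_ORDER):
        if free_lists[order]:
            free_lists[order] -= 1
            # Each split frees one buddy at the next lower order.
            for lower in range(order - 1, -1, -1):
                free_lists[lower] += 1
            return True   # an order-0 page was obtained
    return False          # nothing free at any order

lists = [0, 0, 1] + [0] * (MAX_ORDER - 3)   # a single free order-2 block
alloc_order0(lists)
print(lists[:3])   # [1, 1, 0]: one order-0 and one order-1 buddy remain free
```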

> Is memory fragmentation always caused by kernel-space
> programs and never by user space?

Well, no.  If you allocate memory in user space, the kernel will have to
allocate physical memory, and *every* allocation may contribute to
fragmentation.  The point is that all allocations from user space are
single-page allocations even if you malloc() MiBs of memory.

> Can you provide me with some references for migitating memory  
> fragmentation in linux?

I'm not sure what you mean by that.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-14 10:47                                     ` Michal Nazarewicz
@ 2011-04-14 12:24                                       ` Pintu Agarwal
  -1 siblings, 0 replies; 98+ messages in thread
From: Pintu Agarwal @ 2011-04-14 12:24 UTC (permalink / raw)
  To: Américo Wang, Michal Nazarewicz
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

Hello Mr. Michal,

Thanks for your comments.
Sorry. There was a small typo in my last sentence (mitigating not *migitating* memory fragmentation)
What I meant was: how can I measure memory fragmentation, either from user space or from kernel space?
Is there a way to measure the amount of memory fragmentation in Linux?
Can you provide me some references for that?


Thanks,
Pintu



--- On Thu, 4/14/11, Michal Nazarewicz <mina86@mina86.com> wrote:

> From: Michal Nazarewicz <mina86@mina86.com>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Américo Wang" <xiyou.wangcong@gmail.com>, "Pintu Agarwal" <pintu_agarwal@yahoo.com>
> Cc: "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, "azurIt" <azurit@pobox.sk>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
> Date: Thursday, April 14, 2011, 5:47 AM
> On Thu, 14 Apr 2011 08:44:50 +0200,
> Pintu Agarwal <pintu_agarwal@yahoo.com>
> wrote:
> > As I can understand from your comments that, malloc
> from user space will not have much impact on memory
> fragmentation.
> 
> It has an impact, just like any kind of allocation, it just
> don't care about
> fragmentation of physical memory.  You can have only
> 0-order pages and
> successfully allocate megabytes of memory with malloc().
> 
> > Will the memory fragmentation be visible if I do
> kmalloc from
> > the kernel module????
> 
> It will be more visible in the sense that if you allocate 8
> KiB, kernel will
> have to find 8 KiB contiguous physical memory (ie. 1-order
> page).
> 
> >> No.  When you call malloc() only virtual
> address space is allocated.
> >> The actual allocation of physical space occurs
> when user space accesses
> >> the memory (either reads or writes) and it happens
> page at a time.
> > 
> > Here, if I do memset then I am accessing the
> memory...right? That I am doing already in my sample
> program.
> 
> Yes.  But note that even though it's a single memset()
> call, you are
> accessing page at a time and kernel is allocating page at a
> time.
> 
> On some architectures (not ARM) you could access two pages
> with a single
> instructions but I think that would result in two page
> faults anyway.  I
> might be wrong though, the details are not important
> though.
> 
> >> what really happens is that kernel allocates the
> 0-order
> >> pages and when
> >> it runs out of those, splits a 1-order page into
> two
> >> 0-order pages and
> >> takes one of those.
> > 
> > Actually, if I understand buddy allocator, it
> allocates pages from top to bottom.
> 
> No.  If you want to allocate a single 0-order page,
> buddy looks for a
> a free 0-order page.  If one is not found, it will
> look for 1-order page
> and split it.  This goes up till buddy reaches
> (MAX_ORDER-1)-page.
> 
> > Is the memory fragmentation is always a cause of the
> kernel space program and not user space at all?
> 
> Well, no.  If you allocate memory in user space,
> kernel will have to
> allocate physical memory and *every* allocation may
> contribute to
> fragmentation.  The point is, that all allocations
> from user-space are
> single-page allocations even if you malloc() MiBs of
> memory.
> 
> > Can you provide me with some references for migitating
> memory fragmentation in linux?
> 
> I'm not sure what you mean by that.
> 
> --Best regards,           
>                
>              _ 
>    _
> .o. | Liege of Serenely Enlightened Majesty of   
>   o' \,=./ `o
> ..o | Computer Science,  Michal "mina86"
> Nazarewicz    (o o)
> ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm'
> in
> the body to majordomo@kvack.org. 
> For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org">
> email@kvack.org
> </a>
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regarding memory fragmentation using malloc....
  2011-04-14 12:24                                       ` Pintu Agarwal
@ 2011-04-14 12:31                                         ` Michal Nazarewicz
  -1 siblings, 0 replies; 98+ messages in thread
From: Michal Nazarewicz @ 2011-04-14 12:31 UTC (permalink / raw)
  To: Américo Wang, Pintu Agarwal
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Jiri Slaby, azurIt,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Thu, 14 Apr 2011 14:24:56 +0200, Pintu Agarwal  
<pintu_agarwal@yahoo.com> wrote:

> Sorry. There was a small typo in my last sentence (mitigating not  
> *migitating* memory fragmentation)
> That means how can I measure the memory fragmentation either from user  
> space or from kernel space.
> Is there a way to measure the amount of memory fragmentation in linux?

I'm still not entirely sure what you need.  You may try to measure
fragmentation by the number of low-order free pages -- the more
low-order pages there are compared to high-order pages, the greater
the fragmentation.

As for how to mitigate it...  There's memory compaction.  There are
some optimisations in the buddy system.  I'm probably not the best
person to ask anyway.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 98+ messages in thread
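Michal's low-order/high-order heuristic can be tried from user space by parsing /proc/buddyinfo. The sketch below is illustrative only: the column layout is assumed from a typical x86 /proc/buddyinfo, and the metric (the fraction of free pages sitting in blocks below a target order) is one rough fragmentation measure, not something the thread prescribes.

```python
# Sketch: estimate external fragmentation from a /proc/buddyinfo line,
# following the heuristic above -- lots of free memory stuck in
# low-order blocks means high fragmentation.

def fragmentation_index(buddyinfo_line, target_order):
    """Fraction of free pages in blocks smaller than target_order.

    A buddyinfo line looks like:
    "Node 0, zone   Normal   12   5   3   2   1   0   0   0   0   0   0"
    where column i (after the zone name) counts free order-i blocks.
    """
    counts = [int(c) for c in buddyinfo_line.split()[4:]]
    total = sum(n << order for order, n in enumerate(counts))
    small = sum(n << order for order, n in enumerate(counts[:target_order]))
    return small / total if total else 0.0
```

A zone whose free memory is entirely in order-0 blocks scores 1.0 for any target order above zero, while a zone holding one free order-10 block scores 0.0.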

* Re: Regression from 2.6.36
  2011-04-14 10:25                           ` Mel Gorman
@ 2011-04-15  9:59                               ` azurIt
  0 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-15  9:59 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Eric Dumazet, Changli Gao, Américo Wang, Jiri Slaby,
	linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby


This new patch is also working fine and fixes the problem.

Mel, I cannot run your script:
# perl watch-highorder-latency.pl
Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.

# ls -ld /sys/kernel/debug/
ls: cannot access /sys/kernel/debug/: No such file or directory


azur

______________________________________________________________
> From: "Mel Gorman" <mel@csn.ul.ie>
> To: Andrew Morton <akpm@linux-foundation.org>
> Date: 14.04.2011 12:25
> Subject: Re: Regression from 2.6.36
>
> CC: "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
>> On Wed, 13 Apr 2011 04:37:36 +0200
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> 
>> > On Tuesday, 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
>> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:
>> > > 
>> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
>> > > > <akpm@linux-foundation.org> wrote:
>> > > > >
>> > > > > It's somewhat unclear (to me) what caused this regression.
>> > > > >
>> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
>> > > > > and this makes the page allocator go nuts trying to satisfy high-order
>> > > > > page allocation requests?
>> > > > >
>> > > > > Is it because the kernel now will usually free the fdtable
>> > > > > synchronously within the rcu callback, rather than deferring this to a
>> > > > > workqueue?
>> > > > >
>> > > > > The latter seems unlikely, so I'm thinking this was a case of
>> > > > > high-order-allocations-considered-harmful?
>> > > > >
>> > > > 
>> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
>> > > > fragmentation. For example, when asking for 5 pages you get 8, so 3
>> > > > pages are wasted, and eventually memory thrashing sets in.
>> > > 
>> > > That theory sounds less likely, but could be tested by using
>> > > alloc_pages_exact().
>> > > 
>> > 
>> > Very unlikely, since fdtable sizes are powers of two, unless you hit
>> > sysctl_nr_open and it was changed (default value being 2^20)
>> > 
>> 
>> So am I correct in believing that this regression is due to the
>> high-order allocations putting excess stress onto page reclaim?
>> 
>
>This is very plausible but it would be nice to get confirmation on
>what the size of the fdtable was to be sure. If it's big enough for
>high-order allocations and it's a fork-heavy workload with memory
>mostly in use, the fork() latencies could be getting very high. In
>addition, each fork is potentially kicking kswapd awake (to rebalance
>the zone for higher orders). I do not see CONFIG_COMPACTION enabled
>meaning that if I'm right in that kswapd is awake and fork() is
>entering direct reclaim, then we are lumpy reclaiming as well which
>can stall pretty severely.
>
>> If so, then how large _are_ these allocations?  This perhaps can be
>> determined from /proc/slabinfo.  They must be pretty huge, because slub
>> likes to do excessively-large allocations and the system handles that
>> reasonably well.
>> 
>
>I'd be interested in finding out the value of /proc/sys/fs/file-max and
>the output of ulimit -n (max open files) for the main server is. This
>should help us determine what the size of the fdtable is.
>
>> I suppose that a suitable fix would be
>> 
>> 
>> From: Andrew Morton <akpm@linux-foundation.org>
>> 
>> Azurit reports large increases in system time after 2.6.36 when running
>> Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
>> to allocate fdmem if possible").
>> 
>> That patch caused the vfs to use kmalloc() for very large allocations and
>> this is causing excessive work (and presumably excessive reclaim) within
>> the page allocator.
>> 
>> Fix it by falling back to vmalloc() earlier - when the allocation attempt
>> would have been considered "costly" by reclaim.
>> 
>> Reported-by: azurIt <azurit@pobox.sk>
>> Cc: Changli Gao <xiaosuo@gmail.com>
>> Cc: Americo Wang <xiyou.wangcong@gmail.com>
>> Cc: Jiri Slaby <jslaby@suse.cz>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> 
>>  fs/file.c |   17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>> 
>> diff -puN fs/file.c~a fs/file.c
>> --- a/fs/file.c~a
>> +++ a/fs/file.c
>> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>>   */
>>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>>  
>> -static inline void *alloc_fdmem(unsigned int size)
>> +static void *alloc_fdmem(unsigned int size)
>>  {
>> -	void *data;
>> -
>> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> -	if (data != NULL)
>> -		return data;
>> -
>> +	/*
>> +	 * Very large allocations can stress page reclaim, so fall back to
>> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
>> +	 */
>> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>
>The reporter will need to retest that this is really OK. The patch that
>was reported to help avoided high-order allocations entirely. If
>fork-heavy workloads are really entering direct reclaim and increasing
>fork latency enough to ruin performance, then this patch will also
>suffer. How much it helps depends on how big the fdtable is.
>
>> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> +		if (data != NULL)
>> +			return data;
>> +	}
>>  	return vmalloc(size);
>>  }
>>  
>
>I'm attaching a primitive perl script that reports high-order allocation
>latencies. It would be interesting to see what its output looks like,
>particularly when the server is in trouble, if the bug reporter has the
>time.
>
>-- 
>Mel Gorman
>SUSE Labs
>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread
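Andrew's patch gates on PAGE_ALLOC_COSTLY_ORDER, and Mel's question about `ulimit -n` is about whether the fdtable crosses that gate. A userspace model (with assumed constants: 4 KiB pages, PAGE_ALLOC_COSTLY_ORDER = 3, and 8-byte `struct file` pointers, as on x86-64) makes the arithmetic concrete:

```python
# Userspace model of the patched alloc_fdmem() decision above.
# Assumed constants (typical x86-64; not taken from the thread):
PAGE_SIZE = 4096
PAGE_ALLOC_COSTLY_ORDER = 3
COSTLY_BYTES = PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER  # 32 KiB

def alloc_fdmem_strategy(size):
    """Mirror the patched logic: kmalloc() for small requests, vmalloc()
    once the request would need a "costly" high-order page allocation."""
    return "kmalloc" if size <= COSTLY_BYTES else "vmalloc"

def fdtable_bytes(max_fds):
    # The fd array holds one 8-byte struct file pointer per descriptor.
    return max_fds * 8
```

With the default `ulimit -n` of 1024 the fd array is 8 KiB and stays with kmalloc(); a server raised to a million descriptors needs an 8 MiB table, which the patched code hands straight to vmalloc().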

* Re: Regression from 2.6.36
  2011-04-15  9:59                               ` azurIt
@ 2011-04-15 10:47                                 ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2011-04-15 10:47 UTC (permalink / raw)
  To: azurIt
  Cc: Mel Gorman, Andrew Morton, Eric Dumazet, Changli Gao,
	Américo Wang, Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel,
	Jiri Slaby

On Fri, Apr 15, 2011 at 11:59:03AM +0200, azurIt wrote:
> 
> Also this new patch is working fine and fixing the problem.
> 
> Mel, I cannot run your script:
> # perl watch-highorder-latency.pl
> Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
> 
> # ls -ld /sys/kernel/debug/
> ls: cannot access /sys/kernel/debug/: No such file or directory
> 

mount -t debugfs none /sys/kernel/debug

If it still doesn't work, sysfs or the necessary FTRACE options are
not enabled on your .config. I'll give you a list if that is the case.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread
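Before running the trace script, one can also check programmatically whether debugfs is mounted at all. This is a hypothetical helper, not part of Mel's script; it only mirrors the mount check implied by the commands quoted in this thread (`mounts_text` is the content of /proc/mounts):

```python
# Sketch: find where debugfs is mounted, given the text of /proc/mounts.
# Each /proc/mounts line is "device mountpoint fstype options dump pass".

def debugfs_mountpoint(mounts_text):
    """Return the debugfs mount point, or None if debugfs is not mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            return fields[1]
    return None
```

If it returns None, `mount -t debugfs none /sys/kernel/debug` (after sysfs itself is mounted) should fix the perl script's open failure.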

* Re: Regression from 2.6.36
  2011-04-15 10:47                                 ` Mel Gorman
  (?)
@ 2011-04-15 10:56                                   ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-15 10:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, Andrew Morton, Eric Dumazet, Changli Gao,
	Américo Wang, Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel,
	Jiri Slaby


# mount -t debugfs none /sys/kernel/debug
mount: mount point /sys/kernel/debug does not exist

# mkdir /sys/kernel/debug
mkdir: cannot create directory `/sys/kernel/debug': No such file or directory


config file used for testing is here:
http://watchdog.sk/lkml/config


azur


______________________________________________________________
> From: "Mel Gorman" <mgorman@suse.de>
> To: azurIt <azurit@pobox.sk>
> Date: 15.04.2011 12:47
> Subject: Re: Regression from 2.6.36
>
> CC: "Mel Gorman" <mel@csn.ul.ie>, "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Fri, Apr 15, 2011 at 11:59:03AM +0200, azurIt wrote:
>> 
>> Also this new patch is working fine and fixing the problem.
>> 
>> Mel, I cannot run your script:
>> # perl watch-highorder-latency.pl
>> Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
>> 
>> # ls -ld /sys/kernel/debug/
>> ls: cannot access /sys/kernel/debug/: No such file or directory
>> 
>
>mount -t debugfs none /sys/kernel/debug
>
>If it still doesn't work, sysfs or the necessary FTRACE options are
>not enabled on your .config. I'll give you a list if that is the case.
>
>Thanks.
>
>-- 
>Mel Gorman
>SUSE Labs
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-15 10:56                                   ` azurIt
@ 2011-04-15 11:17                                     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2011-04-15 11:17 UTC (permalink / raw)
  To: azurIt
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Fri, 2011-04-15 at 12:56 +0200, azurIt wrote:
> # mount -t debugfs none /sys/kernel/debug
> mount: mount point /sys/kernel/debug does not exist
> 
> # mkdir /sys/kernel/debug
> mkdir: cannot create directory `/sys/kernel/debug': No such file or directory
> 

Mount sysfs first

mount -t sysfs none /sys

> 
> config file used for testing is here:
> http://watchdog.sk/lkml/config
> 

Try setting the following

CONFIG_TRACEPOINTS=y
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_GENERIC_TRACER=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_MMIOTRACE=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 98+ messages in thread
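[Editor's note: the bring-up Mel describes — mount sysfs, create the mount point, then mount debugfs so the ftrace files appear — can be sketched as a small script. This is a hedged illustration: the `run()` helper and the `DRYRUN` switch are additions of mine, and the script defaults to only printing the commands, since the real mounts need root on a kernel built with the FTRACE options listed above.]

```shell
# Sketch of the debugfs bring-up from this thread. DRYRUN defaults
# to 1, so the script only echoes what it would do; set DRYRUN=0
# and run as root on a real machine to actually perform the mounts.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run mount -t sysfs none /sys
run mkdir -p /sys/kernel/debug
run mount -t debugfs none /sys/kernel/debug
# If this file exists, debugfs is mounted and ftrace is compiled in.
run ls /sys/kernel/debug/tracing/set_ftrace_filter
```

With DRYRUN=0, a successful `ls` of set_ftrace_filter is the confirmation azurIt's watch-highorder-latency.pl script was missing.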

* Re: Regression from 2.6.36
  2011-04-15 11:17                                     ` Mel Gorman
  (?)
@ 2011-04-15 11:36                                       ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-15 11:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby


sysfs was already mounted:

# mount
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)


I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot the new kernel tonight. Hope it won't degrade performance much.


______________________________________________________________
> From: "Mel Gorman" <mgorman@suse.de>
> To: azurIt <azurit@pobox.sk>
> Date: 15.04.2011 13:17
> Subject: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Fri, 2011-04-15 at 12:56 +0200, azurIt wrote:
>> # mount -t debugfs none /sys/kernel/debug
>> mount: mount point /sys/kernel/debug does not exist
>> 
>> # mkdir /sys/kernel/debug
>> mkdir: cannot create directory `/sys/kernel/debug': No such file or directory
>> 
>
>Mount sysfs first
>
>mount -t sysfs none /sys
>
>> 
>> config file used for testing is here:
>> http://watchdog.sk/lkml/config
>> 
>
>Try setting the following
>
>CONFIG_TRACEPOINTS=y
>CONFIG_STACKTRACE=y
>CONFIG_USER_STACKTRACE_SUPPORT=y
>CONFIG_NOP_TRACER=y
>CONFIG_TRACER_MAX_TRACE=y
>CONFIG_FTRACE_NMI_ENTER=y
>CONFIG_CONTEXT_SWITCH_TRACER=y
>CONFIG_GENERIC_TRACER=y
>CONFIG_FTRACE=y
>CONFIG_FUNCTION_TRACER=y
>CONFIG_FUNCTION_GRAPH_TRACER=y
>CONFIG_IRQSOFF_TRACER=y
>CONFIG_SCHED_TRACER=y
>CONFIG_FTRACE_SYSCALLS=y
>CONFIG_STACK_TRACER=y
>CONFIG_BLK_DEV_IO_TRACE=y
>CONFIG_DYNAMIC_FTRACE=y
>CONFIG_FTRACE_MCOUNT_RECORD=y
>CONFIG_FTRACE_SELFTEST=y
>CONFIG_FTRACE_STARTUP_TEST=y
>CONFIG_MMIOTRACE=y
>CONFIG_HAVE_MMIOTRACE_SUPPORT=y
>
>-- 
>Mel Gorman
>SUSE Labs
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-15 11:36                                       ` azurIt
@ 2011-04-15 13:01                                         ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2011-04-15 13:01 UTC (permalink / raw)
  To: azurIt
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Fri, 2011-04-15 at 13:36 +0200, azurIt wrote:
> sysfs was already mounted:
> 
> # mount
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> 
> 
> I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot new kernel this night. Hope it won't degraded performance much..
> 

It's only for curiosity's sake. As you report the patch fixes the
problem, it matches the theory that it's allocator latency. The script
would confirm it for sure, but it's not a high priority.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-15 13:01                                         ` Mel Gorman
  (?)
@ 2011-04-15 13:21                                           ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-15 13:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby


So it's really not necessary? It would be better for us if you could go without it, because it means running a buggy kernel for one more day.

Which kernel versions will include this fix?

Thank you very much!

azur



______________________________________________________________
> From: "Mel Gorman" <mgorman@suse.de>
> To: azurIt <azurit@pobox.sk>
> Date: 15.04.2011 15:01
> Subject: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <akpm@linux-foundation.org>, "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Fri, 2011-04-15 at 13:36 +0200, azurIt wrote:
>> sysfs was already mounted:
>> 
>> # mount
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> 
>> 
>> I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot new kernel this night. Hope it won't degraded performance much..
>> 
>
>It's only for curiousity's sake. As you report the patch fixes the
>problem, it matches the theory that it's allocator latency. The script
>would confirm it for sure, but it's not a high priority.
>
>-- 
>Mel Gorman
>SUSE Labs
>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-15 13:21                                           ` azurIt
@ 2011-04-15 14:15                                             ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2011-04-15 14:15 UTC (permalink / raw)
  To: azurIt
  Cc: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby

On Fri, 2011-04-15 at 15:21 +0200, azurIt wrote:
> So it's really not necessary ? It would be better for us if you can go without it cos it means to run buggy kernel for one more day.
> 

I can live without it.

> Which kernel versions will include this fix ?
> 

As it's a performance fix, I would guess 2.6.39 only. I don't think
-stable picks up performance fixes, but I could be wrong.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-13 21:24                             ` Andrew Morton
@ 2011-04-19 19:29                               ` azurIt
  -1 siblings, 0 replies; 98+ messages in thread
From: azurIt @ 2011-04-19 19:29 UTC (permalink / raw)
  To: Andrew Morton, Eric Dumazet, Changli Gao, Américo Wang,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby,
	Mel Gorman


Andrew,

Which kernel versions will include this patch? Thank you.

azur



______________________________________________________________
> From: "Andrew Morton" <akpm@linux-foundation.org>
> To: Eric Dumazet <eric.dumazet@gmail.com>,Changli Gao <xiaosuo@gmail.com>,Américo Wang <xiyou.wangcong@gmail.com>,Jiri Slaby <jslaby@suse.cz>, azurIt <azurit@pobox.sk>,linux-kernel@vger.kernel.org, linux-mm@kvack.org,linux-fsdevel@vger.kernel.org, Jiri Slaby <jirislaby@gmail.com>,Mel Gorman <mel@csn.ul.ie>
> Date: 13.04.2011 23:26
> Subject: Re: Regression from 2.6.36
>
>On Wed, 13 Apr 2011 14:16:00 -0700
>Andrew Morton <akpm@linux-foundation.org> wrote:
>
>>  fs/file.c |   17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>
>bah, stupid compiler.
>
>
>--- a/fs/file.c~vfs-avoid-large-kmallocs-for-the-fdtable
>+++ a/fs/file.c
>@@ -9,6 +9,7 @@
> #include <linux/module.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
>+#include <linux/mmzone.h>
> #include <linux/time.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
>@@ -39,14 +40,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>  */
> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
> 
>-static inline void *alloc_fdmem(unsigned int size)
>+static void *alloc_fdmem(unsigned int size)
> {
>-	void *data;
>-
>-	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>-	if (data != NULL)
>-		return data;
>-
>+	/*
>+	 * Very large allocations can stress page reclaim, so fall back to
>+	 * vmalloc() if the allocation size will be considered "large" by the VM.
>+	 */
>+	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>+		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>+		if (data != NULL)
>+			return data;
>+	}
> 	return vmalloc(size);
> }
> 
>_
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Regression from 2.6.36
  2011-04-19 19:29                               ` azurIt
@ 2011-04-19 19:55                                 ` Andrew Morton
  -1 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2011-04-19 19:55 UTC (permalink / raw)
  To: azurIt
  Cc: Eric Dumazet, Changli Gao,  Américo Wang ,
	Jiri Slaby, linux-kernel, linux-mm, linux-fsdevel, Jiri Slaby,
	Mel Gorman

On Tue, 19 Apr 2011 21:29:20 +0200
"azurIt" <azurit@pobox.sk> wrote:

> which kernel versions will include this patch ? Thank you.

Probably 2.6.39. If so, some later 2.6.38.x too.

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2011-04-19 19:57 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-15 13:25 Regression from 2.6.36 azurIt
2011-03-17  0:15 ` Greg KH
2011-03-17  0:53   ` Dave Jones
2011-03-17 13:30     ` azurIt
2011-04-07 10:01   ` azurIt
2011-04-07 10:19     ` Jiri Slaby
2011-04-07 10:19       ` Jiri Slaby
2011-04-07 10:19       ` Jiri Slaby
2011-04-07 11:21       ` Américo Wang
2011-04-07 11:21         ` Américo Wang
2011-04-07 11:57         ` Eric Dumazet
2011-04-07 11:57           ` Eric Dumazet
2011-04-07 11:57           ` Eric Dumazet
2011-04-07 12:13           ` Eric Dumazet
2011-04-07 12:13             ` Eric Dumazet
2011-04-07 12:13             ` Eric Dumazet
2011-04-07 15:27             ` Changli Gao
2011-04-07 15:36               ` Eric Dumazet
2011-04-07 15:36                 ` Eric Dumazet
2011-04-07 15:36                 ` Eric Dumazet
2011-04-12 22:49                 ` Andrew Morton
2011-04-12 22:49                   ` Andrew Morton
2011-04-13  1:23                   ` Changli Gao
2011-04-13  1:23                     ` Changli Gao
2011-04-13  1:31                     ` Andrew Morton
2011-04-13  1:31                       ` Andrew Morton
2011-04-13  2:37                       ` Eric Dumazet
2011-04-13  2:37                         ` Eric Dumazet
2011-04-13  2:37                         ` Eric Dumazet
2011-04-13  6:54                         ` Regarding memory fragmentation using malloc Pintu Agarwal
2011-04-13  6:54                           ` Pintu Agarwal
2011-04-13 11:44                           ` Américo Wang
2011-04-13 11:44                             ` Américo Wang
2011-04-13 13:56                             ` Pintu Agarwal
2011-04-13 13:56                               ` Pintu Agarwal
2011-04-13 15:25                               ` Michal Nazarewicz
2011-04-13 15:25                                 ` Michal Nazarewicz
2011-04-14  6:44                                 ` Pintu Agarwal
2011-04-14  6:44                                   ` Pintu Agarwal
2011-04-14 10:47                                   ` Michal Nazarewicz
2011-04-14 10:47                                     ` Michal Nazarewicz
2011-04-14 12:24                                     ` Pintu Agarwal
2011-04-14 12:24                                       ` Pintu Agarwal
2011-04-14 12:31                                       ` Michal Nazarewicz
2011-04-14 12:31                                         ` Michal Nazarewicz
2011-04-13 21:16                         ` Regression from 2.6.36 Andrew Morton
2011-04-13 21:16                           ` Andrew Morton
2011-04-13 21:24                           ` Andrew Morton
2011-04-13 21:24                           ` Andrew Morton
2011-04-13 21:24                             ` Andrew Morton
2011-04-19 19:29                             ` azurIt
2011-04-19 19:29                               ` azurIt
2011-04-19 19:55                               ` Andrew Morton
2011-04-19 19:55                                 ` Andrew Morton
2011-04-13 21:44                           ` David Rientjes
2011-04-13 21:44                             ` David Rientjes
2011-04-13 21:54                             ` Andrew Morton
2011-04-13 21:54                               ` Andrew Morton
2011-04-14  2:10                           ` Eric Dumazet
2011-04-14  2:10                             ` Eric Dumazet
2011-04-14  2:10                             ` Eric Dumazet
2011-04-14  5:28                             ` Andrew Morton
2011-04-14  5:28                               ` Andrew Morton
2011-04-14  6:31                               ` Eric Dumazet
2011-04-14  6:31                                 ` Eric Dumazet
2011-04-14  6:31                                 ` Eric Dumazet
2011-04-14  9:08                                 ` azurIt
2011-04-14  9:08                                   ` azurIt
2011-04-14 10:27                                   ` Eric Dumazet
2011-04-14 10:27                                     ` Eric Dumazet
2011-04-14 10:27                                     ` Eric Dumazet
2011-04-14 10:31                                     ` azurIt
2011-04-14 10:31                                       ` azurIt
2011-04-14 10:31                                       ` azurIt
2011-04-14 10:25                           ` Mel Gorman
2011-04-15  9:59                             ` azurIt
2011-04-15  9:59                               ` azurIt
2011-04-15  9:59                               ` azurIt
2011-04-15 10:47                               ` Mel Gorman
2011-04-15 10:47                                 ` Mel Gorman
2011-04-15 10:56                                 ` azurIt
2011-04-15 10:56                                   ` azurIt
2011-04-15 10:56                                   ` azurIt
2011-04-15 11:17                                   ` Mel Gorman
2011-04-15 11:17                                     ` Mel Gorman
2011-04-15 11:36                                     ` azurIt
2011-04-15 11:36                                       ` azurIt
2011-04-15 11:36                                       ` azurIt
2011-04-15 13:01                                       ` Mel Gorman
2011-04-15 13:01                                         ` Mel Gorman
2011-04-15 13:21                                         ` azurIt
2011-04-15 13:21                                           ` azurIt
2011-04-15 13:21                                           ` azurIt
2011-04-15 14:15                                           ` Mel Gorman
2011-04-15 14:15                                             ` Mel Gorman
2011-04-08 12:25               ` azurIt
2011-04-08 12:25                 ` azurIt
2011-04-08 12:25                 ` azurIt
