* The performance and behaviour of the anti-fragmentation related patches
@ 2007-03-01 10:12 Mel Gorman
  2007-03-02  0:09 ` Andrew Morton
  2007-03-02  1:52 ` Bill Irwin
  0 siblings, 2 replies; 99+ messages in thread
From: Mel Gorman @ 2007-03-01 10:12 UTC (permalink / raw)
  To: akpm, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh
  Cc: linux-mm, linux-kernel

Hi all,

I've posted up patches that implement the two generally accepted approaches
for reducing external fragmentation in the buddy allocator. The first
(list-based) works by grouping pages of related mobility together, across
all existing zones.  The second (zone-based) creates a zone from which only
pages that can be migrated or reclaimed are allocated.
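
To make the list-based mechanism concrete, here is a minimal C sketch of
the idea. It is illustrative only: the names approximate the posted patches,
but the code is not taken from them.

/*
 * Sketch of list-based grouping: instead of one free list per order,
 * keep one per (order, mobility type) pair and let the allocation
 * flags select the list.  Pages of like mobility then cluster within
 * the same MAX_ORDER blocks.
 */
#include <linux/list.h>

#define __GFP_RECLAIMABLE	0x01u	/* illustrative flag values */
#define __GFP_MOVABLE		0x02u

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_TYPES
};

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};

static inline int allocflags_to_migratetype(unsigned int gfp_flags)
{
	if (gfp_flags & __GFP_MOVABLE)
		return MIGRATE_MOVABLE;
	if (gfp_flags & __GFP_RECLAIMABLE)
		return MIGRATE_RECLAIMABLE;
	return MIGRATE_UNMOVABLE;
}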

List-based requires no configuration other than setting min_free_kbytes to
16384, but workloads might exist that break it down such that pages of
different types become badly mixed. It was suggested that zones should instead
be used to partition memory between the types to avoid this breakdown. That
works well and its behaviour is predictable, but it requires configuration
at boot time, which is a lot less flexible than the list-based approach. Both
approaches had their proponents and detractors.

Hence, the two patchsets posted are no longer mutually exclusive and will work
together when they are both applied.  This means that without configuration,
external fragmentation will be reduced as much as possible. However,
if the system administrator has a workload that requires higher levels
of availability, or uses varying numbers of huge pages between jobs,
ZONE_MOVABLE can be configured at boot time (the tests below use the
kernelcore parameter) so that it covers the maximum number of huge pages
required by any job.

This mail is intended to describe more about how the patches actually work
and provide some performance figures to the people who have made serious
comments about the patches in the past, mainly at VM Summit. I hope it will
help solidify discussions on these patch sets and ultimately lead to a decision
on which method or methods are worthy of merging to -mm for wider exposure.

In the past, it has been pointed out that the code is complicated and that
it is not particularly clear what the end effect of the patches is or why
they work. To give people a better understanding of what the patches are
actually doing, Andy and I put together some tools that can graph how pages
of different mobility types are distributed in memory. An example image from
an x86_64 machine is:

http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg

Each pixel represents a page of memory and each box represents
MAX_ORDER_NR_PAGES pages. The color of a pixel indicates the mobility type
of the page. In most cases, one box is one huge page; on x86_64 and some
x86, half a box is a huge page (with 4KiB pages and the default MAX_ORDER,
a box is 4MiB while a huge page is 2MiB). The legend for the colors is at
the top of the image; for further clarification (a sketch of how allocation
sites supply these hints follows the legend):

ZoneBoundary	- A black outline to the left of and above a box means a zone
		  starts there
Movable		- __GFP_MOVABLE pages
Reclaimable	- __GFP_RECLAIMABLE pages
Pinned		- Bootmem-allocated pages and other unmovable pages
per-cpu		- Allocated pages with no mappings and no count; these are
		  usually per-cpu pages, but a high-order allocation can
		  appear like this too
Movable-under-reclaim - __GFP_MOVABLE pages that currently have IO in flight
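
For illustration, allocation sites supply the hints above roughly as
follows. This is a hedged sketch: alloc_page(), kmalloc(), GFP_HIGHUSER and
GFP_KERNEL are the stock kernel interfaces, but the call sites and the
exact flag plumbing here are examples rather than code from the patches.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* User-visible data can be migrated or reclaimed: mark it movable. */
static struct page *movable_example(void)
{
	return alloc_page(GFP_HIGHUSER | __GFP_MOVABLE);
}

/* Caches that can be shrunk but not moved: mark them reclaimable. */
static void *reclaimable_example(size_t size)
{
	return kmalloc(size, GFP_KERNEL | __GFP_RECLAIMABLE);
}

/* Allocations with neither hint are grouped as unmovable/pinned. */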

During tests, the data required to generate this image is collected every
2 seconds. At the end of the test, all the samples are gathered together
and a video is created. The video shows an approximate view of memory
fragmentation over time, which very clearly shows trends in the locations
of allocations and in the areas under reclaim.

This is a video[*] of the vanilla kernel running a series of benchmarks on a
ppc64 machine.

video:  http://www.skynet.ie/~mel/anti-frag/2007-02-28/examplevideo-ppc64-2.6.20-mm2-vanilla.avi
frames: http://www.skynet.ie/~mel/anti-frag/2007-02-28/exampleframes-ppc64-2.6.20-mm2-vanilla.tar.gz


Notice that pages of different mobility types get scattered all over the
physical address space on the vanilla kernel because no effort is made
to place the pages.  2% of memory was kept free with min_free_kbytes, and
this accounts for the free block of pages towards the start of memory.
Notice also that the huge page allocations always come from there. *This*
is why setting min_free_kbytes to a high value allows higher-order
allocations to work for a period of time! As the buddy allocator favours
small blocks for allocation, the pages kept free were contiguous to begin
with. It works until that block gets split under excessive memory pressure;
after that, high-order allocations start failing again (the splitting is
sketched below). If min_free_kbytes were set to a higher value once the
system had been running for some time, high-order allocations would
continue to fail. This is why I've asserted before that setting
min_free_kbytes is not a fix for external fragmentation problems.
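
The splitting is easiest to see in code. Below is a simplified sketch
modelled on expand() in mm/page_alloc.c (pre-anti-frag, so a single free
list per order; struct page and struct free_area as in the 2.6.20 tree).
Once the allocator is forced to serve a request from the big block kept
free by min_free_kbytes, the remainder is scattered across the lower
orders and the reserve stops being contiguous.

/*
 * Serve an order-'low' request from an order-'high' free block: the
 * unused upper halves are handed back to progressively smaller free
 * lists, splitting the once-contiguous block.
 */
static void expand_block(struct page *page, int low, int high,
			 struct free_area *area)
{
	unsigned long size = 1UL << high;

	while (high > low) {
		area--;		/* step down to the next order's free list */
		high--;
		size >>= 1;
		list_add(&page[size].lru, &area->free_list);
		area->nr_free++;
	}
}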

Next is a video of a kernel patched with both list-based and zone-based
patches applied.

video:  http://www.skynet.ie/~mel/anti-frag/2007-02-28/examplevideo-ppc64-2.6.20-mm2-antifrag.avi
frames: http://www.skynet.ie/~mel/anti-frag/2007-02-28/exampleframes-ppc64-2.6.20-mm2-antifrag.tar.gz

kernelcore has been set to 60% of memory, so you will see a black line
where the zone starts. Note how the higher zone is always green (indicating
it is being used for movable pages) - this is how zone-based works. In the
remaining portions of memory, you'll see how the boxes (i.e. MAX_ORDER
areas) remain solid colors the majority of the time. This is the effect of
list-based, as it groups pages of similar mobility together. Note also
that, when allocating huge pages under load, the kernel fails to use all of
ZONE_MOVABLE. This is a problem with reclaim which patches from Andy
Whitcroft aim to fix up. Two sets of figures are posted below: the first
set is anti-frag only and the second includes Andy's patches.

It should be clear from the videos how and why anti-frag is successful at
what it does. To get higher success rates, defragmentation is needed to
move movable pages from sparsely populated huge-page-sized areas into
densely populated ones. It should also be clear that slab reclaim needs to
be a bit smarter, because the videos show "blue" pages that are very
sparsely populated but are not being reclaimed.

However, as anti-frag currently stands, it is very effective; the remaining
improvements are logical progressions rather than problems with the
fundamental idea. For example, on gekko-lp4, only 1% of memory can be
allocated as huge pages on the vanilla kernel, which roughly corresponds to
min_free_kbytes. With both patches applied, 51% of memory can be allocated
as huge pages.

The following are performance figures based on a number of tests on
different machines.

Kernbench Total CPU Time
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              121.34          119.00            119.60           ---    
elm3b14     x86-numaq          1527.57         1530.80           1529.26          1530.64 
elm3b245    x86_64              346.95          346.48            347.18           346.67  
gekko-lp1   ppc64               323.66          323.80            323.67           323.58  
gekko-lp4   ppc64               319.61          320.25            319.49           319.58  


Kernbench Total Elapsed Time
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              36.32           37.78             35.14            ---    
elm3b14     x86-numaq          426.08          427.34            426.76           427.73  
elm3b245    x86_64              96.50           96.03             96.34            96.11   
gekko-lp1   ppc64              172.17          171.74            172.06           171.73  
gekko-lp4   ppc64              325.38          326.26            324.90           324.83  


Percentage of memory allocated as huge pages under load
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              10              75                 17             0  
elm3b14     x86-numaq           21              27                 19             25 
elm3b245    x86_64              34              66                 27             62 
gekko-lp1   ppc64               2               14                 4              20 
gekko-lp4   ppc64               1               24                 3              17 


Percentage of memory allocated as huge pages at rest at end of test
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              17              76                 22             0  
elm3b14     x86-numaq           69              84                 55             82 
elm3b245    x86_64              41              82                 44             82 
gekko-lp1   ppc64               3               61                 9              69 
gekko-lp4   ppc64               1               32                 4              51 

These are figures based on kernels additionally patched with Andy
Whitcroft's reclaim patches. You will see that the zone-based kernel is
getting success rates closer to 40% as one would expect, although there is
still something amiss.

Kernbench Total CPU Time        
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           1528.42         1531.25           1528.48          1531.04 
elm3b245    x86_64              347.48          346.09            346.67           346.04  
gekko-lp1   ppc64               323.74          323.79            323.45           323.77  
gekko-lp4   ppc64               319.65          319.72            319.74           319.70  
                                
                                
Kernbench Total Elapsed Time    
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           427.00          427.85            426.18           427.42  
elm3b245    x86_64              96.72           96.03             96.58            96.27   
gekko-lp1   ppc64               172.07          172.07            171.96           172.72  
gekko-lp4   ppc64               325.41          324.97            325.71           324.94  
                                
                                
Percentage of memory allocated as huge pages under load
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           24              29                 23             26 
elm3b245    x86_64              33              76                 42             75 
gekko-lp1   ppc64               2               23                 9              29 
gekko-lp4   ppc64               1               24                 24             40 
                                
                                
Percentage of memory allocated as huge pages at rest at end of test
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           52              84                 64             82 
elm3b245    x86_64              51              87                 44             85 
gekko-lp1   ppc64               7               69                 25             67 
gekko-lp4   ppc64               3               43                 29             53 

The patches go a long way to making sure that high-order allocations work
and particularly that the hugepage pool can be resized once the system has
been running. With the clustering of high-order atomic allocations, I have
some confidence that allocating contiguous jumbo frames will work even with
loads performing lots of IO. I think the videos show how the patches actually
work in the clearest possible manner.
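
As a concrete example of the resizing mentioned above, the hugepage pool
is grown at run time by writing to /proc/sys/vm/nr_hugepages (the shell
equivalent is "echo 128 > /proc/sys/vm/nr_hugepages"). A trivial userspace
snippet:

#include <stdio.h>

int main(void)
{
	/*
	 * Ask the kernel to grow the hugepage pool to 128 pages.  On a
	 * long-running vanilla kernel this tends to fall short once
	 * memory is fragmented; with anti-frag it keeps working.
	 */
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

	if (!f) {
		perror("/proc/sys/vm/nr_hugepages");
		return 1;
	}
	fprintf(f, "128\n");
	fclose(f);
	return 0;
}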

I am of the opinion that both approaches have their advantages and
disadvantages. Given a choice between the two, I prefer list-based
because of its flexibility and it should also help high-order kernel
allocations. However, by applying both, the disadvantages of list-based are
covered and there still appears to be no performance loss as a result. Hence,
I'd like to see both merged.  Any opinion on merging these patches into -mm
for wider testing?




Here is a list of videos showing the different patched kernels on each
machine, for the curious. Be warned that they are all pretty large, which
means the guys hosting the machine are going to love me.

elm3b14-vanilla       http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-vanilla.avi
elm3b14-list-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-listbased.avi
elm3b14-zone-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-zonebased.avi
elm3b14-combined      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-combined.avi

elm3b245-vanilla      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-vanilla.avi
elm3b245-list-based   http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-listbased.avi
elm3b245-zone-based   http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-zonebased.avi
elm3b245-combined     http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-combined.avi

gekko-lp1-vanilla     http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-vanilla.avi
gekko-lp1-list-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-listbased.avi
gekko-lp1-zone-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-zonebased.avi
gekko-lp1-combined    http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-combined.avi

gekko-lp4-vanilla     http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-vanilla.avi
gekko-lp4-list-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-listbased.avi
gekko-lp4-zone-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-zonebased.avi
gekko-lp4-combined    http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-combined.avi

Notes:
1. The performance figures show small variances, both performance gains and
   regressions. The biggest gains tend to be on x86_64.
2. The x86 figures are based on a NUMA-Q, which is an ancient machine. I
   didn't have a more modern machine available to run these tests on.
3. The Vanilla kernel is an unpatched 2.6.20-mm2 kernel.
4. List-base represents the "list-based" patches described above, which
   group pages by mobility type.
5. Zone-base represents the "zone-based" patches, which group movable pages
   together in one zone as described.
6. Combined is with both sets of patches applied.
7. The kernbench figures are based on an average of 3 iterations. The
   figures consistently show that the vanilla and patched kernels have
   similar performance. The anti-frag kernels are usually faster on x86_64.
8. The success rates for the allocation of hugepages should always be at
   least 40%; anything lower implies that reclaim is not reclaiming pages
   that it could. The second set of figures above is based on kernels
   patched with additional reclaim fixes from Andy.
9. The bl6-13 figures are incomplete because the machine was deleted from
   the test grid and never came back. They are left in because it was a
   machine that showed reliable performance improvements from the patches.
10. The videos are a bit blurry due to compression; high-resolution images
    can be generated if needed.

[*] On my Debian Etch system, xine-ui works for playing videos. On other
	systems, I found ffplay from the ffmpeg package worked. If neither
	of these work for you, the tar.gz contains the JPG files making up
	the frames and you can view them with any image viewer.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-01 10:12 The performance and behaviour of the anti-fragmentation related patches Mel Gorman
@ 2007-03-02  0:09 ` Andrew Morton
  2007-03-02  0:44   ` Linus Torvalds
                     ` (5 more replies)
  2007-03-02  1:52 ` Bill Irwin
  1 sibling, 6 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02  0:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 10:12:50 +0000
mel@skynet.ie (Mel Gorman) wrote:

> Any opinion on merging these patches into -mm
> for wider testing?

I'm a little reluctant to make changes to -mm's core mm unless those
changes are reasonably certain to be on track for mainline, so let's talk
about that.

What worries me is memory hot-unplug and per-container RSS limits.  We
don't know how we're going to do either of these yet, and it could well be
that the anti-frag work significantly complicates whatever we end up
doing there.

For prioritisation purposes I'd judge that memory hot-unplug is of similar
value to the antifrag work (because memory hot-unplug permits DIMM
poweroff).

And I'd judge that per-container RSS limits are of considerably more value
than antifrag (in fact per-container RSS might be a superset of antifrag,
in the sense that per-container RSS and containers could be abused to fix
the i-cant-get-any-hugepages problem, dunno).



So some urgent questions are: how are we going to do mem hotunplug and
per-container RSS?



Our basic unit of memory management is the zone.  Right now, a zone maps
onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
suspect that a good way to solve both per-container RSS and mem hotunplug
is to split the zone concept away from its hardware limitations: create a
"software zone" and a "hardware zone".  All the existing page allocator and
reclaim code remains basically unchanged, and it operates on "software
zones".  Each software zone always lies within a single hardware zone. 
The software zones are resizeable.  For per-container RSS we give each
container one (or perhaps multiple) resizeable software zones.

For memory hotunplug, some of the hardware zone's software zones are marked
reclaimable and some are not; DIMMs which are wholly within reclaimable
zones can be depopulated and powered off or removed.
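
To illustrate, a software zone might carry little more than resizeable
boundaries, a parent hardware zone and a couple of policy bits. This is an
invented sketch of the proposal; none of these names exist in the kernel.

/* Invented names -- a sketch of the proposal, not kernel code. */
struct hw_zone;			/* today's struct zone: DMA, NORMAL, ... */

struct sw_zone {
	struct hw_zone	*parent;	/* always within one hardware zone */
	unsigned long	start_pfn;	/* resizeable boundaries */
	unsigned long	nr_pages;
	int		reclaimable;	/* may be emptied for DIMM poweroff */
	void		*container;	/* owner, for per-container RSS */
};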

NUMA and cpusets screwed up: they've gone and used nodes as their basic
unit of memory management whereas they should have used zones.  This will
need to be untangled.


Anyway, that's just a shot in the dark.  Could be that we implement unplug
and RSS control by totally different means.  But I do wish that we'd sort
out what those means will be before we potentially complicate the story a
lot by adding antifragmentation.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
@ 2007-03-02  0:44   ` Linus Torvalds
  2007-03-02  1:52     ` Balbir Singh
                       ` (2 more replies)
  2007-03-02  1:39   ` Balbir Singh
                     ` (4 subsequent siblings)
  5 siblings, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2007-03-02  0:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, mbligh,
	linux-mm, linux-kernel


On Thu, 1 Mar 2007, Andrew Morton wrote:
> 
> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?

Also: how are we going to do this in virtualized environments? Usually the 
people who care about memory hotunplug are exactly the same people who 
also care (or claim to care, or _will_ care) about virtualization.

My personal opinion is that while I'm not a huge fan of virtualization, 
these kinds of things really _can_ be handled more cleanly at that layer, 
and not in the kernel at all. Afaik, it's what IBM already does, and has 
been doing for a while. There's no shame in looking at what already works, 
especially if it's simpler.

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
  2007-03-02  0:44   ` Linus Torvalds
@ 2007-03-02  1:39   ` Balbir Singh
  2007-03-02  2:34   ` KAMEZAWA Hiroyuki
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2007-03-02  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?
> 
> 
> 
> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 
> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.
> 
> 
> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.  But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.
> 

Paul Menage had suggested something very similar in response to the RFC
for memory controllers I sent out, and it was suggested that we create
small zones (roughly 64 MB) to avoid the issue of a zone/node not being
shareable across containers. Even with a small size, there are some
issues. The following thread has the details:

	http://lkml.org/lkml/2006/10/30/120

RSS accounting is very easy (with minimal changes to the core mm);
supplemented with an efficient per-container reclaimer, it should be
straightforward to implement a good per-container RSS controller.
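
A minimal sketch of what such accounting could look like (invented names,
not from any posted patch): charge when a page is mapped into a container,
uncharge on unmap, and let the caller kick the per-container reclaimer when
the limit is hit.

#include <asm/atomic.h>
#include <linux/errno.h>

struct container {
	atomic_t	rss;		/* pages currently mapped */
	int		rss_limit;	/* configured ceiling, in pages */
};

/* Called when a page is mapped into the container's address space. */
static int container_charge_rss(struct container *c)
{
	if (atomic_inc_return(&c->rss) > c->rss_limit) {
		atomic_dec(&c->rss);
		return -ENOMEM;	/* caller runs per-container reclaim */
	}
	return 0;
}

/* Called on unmap. */
static void container_uncharge_rss(struct container *c)
{
	atomic_dec(&c->rss);
}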

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:44   ` Linus Torvalds
@ 2007-03-02  1:52     ` Balbir Singh
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02 16:58     ` Mel Gorman
  2007-03-02 17:05     ` Joel Schopp
  2 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2007-03-02  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> On Thu, 1 Mar 2007, Andrew Morton wrote:
>> So some urgent questions are: how are we going to do mem hotunplug and
>> per-container RSS?
> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.
> 
> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.

Could you please clarify as to what "that layer" means - is it the
firmware/hardware for virtualization? or does it refer to user space?
With virtualization, the Linux kernel would end up acting as a hypervisor,
and resource-management support like per-container RSS would need to be
built into the kernel.

It would also be useful to have a resource controller like per-container
RSS control (container refers to a task grouping) within the kernel for
non-virtualized environments as well.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-01 10:12 The performance and behaviour of the anti-fragmentation related patches Mel Gorman
  2007-03-02  0:09 ` Andrew Morton
@ 2007-03-02  1:52 ` Bill Irwin
  2007-03-02 10:38   ` Mel Gorman
  1 sibling, 1 reply; 99+ messages in thread
From: Bill Irwin @ 2007-03-02  1:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> These are figures based on kernels additionally patched with Andy
> Whitcroft's reclaim patches. You will see that the zone-based kernel is
> getting success rates closer to 40% as one would expect, although there is
> still something amiss.

Yes, combining the two should do at least as well as either in
isolation. Are there videos of each of the two in isolation? Maybe that
would give someone insight into what's happening.


On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> Kernbench Total CPU Time        

Oh dear. How do the other benchmarks look?


On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> The patches go a long way to making sure that high-order allocations work
> and particularly that the hugepage pool can be resized once the system has
> been running. With the clustering of high-order atomic allocations, I have
> some confidence that allocating contiguous jumbo frames will work even with
> loads performing lots of IO. I think the videos show how the patches actually
> work in the clearest possible manner.
> I am of the opinion that both approaches have their advantages and
> disadvantages. Given a choice between the two, I prefer list-based
> because of its flexibility and it should also help high-order kernel
> allocations. However, by applying both, the disadvantages of list-based are
> covered and there still appears to be no performance loss as a result. Hence,
> I'd like to see both merged.  Any opinion on merging these patches into -mm
> for wider testing?

Exhibiting a workload where the list patch breaks down and the zone
patch rescues it might help if it's felt that the combination isn't as
good as lists in isolation. I'm sure one can be dredged up somewhere.
Either that or someone will eventually spot why the combination doesn't
get as many available maximally contiguous regions as the list patch.
By and large I'm happy to see anything go in that inches hugetlbfs
closer to a backward compatibility wrapper over ramfs.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
  2007-03-02  0:44   ` Linus Torvalds
  2007-03-02  1:39   ` Balbir Singh
@ 2007-03-02  2:34   ` KAMEZAWA Hiroyuki
  2007-03-02  3:05   ` Christoph Lameter
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-02  2:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mel, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 16:09:15 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 1 Mar 2007 10:12:50 +0000
> mel@skynet.ie (Mel Gorman) wrote:
> 
> > Any opinion on merging these patches into -mm
> > for wider testing?
> 
> I'm a little reluctant to make changes to -mm's core mm unless those
> changes are reasonably certain to be on track for mainline, so let's talk
> about that.
> 
> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complicates whatever we end up
> doing there.
> 
> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

Regarding memory hot-unplug, I am now writing a new patch set for memory
unplug to show my overview and roadmap. I am debugging it now and think I
will be able to post it as an RFC in a week.

At the least, ZONE_MOVABLE (or something else that partitions memory) is
necessary for memory hot-unplug such as DIMM poweroff. (I am currently
using my own ZONE_MOVABLE patch, but it is fine to migrate to Mel's one if
it is ready to be merged.)


> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 
> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
Hmm... software zones seem attractive.
I remember someone posted a pseudo-zone (pzone) patch in the past.

-Kame


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
                     ` (2 preceding siblings ...)
  2007-03-02  2:34   ` KAMEZAWA Hiroyuki
@ 2007-03-02  3:05   ` Christoph Lameter
  2007-03-02  3:57     ` Nick Piggin
  2007-03-02 13:50   ` Arjan van de Ven
  2007-03-02 15:29   ` Rik van Riel
  5 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  3:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007, Andrew Morton wrote:

> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complicates whatever we end up
> doing there.

Right now it seems that the per-container RSS limits differ from the
statistics calculated per zone. There would be a conceptual overlap, but
containers are optional and track numbers differently. There is no RSS
counter in a zone, for example.

Memory hot-unplug would directly tap into the anti-frag work. Essentially,
only the zone with movable pages would be unpluggable without additional
measures. Making slab items and other currently fixed allocations movable
requires work anyway. A new zone concept will not help.

> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

I would say that anti-frag / defrag enables memory unplug.

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

How do they relate? How can a container perform antifrag? Do you mean that
a container reserves a portion of a hardware zone and becomes a software
zone?

> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?

Separately. There is no need to mingle these two together.

> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I

That's a value judgement that I doubt. Zone-based balancing is bad and has
been repeatedly patched up so that it works with the usual loads.

> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.

Resizable software zones? Are they contiguous or not? If not then we
add another layer to the defrag problem.

> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.

So subzones indeed. How about calling the MAX_ORDER entities that Mel's 
patches create "software zones"?

> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.

zones have hardware characteristics at their core. In a NUMA setting, zones
determine the performance of loads from those areas. I would like to have
zones and nodes merged. Maybe extend node numbers into the negative range:
-1 = DMA, -2 = DMA32 etc? All systems then manage the "nones" (nodes/zones
merged). One could create additional "virtual" nones after the real nones
that have hardware characteristics behind them. The virtual nones would be
something like the software zones? Contain MAX_ORDER portions of hardware
nones?

> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.  But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.

Hmmm.... My shot:

1. Merge zones/nodes

2. Create new virtual zones/nodes that are subsets of MAX_ORDER blocks of
the real zones/nodes. These may then have additional characteristics
(a possible encoding is sketched below) such as

A. moveable/unmovable
B. DMA restrictions
C. container assignment.
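
Purely as an illustration of the numbering idea (no such scheme exists),
the merged node/zone space could be encoded as:

enum none_id {
	NONE_DMA32	= -2,	/* 32-bit addressable memory */
	NONE_DMA	= -1,	/* ISA DMA-capable memory */
	NONE_NODE0	= 0,	/* first real NUMA node */
	/* 1..nr_node_ids-1: the remaining real "nones" */
	/* >= nr_node_ids: virtual nones (movable, containers, ...) */
};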


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  1:52     ` Balbir Singh
@ 2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
                           ` (3 more replies)
  0 siblings, 4 replies; 99+ messages in thread
From: Linus Torvalds @ 2007-03-02  3:44 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel


On Fri, 2 Mar 2007, Balbir Singh wrote:
>
> > My personal opinion is that while I'm not a huge fan of virtualization,
> > these kinds of things really _can_ be handled more cleanly at that layer,
> > and not in the kernel at all. Afaik, it's what IBM already does, and has
> > been doing for a while. There's no shame in looking at what already works,
> > especially if it's simpler.
> 
> Could you please clarify as to what "that layer" means - is it the
> firmware/hardware for virtualization? or does it refer to user space?

Virtualization in general. We don't know what it is - in IBM machines it's 
a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
KVM, it's obviously a host Linux kernel/user-process combination.

The point being that in the guests, hotunplug is almost useless (for 
bigger ranges), and we're much better off just telling the virtualization 
hosts on a per-page level whether we care about a page or not, than to 
worry about fragmentation.

And in hosts, we usually don't care EITHER, since it's usually done in a 
hypervisor.

> It would also be useful to have a resource controller like per-container
> RSS control (container refers to a task grouping) within the kernel for
> non-virtualized environments as well.

.. but this has again no impact on anti-fragmentation.

In other words, I really don't see a huge upside. I see *lots* of 
downsides, but upsides? Not so much. Almost everybody who wants unplug 
wants virtualization, and right now none of the "big virtualization" 
people would want to have kernel-level anti-fragmentation anyway since 
they'd need to do it on their own.

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:05   ` Christoph Lameter
@ 2007-03-02  3:57     ` Nick Piggin
  2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:20       ` Paul Mundt
  0 siblings, 2 replies; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  3:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > value to the antifrag work (because memory hot-unplug permits DIMM
> > poweroff).
> 
> I would say that anti-frag / defrag enables memory unplug.

Well that really depends. If you want to have any sort of guaranteed
amount of unplugging or shrinking (or hugepage allocating), then antifrag
doesn't work because it is a heuristic.

One thing that worries me about anti-fragmentation is that people might
actually start _using_ higher order pages in the kernel. Then fragmentation
comes back, and it's worse because now it is not just the fringe hugepage or
unplug users (who can anyway work around the fragmentation by allocating
from reserve zones).

> > Our basic unit of memory management is the zone.  Right now, a zone maps
> > onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> 
> That's a value judgement that I doubt. Zone-based balancing is bad and has
> been repeatedly patched up so that it works with the usual loads.

Shouldn't we fix it instead of deciding it is broken and adding another
layer on top that supposedly does better balancing?

> > suspect that a good way to solve both per-container RSS and mem hotunplug
> > is to split the zone concept away from its hardware limitations: create a
> > "software zone" and a "hardware zone".  All the existing page allocator and
> > reclaim code remains basically unchanged, and it operates on "software
> > zones".  Each software zone always lies within a single hardware zone. 
> > The software zones are resizeable.  For per-container RSS we give each
> > container one (or perhaps multiple) resizeable software zones.
> 
> Resizable software zones? Are they contiguous or not? If not then we
> add another layer to the defrag problem.

I think Andrew is proposing that we work out what the problem is first.
I don't know what the defrag problem is, but I know that fragmentation
is unavoidable unless you have fixed size areas for each different size
of unreclaimable allocation.

> > NUMA and cpusets screwed up: they've gone and used nodes as their basic
> > unit of memory management whereas they should have used zones.  This will
> > need to be untangled.
> 
> zones have hardware characteristics at its core. In a NUMA setting zones 
> determine the performance of loads from those areas. I would like to have
> zones and nodes merged. Maybe extend node numbers into the negative area
> -1 = DMA -2 DMA32 etc? All systems then manage the "nones" (node / zones 
> meerged). One could create additional "virtual" nones after the real nones 
> that have hardware characteristics behind them. The virtual nones would be 
> something like the software zones? Contain MAX_ORDER portions of hardware 
> nones?

But just because zones are hardware _now_ doesn't mean they have to stay
that way. The upshot is that a lot of work for zones is already there.

> > Anyway, that's just a shot in the dark.  Could be that we implement unplug
> > and RSS control by totally different means.  But I do wish that we'd sort
> > out what those means will be before we potentially complicate the story a
> > lot by adding antifragmentation.
> 
> Hmmm.... My shot:
> 
> 1. Merge zones/nodes
> 
> 2. Create new virtual zones/nodes that are subsets of MAX_order blocks of 
> the real zones/nodes. These may then have additional characteristics such
> as 
> 
> A. moveable/unmovable
> B. DMA restrictions
> C. container assignment.

There are alternatives to adding a new layer of virtual zones. We could try
using zones, even.

zones aren't perfect right now, but they are quite similar to what you
want (ie. blocks of memory). I think we should first try to generalise what
we have rather than adding another layer.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
@ 2007-03-02  3:59         ` Andrew Morton
  2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  4:18         ` Balbir Singh
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> In other words, I really don't see a huge upside. I see *lots* of 
> downsides, but upsides? Not so much. Almost everybody who wants unplug 
> wants virtualization, and right now none of the "big virtualization" 
> people would want to have kernel-level anti-fragmentation anyway since 
> they'd need to do it on their own.

Agree with all that, but you're missing the other application: power
saving.  FBDIMMs take eight watts a pop.  If we can turn them off when the
system is unloaded, we save either four or all eight watts (assuming we can
get Intel to part with the information needed to do this; I fear an ACPI
method will ensue).

There's a whole lot of complexity and work in all of this, but 24*8 watts
(192 watts on a fully populated machine) is a lot of watts, and it's worth
striving for.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:57     ` Nick Piggin
@ 2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:21         ` Nick Piggin
  2007-03-02  4:29         ` Andrew Morton
  2007-03-02  4:20       ` Paul Mundt
  1 sibling, 2 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.

We would need additional measures, such as real defragmentation, and to
make more structures movable.

> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).

Yes, we (SGI) need exactly that: use of higher-order pages in the kernel
in order to reduce the overhead of managing page structs for large-I/O and
large-memory applications. We need appropriate measures to deal with the
fragmentation problem.

> > That's a value judgement that I doubt. Zone-based balancing is bad and
> > has been repeatedly patched up so that it works with the usual loads.
> 
> Shouldn't we fix it instead of deciding it is broken and adding another
> layer on top that supposedly does better balancing?

We need to reduce the real hardware zones as much as possible. Most
high-performance architectures have no need for additional DMA zones, for
example, and do not have to deal with the complexities that arise there.

> But just because zones are hardware _now_ doesn't mean they have to stay
> that way. The upshot is that a lot of work for zones is already there.

Well you cannot get there without the nodes. The control of memory 
allocations with user space support etc only comes with the nodes.

> > A. moveable/unmovable
> > B. DMA restrictions
> > C. container assignment.
> 
> There are alternatives to adding a new layer of virtual zones. We could try
> using zones, even.

No, merge them into one thing and handle them as one. No difference between
zones and nodes anymore.
 
> zones aren't perfect right now, but they are quite similar to what you
> want (ie. blocks of memory). I think we should first try to generalise what
> we have rather than adding another layer.

Yes that would mean merging nodes and zones. So "nones".


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
@ 2007-03-02  4:18         ` Balbir Singh
  2007-03-02  5:13         ` Jeremy Fitzhardinge
  2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2007-03-02  4:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 2 Mar 2007, Balbir Singh wrote:
>>> My personal opinion is that while I'm not a huge fan of virtualization,
>>> these kinds of things really _can_ be handled more cleanly at that layer,
>>> and not in the kernel at all. Afaik, it's what IBM already does, and has
>>> been doing for a while. There's no shame in looking at what already works,
>>> especially if it's simpler.
>> Could you please clarify as to what "that layer" means - is it the
>> firmware/hardware for virtualization? or does it refer to user space?
> 
> Virtualization in general. We don't know what it is - in IBM machines it's 
> a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
> KVM, it's obviously a host Linux kernel/user-process combination.
> 

Thanks for clarifying.

> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.
> 
> And in hosts, we usually don't care EITHER, since it's usually done in a 
> hypervisor.
> 
>> It would also be useful to have a resource controller like per-container
>> RSS control (container refers to a task grouping) within the kernel or
>> non-virtualized environments as well.
> 
> .. but this has again no impact on anti-fragmentation.
> 

Yes, I agree that anti-fragmentation and resource management are independent
of each other. I must admit to being a bit selfish here, in that my main
interest is in resource management, and we would love to see a
well-written and easy-to-understand resource-management infrastructure and
controllers to control CPU and memory usage. Since the issue of
per-container RSS control came up, I wanted to ensure that we do not mix
up resource control and anti-fragmentation.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:57     ` Nick Piggin
  2007-03-02  4:06       ` Christoph Lameter
@ 2007-03-02  4:20       ` Paul Mundt
  1 sibling, 0 replies; 99+ messages in thread
From: Paul Mundt @ 2007-03-02  4:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 04:57:51AM +0100, Nick Piggin wrote:
> On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> > On Thu, 1 Mar 2007, Andrew Morton wrote:
> > > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > > value to the antifrag work (because memory hot-unplug permits DIMM
> > > poweroff).
> > 
> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.
> 
> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).
> 
There are two sides to that: the ability to use higher-order pages in the
kernel also means that it is possible to use larger TLB entries while
keeping the base page size small. There are already many places in the
kernel that attempt to use the largest possible size when setting up the
entries, and this is something that those of us with tiny software-managed
TLBs are huge fans of -- some platforms have even opted to do perverse
things such as scanning for contiguous PTEs and bumping to the next order
automatically at set_pte() time (sketched below).
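
In outline, the set_pte()-time trick reads roughly as follows. This is
illustrative pseudo-kernel code, not taken from any architecture:
run_is_promotable() is an invented helper, and it assumes the run does not
cross a page-table page.

/*
 * After installing a PTE, check whether it completes a naturally
 * aligned, physically contiguous run; if so, the caller can install
 * one large TLB entry covering the whole run.
 */
static int run_is_promotable(pte_t *ptep, unsigned long addr, int order)
{
	unsigned long span = (1UL << order) * PAGE_SIZE;
	pte_t *first = ptep - ((addr & (span - 1)) >> PAGE_SHIFT);
	unsigned long pfn;
	int i;

	if (!pte_present(first[0]))
		return 0;
	/* A large entry needs congruent virtual/physical alignment. */
	pfn = pte_pfn(first[0]);
	if (pfn & ((1UL << order) - 1))
		return 0;

	for (i = 1; i < (1 << order); i++)
		if (!pte_present(first[i]) || pte_pfn(first[i]) != pfn + i)
			return 0;	/* hole or discontiguity */

	return 1;	/* bump to the next TLB entry size */
}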

Unplug is also interesting from a power management point of view.
Powering off is still more attractive than self-refresh, for example, but
could also be used at run-time depending on the workload.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:06       ` Christoph Lameter
@ 2007-03-02  4:21         ` Nick Piggin
  2007-03-02  4:31           ` Christoph Lameter
  2007-03-02  4:29         ` Andrew Morton
  1 sibling, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  4:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 08:06:25PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > I would say that anti-frag / defrag enables memory unplug.
> > 
> > Well that really depends. If you want to have any sort of guaranteed
> > amount of unplugging or shrinking (or hugepage allocating), then antifrag
> > doesn't work because it is a heuristic.
> 
> We would need additional measures such as real defrag and make more 
> structure movable.
> 
> > One thing that worries me about anti-fragmentation is that people might
> > actually start _using_ higher order pages in the kernel. Then fragmentation
> > comes back, and it's worse because now it is not just the fringe hugepage or
> > unplug users (who can anyway work around the fragmentation by allocating
> > from reserve zones).
> 
> Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> in order to reduce overhead of managing page structs for large I/O and 
> large memory applications. We need appropriate measures to deal with the 
> fragmentation problem.

I don't understand why, out of any architecture, ia64 would have to hack
around this in software :(

> > > That's a value judgement that I doubt. Zone-based balancing is bad and
> > > has been repeatedly patched up so that it works with the usual loads.
> > 
> > Shouldn't we fix it instead of deciding it is broken and add another layer
> > on top that supposedly does better balancing?
> 
> We need to reduce the real hardware zones as much as possible. Most high 
> performance architectures have no need for additional DMA zones f.e. and
> do not have to deal with the complexities that arise there.

And then you want to add something else on top of them?

> > But just because zones are hardware _now_ doesn't mean they have to stay
> > that way. The upshot is that a lot of work for zones is already there.
> 
> Well you cannot get there without the nodes. The control of memory 
> allocations with user space support etc only comes with the nodes.
> 
> > > A. moveable/unmovable
> > > B. DMA restrictions
> > > C. container assignment.
> > 
> > There are alternatives to adding a new layer of virtual zones. We could try
> > using zones, even.
> 
> No, merge them into one thing and handle them as one. No difference between 
> zones and nodes anymore.
>  
> > zones aren't perfect right now, but they are quite similar to what you
> > want (ie. blocks of memory). I think we should first try to generalise what
> > we have rather than adding another layer.
> 
> Yes that would mean merging nodes and zones. So "nones".

Yes, this is what Andrew just said. But you then wanted to add virtual zones
or something on top. I just don't understand why. You agree that merging
nodes and zones is a good idea. Did I miss the important post where some
bright person discovered why merging zones and "virtual zones" is a bad
idea?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:21         ` Nick Piggin
@ 2007-03-02  4:29         ` Andrew Morton
  2007-03-02  4:33           ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02  4:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 20:06:25 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> No, merge them into one thing and handle them as one. No difference between 
> zones and nodes anymore.

Sorry, but this is crap.  zones and nodes are distinct, physical concepts
and you're kidding yourself if you think you can somehow fudge things to make
one of them just go away.

Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
in which we can fudge away the distinction between
bus-addresses-which-have-the-32-upper-bits-zero and
memory-which-is-local-to-each-socket.

No matter how hard those hands are waving.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:21         ` Nick Piggin
@ 2007-03-02  4:31           ` Christoph Lameter
  2007-03-02  5:06             ` Nick Piggin
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > in order to reduce overhead of managing page structs for large I/O and 
> > large memory applications. We need appropriate measures to deal with the 
> > fragmentation problem.
> 
> I don't understand why, out of any architecture, ia64 would have to hack
> around this in software :(

Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
very useful for the large number of small files that are around. But for 
the large streams of data you would want other methods of handling these.

If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has 
to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O 
bandwidths and leads to huge scatter gather lists (and we are limited in 
terms of the number of items on those lists in many drivers). Our future 
platforms have up to several petabytes of memory. There needs to be some 
way to handle these capacities in an efficient way. We cannot wait 
an hour for the terabyte to reach the disk.
 
> > We need to reduce the real hardware zones as much as possible. Most high 
> > performance architectures have no need for additional DMA zones f.e. and
> > do not have to deal with the complexities that arise there.
> 
> And then you want to add something else on top of them?

zones are basically managing a number of MAX_ORDER chunks. What is added 
here is the categorization of these MAX_ORDER chunks in order to ensure 
movability, and thus defragmentability, of most of them. Or the upper 
layer may limit the number of those chunks assigned to a certain 
container.

> > Yes that would mean merging nodes and zones. So "nones".
> 
> Yes, this is what Andrew just said. But you then wanted to add virtual zones
> or something on top. I just don't understand why. You agree that merging
> nodes and zones is a good idea. Did I miss the important post where some
> bright person discovered why merging zones and "virtual zones" is a bad
> idea?

Hmmm.. I usually talk about the "virtual zones" as virtual nodes. But we 
are basically at the same point there. Node level controls and APIs exist and 
can even be used from user space. A container could just be a special node 
and then the allocations to this container could be controlled via the 
existing APIs.

A virtual zone/node would be assigned a number of MAX_ORDER blocks from 
real zones/nodes. Then it may hopefully be managed like a real node. In 
the original zone/node these MAX_ORDER blocks would show up as 
unavailable. The "upper" layer therefore is the existing node/zone layer. 
The virtual zones/nodes just steal memory from the real ones.
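
Purely as an illustration of how such stealing could look -- nothing
like these structures exists in the tree, this is just a toy sketch of
the idea:

#include <stdio.h>

#define NR_BLOCKS 16		/* MAX_ORDER blocks in the real node */

static int owner[NR_BLOCKS];	/* 0 = real node, else virtual node id */

/* Move one free MAX_ORDER block from the real node to virtual node 'id'. */
static int vnode_steal_block(int id)
{
	int b;

	for (b = 0; b < NR_BLOCKS; b++)
		if (owner[b] == 0) {
			owner[b] = id;	/* unavailable to the real node */
			return b;
		}
	return -1;
}

int main(void)
{
	int b = vnode_steal_block(1);	/* container "node" 1 */

	printf("virtual node 1 now owns MAX_ORDER block %d\n", b);
	return 0;
}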



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:29         ` Andrew Morton
@ 2007-03-02  4:33           ` Christoph Lameter
  2007-03-02  4:58             ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007, Andrew Morton wrote:

> Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> and you're kidding yourself if you think you can somehow fudge things to make
> one of them just go away.
> 
> Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> in which we can fudge away the distinction between
> bus-addresses-which-have-the-32-upper-bits-zero and
> memory-which-is-local-to-each-socket.

Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
relevant memory from the base zone/node.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:33           ` Christoph Lameter
@ 2007-03-02  4:58             ` Andrew Morton
  0 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 20:33:04 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Thu, 1 Mar 2007, Andrew Morton wrote:
> 
> > Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> > and you're kidding yourself if you think you can somehow fudge things to make
> > one of them just go away.
> > 
> > Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> > in which we can fudge away the distinction between
> > bus-addresses-which-have-the-32-upper-bits-zero and
> > memory-which-is-local-to-each-socket.
> 
> Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
> relevant memory from the base zone/node.

You're using terms which I've never seen described anywhere.

Please, just stop here.  Give us a complete design proposal which we can
understand and review.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:31           ` Christoph Lameter
@ 2007-03-02  5:06             ` Nick Piggin
  2007-03-02  5:40               ` Christoph Lameter
  2007-03-02  5:50               ` Christoph Lameter
  0 siblings, 2 replies; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  5:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 08:31:24PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > > in order to reduce overhead of managing page structs for large I/O and 
> > > large memory applications. We need appropriate measures to deal with the 
> > > fragmentation problem.
> > 
> > I don't understand why, out of any architecture, ia64 would have to hack
> > around this in software :(
> 
> Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
> very useful for the large number of small files that are around. But for 
> the large streams of data you would want other methods of handling these.
> 
> If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has 
> to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O 
> bandwidths and leads to huge scatter gather lists (and we are limited in 
> terms of the number of items on those lists in many drivers). Our future 
> platforms have up to several petabytes of memory. There needs to be some 
> way to handle these capacities in an efficient way. We cannot wait 
> an hour for the terabyte to reach the disk.

I guess you mean 256 billion page structs.
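
Concretely, as a standalone sketch of the arithmetic (2^40 bytes being
one terabyte and 2^50 one petabyte, with 4k base pages assumed):

#include <stdio.h>

int main(void)
{
	unsigned long long page_size = 1ULL << 12;	/* 4k base pages */

	printf("1 TB (2^40): %llu page structs (2^28)\n",
	       (1ULL << 40) / page_size);
	printf("1 PB (2^50): %llu page structs (2^38)\n",
	       (1ULL << 50) / page_size);
	return 0;
}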

So what do you mean by efficient? I guess you aren't talking about CPU
efficiency, because even if you make the IO subsystem submit larger
physical IOs, you still have to deal with 256 billion TLB entries, the
pagecache has to deal with 256 billion struct pages, so does the
filesystem code to build the bios.

So you are having problems with your IO controller's handling of sg
lists?



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:59         ` Andrew Morton
@ 2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
  2007-03-02 16:20             ` Mark Gross
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2007-03-02  5:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel


On Thu, 1 Mar 2007, Andrew Morton wrote:
>
> On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > In other words, I really don't see a huge upside. I see *lots* of 
> > downsides, but upsides? Not so much. Almost everybody who wants unplug 
> > wants virtualization, and right now none of the "big virtualization" 
> people would want to have kernel-level anti-fragmentation anyway since 
> > they'd need to do it on their own.
> 
> Agree with all that, but you're missing the other application: power
> saving.  FBDIMMs take eight watts a pop.

This is a hardware problem. Let's see how long it takes for Intel to 
realize that FBDIMM's were a hugely bad idea from a power perspective.

Yes, the same issues exist for other DRAM forms too, but to a *much* 
smaller degree.

Also, IN PRACTICE you're never ever going to see this anyway. Almost 
everybody wants bank interleaving, because it's a huge performance win on 
many loads. That, in turn, means that your memory will be spread out over 
multiple DIMM's even for a single page, much less any bigger area.

In other words - forget about DRAM power savings. It's not realistic. And 
if you want low-power, don't use FBDIMM's. It really *is* that simple.

(And yes, maybe FBDIMM controllers in a few years won't use 8 W per 
buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish 
voltage swings at high frequencies.)

Also, on a *truly* idle system, we'll see the power savings whatever we 
do, because the working set will fit in D$, and to get those DRAM power 
savings in reality you need to have the DRAM controller shut down on its 
own anyway (ie sw would only help a bit).

The whole DRAM power story is a bedtime story for gullible children. Don't 
fall for it. It's not realistic. The hardware support for it DOES NOT 
EXIST today, and probably won't for several years. And the real fix is 
elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
is against the whole point of FBDIMM in the first place, but that's what 
you get when you ignore power in the first version!).

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
  2007-03-02  4:18         ` Balbir Singh
@ 2007-03-02  5:13         ` Jeremy Fitzhardinge
  2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 99+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  5:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Andrew Morton, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> Virtualization in general. We don't know what it is - in IBM machines it's 
> a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
> KVM, it's obviously a host Linux kernel/user-process combination.
>
> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.
>
> And in hosts, we usually don't care EITHER, since it's usually done in a 
> hypervisor.
>   

The paravirt_ops patches I just posted implement all the machinery
required to create a pseudo-physical to machine address mapping under
the kernel.  This is used under Xen because it directly exposes the
pagetables to its guests, but there's no reason why you couldn't use
this layer to implement the same mapping without an underlying
hypervisor.  This allows the kernel to see a normal linear "physical"
address space which is in fact mapped over a discontiguous set of
machine ("real physical") pages.

Andrew and I discussed using it for a kdump kernel, so that you could
load it into a random bunch of pages, and set things up so that it sees
itself as being contiguous.

The mapping is pretty simple.  It intercepts __pte (__pmd, etc) to map
the "physical" page to the real machine page, and pte_val does the
reverse mapping.
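
As a toy, self-contained illustration of that mapping (this is not the
actual paravirt_ops code, just the idea):

/*
 * The kernel's linear "physical" frames are backed by scattered
 * machine frames; __pte()-style and pte_val()-style hooks translate
 * in each direction.
 */
#include <stdio.h>

#define PAGE_SHIFT 12
#define OFFSET_MASK ((1UL << PAGE_SHIFT) - 1)
#define NR_FRAMES 4

static unsigned long p2m[NR_FRAMES] = { 7, 2, 9, 4 };	/* guest -> machine */
static unsigned long m2p[16];				/* machine -> guest */

static unsigned long make_pte(unsigned long guest_paddr)	/* ~__pte() */
{
	return (p2m[guest_paddr >> PAGE_SHIFT] << PAGE_SHIFT) |
	       (guest_paddr & OFFSET_MASK);
}

static unsigned long pte_to_guest(unsigned long pte)		/* ~pte_val() */
{
	return (m2p[pte >> PAGE_SHIFT] << PAGE_SHIFT) | (pte & OFFSET_MASK);
}

int main(void)
{
	unsigned long i, guest = (2UL << PAGE_SHIFT) | 0x123;

	for (i = 0; i < NR_FRAMES; i++)
		m2p[p2m[i]] = i;	/* build the reverse map */

	printf("guest %#lx -> machine %#lx -> guest %#lx\n",
	       guest, make_pte(guest), pte_to_guest(make_pte(guest)));
	return 0;
}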

You could implement this today as a fairly simple, thin paravirt_ops
backend.  The main tricky part is making sure all the device drivers are
correct in using bus addresses (which are mapped to real machine
addresses), and that they don't assume that adjacent kernel virtual
pages are physically adjacent.

    J


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:06             ` Nick Piggin
@ 2007-03-02  5:40               ` Christoph Lameter
  2007-03-02  5:49                 ` Nick Piggin
  2007-03-02  5:50               ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

You do not have to deal with TLB entries if you do buffered I/O.

For mmapped I/O you would want to transparently use 2M TLBs if the 
page size is large.

> So you are having problems with your IO controller's handling of sg
> lists?

We currently have problems with the kernel limits of 128 SG 
entries but the fundamental issue is that we can only do 2 Meg of I/O in 
one go given the default limits of the block layer. Typically the number 
of hardware SG entries is also limited. We never will be able to put a 




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:40               ` Christoph Lameter
@ 2007-03-02  5:49                 ` Nick Piggin
  2007-03-02  5:53                   ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  5:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:40:45PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > So what do you mean by efficient? I guess you aren't talking about CPU
> > efficiency, because even if you make the IO subsystem submit larger
> > physical IOs, you still have to deal with 256 billion TLB entries, the
> > pagecache has to deal with 256 billion struct pages, so does the
> > filesystem code to build the bios.
> 
> You do not have to deal with TLB entries if you do buffered I/O.

Where does the data come from?

> For mmapped I/O you would want to transparently use 2M TLBs if the 
> page size is large.
> 
> > So you are having problems with your IO controller's handling of sg
> > lists?
> 
> We currently have problems with the kernel limits of 128 SG 
> entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> one go given the default limits of the block layer. Typically the number 
> of hardware SG entries is also limited. We never will be able to put a 

Seems like changing the default limits would be the easiest way to
fix it then?

As far as hardware limits go, I don't think you need to scale that
number linearly with the amount of memory you have, or even with the
IO throughput. You should reach a point where your command overhead
is amortised sufficiently, and the controller will be pipelining the
commands.
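
For concreteness, the arithmetic behind those limits as a standalone
sketch (the 128-entry figure is the one quoted above; the chunk sizes
are illustrative):

/* Largest single I/O = sg entries x contiguous bytes per entry. */
#include <stdio.h>

int main(void)
{
	unsigned long sg_entries = 128;
	unsigned long chunk[] = { 4096, 16384, 2UL << 20 };
	int i;

	for (i = 0; i < 3; i++)
		printf("%8lu-byte chunks -> %6lu KB per I/O\n",
		       chunk[i], sg_entries * chunk[i] >> 10);
	return 0;
}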


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:06             ` Nick Piggin
  2007-03-02  5:40               ` Christoph Lameter
@ 2007-03-02  5:50               ` Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

Re the page cache: It also needs to be able to handle large page sizes, 
of course. Scanning gazillions of page structs in vmscan.c will make the 
system slow as a dog. The number of page structs needs to be drastically 
reduced for large I/O. I think this can be done by allowing compound 
pages to be handled throughout the VM. The defrag issue then becomes very 
pressing indeed.

We have discussed the idea of going to a kernel with a 2M base page size 
on x86_64, but that step is a bit drastic and the overhead for small 
files would be tremendous.

Support for compound pages already exists in the page allocator and the 
slab allocator. Maybe we could extend that support to the I/O subsystem? 
We would also then have more contiguous writes which will further speed up 
I/O efficiency.
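
The allocator-side piece already looks roughly like this; a minimal
module sketch (not a proposal, just to show where the support sits):

/*
 * One order-9 compound page (2MB on a 4k base page size), using the
 * existing __GFP_COMP support in the page allocator.
 */
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *page;
static const int order = 9;

static int __init comp_init(void)
{
	/* __GFP_COMP ties the 512 constituent pages into one unit. */
	page = alloc_pages(GFP_KERNEL | __GFP_COMP, order);
	if (!page)
		return -ENOMEM;
	printk(KERN_INFO "compound page at pfn %lu, order %d\n",
	       page_to_pfn(page), order);
	return 0;
}

static void __exit comp_exit(void)
{
	__free_pages(page, order);
}

module_init(comp_init);
module_exit(comp_exit);
MODULE_LICENSE("GPL");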


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:11           ` Linus Torvalds
@ 2007-03-02  5:50             ` KAMEZAWA Hiroyuki
  2007-03-02  6:15               ` Paul Mundt
  2007-03-02 16:20             ` Mark Gross
  1 sibling, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-02  5:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, balbir, mel, npiggin, clameter, mingo, jschopp, arjan,
	mbligh, linux-mm, linux-kernel

On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The whole DRAM power story is a bedtime story for gullible children. Don't 
> fall for it. It's not realistic. The hardware support for it DOES NOT 
> EXIST today, and probably won't for several years. And the real fix is 
> elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> is against the whole point of FBDIMM in the first place, but that's what 
> you get when you ignore power in the first version!).
> 

First of all, we have memory hot-add now. So I want to implement hot-removing
of hot-added memory, at least. (In this case, we don't have to write invasive
patches to the memory-init core.)

Our (Fujitsu's) product, an ia64 NUMA server, has a feature to offline memory.
It supports dynamic reconfiguration of nodes, node-hotplug.

But there is no *shipped* firmware for hotplug yet. RHEL4 couldn't boot on
such hotplug-supporting firmware... so the firmware team was not in a hurry.
It will be shipped after RHEL5 comes.
IMHO, firmware which supports memory hot-add is ready to support memory
hot-remove if the OS can handle it.

Note:
I heard embedded people often design their own memory-power-off control on
embedded Linux. (But it never seems to be posted to the list.) I don't know
whether they are interested in generic memory hot-remove or not.

Thanks,
-Kame




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:49                 ` Nick Piggin
@ 2007-03-02  5:53                   ` Christoph Lameter
  2007-03-02  6:08                     ` Nick Piggin
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > You do not have to deal with TLB entries if you do buffered I/O.
> 
> Where does the data come from?

From the I/O controller and from the application.

> > We currently have problems with the kernel limits of 128 SG 
> > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > one go given the default limits of the block layer. Typically the number 
> > of hardware SG entries is also limited. We never will be able to put a 
> 
> Seems like changing the default limits would be the easiest way to
> fix it then?

This would only be a temporary fix pushing the limits to the double or so?
 
> As far as hardware limits go, I don't think you need to scale that
> number linearly with the amount of memory you have, or even with the
> IO throughput. You should reach a point where your command overhead
> is amortised sufficiently, and the controller will be pipelining the
> commands.

Amortized? The controller still would have to hunt down the 4kb page 
pieces that we have to feed him right now. Result: Huge scatter gather 
lists that may themselves create issues with higher page order.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:53                   ` Christoph Lameter
@ 2007-03-02  6:08                     ` Nick Piggin
  2007-03-02  6:19                       ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  6:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:53:42PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > You do not have to deal with TLB entries if you do buffered I/O.
> > 
> > Where does the data come from?
> 
> From the I/O controller and from the application.

Why doesn't the application need to deal with TLB entries?


> > > We currently have problems with the kernel limits of 128 SG 
> > > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > > one go given the default limits of the block layer. Typically the number 
> > > of hardware SG entries is also limited. We never will be able to put a 
> > 
> > Seems like changing the default limits would be the easiest way to
> > fix it then?
> 
> This would only be a temporary fix pushing the limits to the double or so?

And using slightly larger page sizes isn't?

> > As far as hardware limits go, I don't think you need to scale that
> > number linearly with the amount of memory you have, or even with the
> > IO throughput. You should reach a point where your command overhead
> > is amortised sufficiently, and the controller will be pipelining the
> > commands.
> 
> Amortized? The controller still would have to hunt down the 4kb page 
> pieces that we have to feed him right now. Result: Huge scatter gather 
> lists that may themselves create issues with higher page order.

What sort of numbers do you have for these controllers that aren't
very good at doing sg?

Isn't the issue something like your IO controllers have only a
limited number of sg entries, which is fine with 16K pages, but with
4K pages that doesn't give enough data to cover your RAID stripe?

We're never going to do a variable sized pagecache just because of that.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
@ 2007-03-02  6:15               ` Paul Mundt
  2007-03-02 17:01                 ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Paul Mundt @ 2007-03-02  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linus Torvalds, akpm, balbir, mel, npiggin, clameter, mingo,
	jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > EXIST today, and probably won't for several years. And the real fix is 
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > is against the whole point of FBDIMM in the first place, but that's what 
> > you get when you ignore power in the first version!).
> > 
> 
> Note:
> I heard embedded people often design their own memory-power-off control on
> embedded Linux. (But it never seems to be posted to the list.) I don't know
> whether they are interested in generic memory hot-remove or not.
> 
Yes, this is not that uncommon of a thing. People tend to do this in a
couple of different ways, in some cases the system is too loaded to ever
make doing such a thing at run-time worthwhile, and in those cases these
sorts of things tend to be munged in with the suspend code. Unfortunately
it tends to be quite difficult in practice to keep pages in one place,
so people rely on lame chip-select hacks and limiting the amount of
memory that the kernel treats as RAM instead so it never ends up being an
issue. Having some sort of a balance would certainly be nice, though.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:08                     ` Nick Piggin
@ 2007-03-02  6:19                       ` Christoph Lameter
  2007-03-02  6:29                         ` Nick Piggin
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  6:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > From the I/O controller and from the application.
> 
> Why doesn't the application need to deal with TLB entries?

Because it may only operate on a small section of the file and hopefully 
splice the rest through? But yes support for mmapped I/O would be 
necessary.

> > This would only be a temporary fix pushing the limits to the double or so?
> 
> And using slightly larger page sizes isn't?

There was no talk about slightly. 1G page size would actually be quite 
convenient for some applications.

> > Amortized? The controller still would have to hunt down the 4kb page 
> > pieces that we have to feed him right now. Result: Huge scatter gather 
> > lists that may themselves create issues with higher page order.
> 
> What sort of numbers do you have for these controllers that aren't
> very good at doing sg?

Writing a terabyte of memory to disk with handling 256 billion page 
structs? In case of a system with 1 petabyte of memory this may be rather 
typical and necessary for the application to be able to save its state
on disk.

> Isn't the issue something like your IO controllers have only a
> limited number of sg entries, which is fine with 16K pages, but with
> 4K pages that doesn't give enough data to cover your RAID stripe?
> 
> We're never going to do a variable sized pagecache just because of that.

No, we need support for larger page sizes than 16k. 16k has not been fine 
for a couple of years. We only agreed to 16k because that was the common 
consensus. Best performance was always at 64k 4 years ago (but then we 
have no numbers for higher page sizes yet). Now we would prefer much 
larger sizes.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:19                       ` Christoph Lameter
@ 2007-03-02  6:29                         ` Nick Piggin
  2007-03-02  6:51                           ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  6:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:19:48PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > From the I/O controller and from the application.
> > 
> > Why doesn't the application need to deal with TLB entries?
> 
> Because it may only operate on a small section of the file and hopefully 
> splice the rest through? But yes support for mmapped I/O would be 
> necessary.

So you're talking about copying a file from one location to another?


> > > This would only be a temporary fix pushing the limits to the double or so?
> > 
> > And using slightly larger page sizes isn't?
> 
> There was no talk about slightly. 1G page size would actually be quite 
> convenient for some applications.

But it is far from convenient for the kernel. So we have hugepages, so
we can stay out of the hair of those applications and they can stay out
of ours.

> > > Amortized? The controller still would have to hunt down the 4kb page 
> > > pieces that we have to feed him right now. Result: Huge scatter gather 
> > > lists that may themselves create issues with higher page order.
> > 
> > What sort of numbers do you have for these controllers that aren't
> > very good at doing sg?
> 
> Writing a terabyte of memory to disk with handling 256 billion page 
> structs? In case of a system with 1 petabyte of memory this may be rather 
> typical and necessary for the application to be able to save its state
> on disk.

But you will have newer IO controllers, faster CPUs...

Is it a problem or isn't it? Waving around the 256 billion number isn't
impressive because it doesn't really say anything.

> > Isn't the issue something like your IO controllers have only a
> > limited number of sg entries, which is fine with 16K pages, but with
> > 4K pages that doesn't give enough data to cover your RAID stripe?
> > 
> > We're never going to do a variable sized pagecache just because of that.
> 
> No, we need support for larger page sizes than 16k. 16k has not been fine 
> for a couple of years. We only agreed to 16k because that was the common 
> consensus. Best performance was always at 64k 4 years ago (but then we 
> have no numbers for higher page sizes yet). Now we would prefer much 
> larger sizes.

But you are in a tiny minority, so it is not so much a question of what
you prefer, but what you can make do with without being too intrusive.

I understand you have controllers (or maybe it is a block layer limit)
that doesn't work well with 4K pages, but works OK with 16K pages.
This is not something that we would introduce variable sized pagecache
for, surely.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:29                         ` Nick Piggin
@ 2007-03-02  6:51                           ` Christoph Lameter
  2007-03-02  7:03                             ` Andrew Morton
  2007-03-02  7:19                             ` Nick Piggin
  0 siblings, 2 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  6:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > There was no talk about slightly. 1G page size would actually be quite 
> > convenient for some applications.
> 
> But it is far from convenient for the kernel. So we have hugepages, so
> we can stay out of the hair of those applications and they can stay out
> of ours.

Huge pages cannot do I/O so we would get back to the gazillions of pages 
to be handled for I/O. I'd love to have I/O support for huge pages. This 
would address some of the issues.

> > Writing a terabyte of memory to disk with handling 256 billion page 
> > structs? In case of a system with 1 petabyte of memory this may be rather 
> > typical and necessary for the application to be able to save its state
> > on disk.
> 
> But you will have newer IO controllers, faster CPUs...

Sure we will. And you believe that the newer controllers will be able 
to magically shrink the SG lists somehow? We will offload the 
coalescing of the page structs into bios in hardware or some such thing? 
And the vmscans etc too?

> Is it a problem or isn't it? Waving around the 256 billion number isn't
> impressive because it doesn't really say anything.

It is the number of items that needs to be handled by the I/O layer and 
likely by the SG engine.
 
> I understand you have controllers (or maybe it is a block layer limit)
> that doesn't work well with 4K pages, but works OK with 16K pages.

Really? This is the first that I have heard about it.

> This is not something that we would introduce variable sized pagecache
> for, surely.

I am not sure where you get the idea that this is the sole reason why we 
need to be able to handle larger contiguous chunks of memory.

How about coming up with a response to the issue at hand? How do I write 
back 1 Terabyte effectively? Ok this may be an exotic configuration today 
but in one year this may be much more common. Memory sizes keep on 
increasing and so does the number of page structs to be handled for I/O. At 
some point we need a solution here.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:51                           ` Christoph Lameter
@ 2007-03-02  7:03                             ` Andrew Morton
  2007-03-02  7:19                             ` Nick Piggin
  1 sibling, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02  7:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 22:51:00 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> I'd love to have I/O support for huge pages.

direct-IO works.
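
For example, a userspace sketch (the hugetlbfs mount point, file names
and the 2MB huge page size are assumptions):

/* O_DIRECT write issued straight out of a huge-page-backed mapping. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
	int hfd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);
	int fd = open("out.dat", O_CREAT | O_WRONLY | O_DIRECT, 0600);
	void *buf;

	if (hfd < 0 || fd < 0) {
		perror("open");
		return 1;
	}
	buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   hfd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xab, HPAGE_SIZE);	/* fault in the huge page */

	/* A 2MB write from a 2MB-aligned buffer satisfies O_DIRECT. */
	if (write(fd, buf, HPAGE_SIZE) != (ssize_t)HPAGE_SIZE)
		perror("write");

	munmap(buf, HPAGE_SIZE);
	close(fd);
	close(hfd);
	return 0;
}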


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:51                           ` Christoph Lameter
  2007-03-02  7:03                             ` Andrew Morton
@ 2007-03-02  7:19                             ` Nick Piggin
  2007-03-02  7:44                               ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  7:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:51:00PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > There was no talk about slightly. 1G page size would actually be quite 
> > > convenient for some applications.
> > 
> > But it is far from convenient for the kernel. So we have hugepages, so
> > we can stay out of the hair of those applications and they can stay out
> > of ours.
> 
> Huge pages cannot do I/O so we would get back to the gazillions of pages 
> to be handled for I/O. I'd love to have I/O support for huge pages. This 
> would address some of the issues.

Can't direct IO from a hugepage?

> > > Writing a terabyte of memory to disk with handling 256 billion page 
> > > structs? In case of a system with 1 petabyte of memory this may be rather 
> > > typical and necessary for the application to be able to save its state
> > > on disk.
> > 
> > But you will have newer IO controllers, faster CPUs...
> 
> Sure we will. And you believe that the newer controllers will be able 
> to magically shrink the SG lists somehow? We will offload the 
> coalescing of the page structs into bios in hardware or some such thing? 
> And the vmscans etc too?

As far as pagecache page management goes, is that an issue for you?
I don't want to know about how many billions of pages for some operation,
just some profiles.

> > Is it a problem or isn't it? Waving around the 256 billion number isn't
> > impressive because it doesn't really say anything.
> 
> It is the number of items that needs to be handled by the I/O layer and 
> likely by the SG engine.

The number is irrelevant, it is the rate that is important.

> > I understand you have controllers (or maybe it is a block layer limit)
> > that doesn't work well with 4K pages, but works OK with 16K pages.
> 
> Really? This is the first that I have heard about it.
>

Maybe that's the issue you're running into.

> > This is not something that we would introduce variable sized pagecache
> > for, surely.
> 
> I am not sure where you get the idea that this is the sole reason why we 
> need to be able to handle larger contiguous chunks of memory.

I'm not saying that. You brought up this subject of variable sized pagecache.

> How about coming up with a response to the issue at hand? How do I write 
> back 1 Terabyte effectively? Ok this may be an exotic configuration today 
> but in one year this may be much more common. Memory sizes keep on 
> increasing and so does the number of page structs to be handled for I/O. At 
> some point we need a solution here.

Considering you're just handwaving about the actual problems, I
don't know. I assume you're sitting in front of some workload that has
gone wrong, so can't you elaborate?

Eventually, increasing x86 page size a bit might be an idea. We could even
do it in software if CPU manufacturers don't do it for us.

That doesn't buy us a great deal if you think there is this huge looming
problem with struct page management though.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  7:19                             ` Nick Piggin
@ 2007-03-02  7:44                               ` Christoph Lameter
  2007-03-02  8:12                                 ` Nick Piggin
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  7:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Sure we will. And you believe that the newer controllers will be able 
> > to magically shrink the SG lists somehow? We will offload the 
> > coalescing of the page structs into bios in hardware or some such thing? 
> > And the vmscans etc too?
> 
> As far as pagecache page management goes, is that an issue for you?
> I don't want to know about how many billions of pages for some operation,
> just some profiles.

If there are billions of pages in the system and we are allocating and 
deallocating then pages need to be aged. If just a few pages are 
freeable then we run into issues.

> > > I understand you have controllers (or maybe it is a block layer limit)
> > > that doesn't work well with 4K pages, but works OK with 16K pages.
> > Really? This is the first that I have heard about it.
> Maybe that's the issue you're running into.

Oh, I am running into an issue on a system that does not yet exist? I am 
extrapolating from the problems that we commonly see now. Those will get 
worse as memory sizes increase.

> > > This is not something that we would introduce variable sized pagecache
> > > for, surely.
> > I am not sure where you get the idea that this is the sole reason why we 
> > need to be able to handle larger contiguous chunks of memory.
> I'm not saying that. You brought up this subject of variable sized pagecache.

You keep bringing the 4k/16k issue into this for some reason. I want 
just the ability to handle large amounts of memory. Larger page sizes are 
a way to accomplish that.

> Eventually, increasing x86 page size a bit might be an idea. We could even
> do it in software if CPU manufacturers don't do it for us.

A bit? Are we back to the 4k/16k issue? We need to reach 2M at minimum. 
Some way to handle contiguous memory segments of 1GB and larger 
effectively would be great.
  
> That doesn't buy us a great deal if you think there is this huge looming
> problem with struct page management though.

I am not the first one.... See Rik's posts regarding the reasons for his 
new page replacement algorithms.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  7:44                               ` Christoph Lameter
@ 2007-03-02  8:12                                 ` Nick Piggin
  2007-03-02  8:21                                   ` Christoph Lameter
  2007-03-04  1:26                                   ` Rik van Riel
  0 siblings, 2 replies; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  8:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 11:44:05PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > Sure we will. And you believe that the newer controllers will be able 
> > > to magically shrink the SG lists somehow? We will offload the 
> > > coalescing of the page structs into bios in hardware or some such thing? 
> > > And the vmscans etc too?
> > 
> > As far as pagecache page management goes, is that an issue for you?
> > I don't want to know about how many billions of pages for some operation,
> > just some profiles.
> 
> If there are billions of pages in the system and we are allocating and 
> deallocating then pages need to be aged. If just a few pages are 
> freeable then we run into issues.

page writeout and vmscan don't work too badly. What are the issues?

> > > > I understand you have controllers (or maybe it is a block layer limit)
> > > > that doesn't work well with 4K pages, but works OK with 16K pages.
> > > Really? This is the first that I have heard about it.
> > Maybe that's the issue you're running into.
> 
> Oh, I am running into an issue on a system that does not yet exist? I am 
> extrapolating from the problems that we commonly see now. Those will get 
> worse as memory sizes increase.

So what problems do you commonly see now? Some of us here don't
have 4TB of memory, so you actually have to tell us ;)

> > > > This is not something that we would introduce variable sized pagecache
> > > > for, surely.
> > > I am not sure where you get the idea that this is the sole reason why we 
> > > need to be able to handle larger contiguous chunks of memory.
> > I'm not saying that. You brought up this subject of variable sized pagecache.
> 
> You keep bringing the 4k/16k issue into this for some reason. I want 
> just the ability to handle large amounts of memory. Larger page sizes are 
> a way to accomplish that.

As I said in my other mail to you, Linux runs on systems with 6 orders
of magnitude more struct pages than when it was first created. What's
the problem?

> > Eventually, increasing x86 page size a bit might be an idea. We could even
> > do it in software if CPU manufacturers don't for us.
> 
> A bit? Are we back to the 4k/16k issue? We need to reach 2M at minimum. 
> Some way to handle contiguous memory segments of 1GB and larger 
> effectively would be great.

How did you come up with that 2MB number?

Anyway, we have hugetlbfs for things like that.

> > That doesn't buy us a great deal if you think there is this huge looming
> > problem with struct page management though.
> 
> I am not the first one.... See Rik's posts regarding the reasons for his 
> new page replacement algorithms.

Different issue, isn't it? Rik wants to be smarter in figuring out which
pages to throw away. More work per page == worse for you.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:12                                 ` Nick Piggin
@ 2007-03-02  8:21                                   ` Christoph Lameter
  2007-03-02  8:38                                     ` Nick Piggin
  2007-03-04  1:26                                   ` Rik van Riel
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02  8:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > If there are billions of pages in the system and we are allocating and 
> > deallocating then pages need to be aged. If just a few pages are 
> > freeable then we run into issues.
> 
> page writeout and vmscan don't work too badly. What are the issues?

Slowdowns up to livelocks with large memory configurations.

> So what problems do you commonly see now? Some of us here don't
> have 4TB of memory, so you actually have to tell us ;)

Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
allocs and frees and you will see it too. F.e. try Linus' tree and mlock 
a large portion of the memory and then see the fun starting. See also 
Rik's list of pathological cases on this.
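
The reproduction, as a standalone sketch (sizes arbitrary; mlock needs
root or CAP_IPC_LOCK):

/* Pin a large chunk so reclaim is left with few freeable pages, then
 * generate alloc/free load elsewhere and watch vmscan struggle. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1024;
	size_t len = mb << 20;
	void *p = malloc(len);

	/* mlock faults in and pins every page of the region. */
	if (!p || mlock(p, len)) {
		perror("mlock");
		return 1;
	}
	printf("locked %zu MB; now run an alloc/free heavy workload\n", mb);
	pause();
	return 0;
}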

> How did you come up with that 2MB number?

Huge page size. The only basic choice on x86_64.

> Anyway, we have hugetlbfs for things like that.

Good to know that direct io works.

> > I am not the first one.... See Rik's posts regarding the reasons for his 
> > new page replacement algorithms.
> 
> Different issue, isn't it? Rik wants to be smarter in figuring out which
> pages to throw away. More work per page == worse for you.

Rik is trying to solve the same issue in a different way. He is trying to 
manage gazillions of entries better instead of reducing the entries to be 
managed. That can only work in a limited way. Drastic reductions in the 
entries to be managed have good effects in multiple ways. Reduce 
management overhead, improve I/O throughput, etc etc.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:21                                   ` Christoph Lameter
@ 2007-03-02  8:38                                     ` Nick Piggin
  2007-03-02 17:09                                       ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-02  8:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 12:21:49AM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > If there are billions of pages in the system and we are allocating and 
> > > deallocating then pages need to be aged. If there are just few pages 
> > > freeable then we run into issues.
> > 
> > page writeout and vmscan don't work too badly. What are the issues?
> 
> Slowdowns up to livelocks with large memory configurations.
> 
> > So what problems do you commonly see now? Some of us here don't
> > have 4TB of memory, so you actually have to tell us ;)
> 
> Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
> allocs and frees and you will see it too. F.e. try Linus' tree and mlock 
> a large portion of the memory and then see the fun starting. See also 
> Rik's list of pathological cases on this.

Ah, so your problem is lots of unreclaimable pages. There are heaps
of things we can try to reduce the rate at which we scan those.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  1:52 ` Bill Irwin
@ 2007-03-02 10:38   ` Mel Gorman
  2007-03-02 16:31     ` Joel Schopp
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2007-03-02 10:38 UTC (permalink / raw)
  To: Bill Irwin
  Cc: akpm, npiggin, clameter, mingo, Joel Schopp, arjan, torvalds,
	mbligh, Linux Memory Management List, Linux Kernel Mailing List

On Thu, 1 Mar 2007, Bill Irwin wrote:

> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> These are figures based on kernels patched with Andy Whitcroft's reclaim
>> patches. You will see that the zone-based kernel is getting success rates
>> closer to 40% as one would expect although there is still something amiss.
>
> Yes, combining the two should do at least as well as either in
> isolation. Are there videos of each of the two in isolation?

Yes. Towards the end of the mail, I give links to all of the images like 
this for example;

elm3b14-vanilla       http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-vanilla.avi
elm3b14-list-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-listbased.avi
elm3b14-zone-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-zonebased.avi
elm3b14-combined      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-combined.avi

In the zone-based figures, there are pages that could be reclaimed, but 
they are ignored by page reclaim because the watermarks are satisfied.
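
To make the mechanism concrete, here is a minimal sketch (illustrative
names, not the kernel's actual structures): reclaim only kicks in while a
zone's free page count is below its watermark, so reclaimable pages above
that point are simply never scanned.

    /* Illustrative only; not the kernel's real data structures. */
    struct zone_sketch {
            unsigned long free_pages;
            unsigned long pages_low;        /* reclaim watermark */
    };

    static int zone_needs_reclaim(const struct zone_sketch *z)
    {
            /* Above the watermark, reclaimable pages are ignored. */
            return z->free_pages < z->pages_low;
    }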

> Maybe that
> would give someone insight into what's happening.
>
>
> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> Kernbench Total CPU Time
>
> Oh dear. How do the other benchmarks look?
>

Tell me what other figures you would like to see and I'll generate them. 
Often, kernbench is all people look at for this type of thing.

"Oh dear" implies you think the figures are bad. But on ppc64 and x86_64 
at least, the total CPU times are slightly lower with both 
anti-fragmentation patches - that's not bad. On NUMA-Q (which no one uses 
any more or is even sold), it's very marginally slower.

These are the AIM9 figures I have

AIM9 Results
                                      Vanilla      List-based   Zone-based   Combined
Machine     Arch        Test          Kernel       Kernel       Kernel       Kernel
                                      (Seconds)    (Seconds)    (Seconds)    (Seconds)
---------   ---------   ---------     -----------  -----------  -----------  -----------
elm3b14     x86-numaq   page_test       115108.30    112955.68    109773.37    108073.65
elm3b14     x86-numaq   brk_test        520593.14    494251.92    496801.07    488141.24
elm3b14     x86-numaq   fork_test         2007.99      2005.66      2011.00      1986.35
elm3b14     x86-numaq   exec_test           57.11        57.15        57.27        57.01
elm3b245    x86_64      page_test       220490.00    218166.67    224371.67    224164.31
elm3b245    x86_64      brk_test       2178186.97   2337110.48   3025495.75   2445733.33
elm3b245    x86_64      fork_test         4854.19      4957.51      4900.03      5001.67
elm3b245    x86_64      exec_test          194.55       196.30       195.55       195.90
gekko-lp1   ppc64       page_test       300368.27    310651.56    300673.33    308720.00
gekko-lp1   ppc64       brk_test       1328895.18   1403448.85   1431489.50   1408263.91
gekko-lp1   ppc64       fork_test         3374.42      3395.00      3367.77      3396.64
gekko-lp1   ppc64       exec_test          152.87       153.12       151.92       153.39
gekko-lp4   ppc64       page_test       291643.06    306906.67    294872.52    303796.03
gekko-lp4   ppc64       brk_test       1322946.18   1366572.24   1378470.25   1403116.15
gekko-lp4   ppc64       fork_test         3326.11      3335.00      3315.56      3333.33
gekko-lp4   ppc64       exec_test          149.01       149.90       149.48       149.87

Many of these are showing performance improvements as well, not 
regressions.

>
> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> The patches go a long way to making sure that high-order allocations work
>> and particularly that the hugepage pool can be resized once the system has
>> been running. With the clustering of high-order atomic allocations, I have
>> some confidence that allocating contiguous jumbo frames will work even with
>> loads performing lots of IO. I think the videos show how the patches actually
>> work in the clearest possible manner.
>> I am of the opinion that both approaches have their advantages and
>> disadvantages. Given a choice between the two, I prefer list-based
>> because of its flexibility and it should also help high-order kernel 
>> allocations. However, by applying both, the disadvantages of list-based are
>> covered and there still appears to be no performance loss as a result. Hence,
>> I'd like to see both merged.  Any opinion on merging these patches into -mm
>> for wider testing?
>
> Exhibiting a workload where the list patch breaks down and the zone
> patch rescues it might help if it's felt that the combination isn't as
> good as lists in isolation. I'm sure one can be dredged up somewhere.

I can't think of a workload that totally makes a mess out of list-based. 
However, list-based makes no guarantees on availability. If a system 
administrator knows they need between 10,000 and 100,000 huge pages and 
doesn't want to waste memory pinning too many huge pages at boot-time, the 
zone-based mechanism would be what they wanted.
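
For illustration, resizing the hugepage pool at runtime goes through
/proc/sys/vm/nr_hugepages; a minimal sketch (the helper name is made up,
error handling kept to a minimum):

    /* Sketch: request `nr` huge pages at runtime. Whether the kernel
     * can actually satisfy the request is exactly what these
     * anti-fragmentation patches are meant to improve. */
    #include <stdio.h>

    int set_nr_hugepages(long nr)
    {
            FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

            if (!f)
                    return -1;
            fprintf(f, "%ld\n", nr);
            return fclose(f);
    }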

> Either that or someone will eventually spot why the combination doesn't
> get as many available maximally contiguous regions as the list patch.
> By and large I'm happy to see anything go in that inches hugetlbfs
> closer to a backward compatibility wrapper over ramfs.
>

Good to hear

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
                     ` (3 preceding siblings ...)
  2007-03-02  3:05   ` Christoph Lameter
@ 2007-03-02 13:50   ` Arjan van de Ven
  2007-03-02 15:29   ` Rik van Riel
  5 siblings, 0 replies; 99+ messages in thread
From: Arjan van de Ven @ 2007-03-02 13:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 2007-03-01 at 16:09 -0800, Andrew Morton wrote:
> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).


Hi,

The RSS thing is... funky.
I'm saying that because we have not been able to define what RSS means,
so that needs solving before we expand how RSS is used.

This is relevant for the pagetable sharing patches: if RSS can exclude
shared pages, they're relatively easy. If RSS always has to include shared
pages, we currently have a problem, because hugepages aren't part of RSS
right now.

I would really, really, really like to see this lack of clarity sorted out
at the concept level before going through massive changes in the code based
on something so fundamentally unclear.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:09 ` Andrew Morton
                     ` (4 preceding siblings ...)
  2007-03-02 13:50   ` Arjan van de Ven
@ 2007-03-02 15:29   ` Rik van Riel
  2007-03-02 16:58     ` Andrew Morton
  5 siblings, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

Andrew Morton wrote:

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

The RSS bits really worry me, since it looks like they could
exacerbate the scalability problems that we are already running
into on very large memory systems.

Linux is *not* happy on 256GB systems.  Even on some 32GB systems
the swappiness setting *needs* to be tweaked before Linux will even
run in a reasonable way.

Pageout scanning needs to be more efficient, not less.  The RSS
bits are worrisome...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
@ 2007-03-02 16:20             ` Mark Gross
  2007-03-02 17:07               ` Andrew Morton
  2007-03-02 17:16               ` Linus Torvalds
  1 sibling, 2 replies; 99+ messages in thread
From: Mark Gross @ 2007-03-02 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:11:58PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> >
> > On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > In other words, I really don't see a huge upside. I see *lots* of 
> > > downsides, but upsides? Not so much. Almost everybody who wants unplug 
> > > wants virtualization, and right now none of the "big virtualization" 
> > > people would want to have kernel-level anti-fragmentation anyway since 
> > > they'd need to do it on their own.
> > 
> > Agree with all that, but you're missing the other application: power
> > saving.  FBDIMMs take eight watts a pop.
> 
> This is a hardware problem. Let's see how long it takes for Intel to 
> realize that FBDIMM's were a hugely bad idea from a power perspective.
> 
> Yes, the same issues exist for other DRAM forms too, but to a *much* 
> smaller degree.

DDR3-1333 may be better than FBDIMM's but don't count on it being much
better.

> 
> Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> everybody wants bank interleaving, because it's a huge performance win on 
> many loads. That, in turn, means that your memory will be spread out over 
> multiple DIMM's even for a single page, much less any bigger area.

4-way interleave across banks may not be as common as you might think
on future chipsets.  2-way interleave across DIMMs within a bank will
stay.

Also, the performance gains between 2-way and 4-way interleave have been
shown to be hard to measure.  It may be counterintuitive, but it's not
the huge performance win you might expect.  At least, some of the test
cases I've seen reported showed it to be under the noise floor of the
lmbench test cases.


> 
> In other words - forget about DRAM power savings. It's not realistic. And 
> if you want low-power, don't use FBDIMM's. It really *is* that simple.
>

DDR3-1333 won't be much better.  

> (And yes, maybe FBDIMM controllers in a few years won't use 8 W per 
> buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish 
> voltage swings at high frequencies.)
> 
> Also, on a *truly* idle system, we'll see the power savings whatever we 
> do, because the working set will fit in D$, and to get those DRAM power 
> savings in reality you need to have the DRAM controller shut down on its 
> own anyway (ie sw would only help a bit).
> 
> The whole DRAM power story is a bedtime story for gullible children. Don't 
> fall for it. It's not realistic. The hardware support for it DOES NOT 
> EXIST today, and probably won't for several years. And the real fix is 
> elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> is against the whole point of FBDIMM in the first place, but that's what 
> you get when you ignore power in the first version!).
>

Hardware support for some of this is coming this year in the ATCA space
on the MPCBL0050.  The feature is a bit experimental, and
power/performance benefits will be workload and configuration
dependent.  It's not a bedtime story.

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 10:38   ` Mel Gorman
@ 2007-03-02 16:31     ` Joel Schopp
  2007-03-02 21:37       ` Bill Irwin
  0 siblings, 1 reply; 99+ messages in thread
From: Joel Schopp @ 2007-03-02 16:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Bill Irwin, akpm, npiggin, clameter, mingo, arjan, torvalds,
	mbligh, Linux Memory Management List, Linux Kernel Mailing List

>> Exhibiting a workload where the list patch breaks down and the zone
>> patch rescues it might help if it's felt that the combination isn't as
>> good as lists in isolation. I'm sure one can be dredged up somewhere.
> 
> I can't think of a workload that totally makes a mess out of list-based. 
> However, list-based makes no guarantees on availability. If a system 
> administrator knows they need between 10,000 and 100,000 huge pages and 
> doesn't want to waste memory pinning too many huge pages at boot-time, 
> the zone-based mechanism would be what he wanted.

From our testing with earlier versions of list-based for memory hot-unplug on 
pSeries machines, we were able to hot-unplug huge amounts of memory after running the 
nastiest workloads we could find for over a week.  Without the patches, we were unable 
to hot-unplug anything within minutes of running the same workloads.

If something works for 99.999% of people (list-based) and there is an easy way to 
configure it for the other 0.001% ("zone"-based), I call that a great 
solution.  I really don't understand what the resistance to these patches is.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 15:29   ` Rik van Riel
@ 2007-03-02 16:58     ` Andrew Morton
  2007-03-02 17:09       ` Mel Gorman
  2007-03-02 17:23       ` Christoph Lameter
  0 siblings, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 16:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 10:29:58 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > And I'd judge that per-container RSS limits are of considerably more value
> > than antifrag (in fact per-container RSS might be a superset of antifrag,
> > in the sense that per-container RSS and containers could be abused to fix
> > the i-cant-get-any-hugepages problem, dunno).
> 
> The RSS bits really worry me, since it looks like they could
> exacerbate the scalability problems that we are already running
> into on very large memory systems.

Using a zone-per-container or N-64MB-zones-per-container should actually
move us in the direction of *fixing* any such problems.  Because, to a
first-order, the scanning of such a zone has the same behaviour as a 64MB
machine.

(We'd run into a few other problems, some related to the globalness of the
dirty-memory management, but that's fixable).

> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> the swappiness setting *needs* to be tweaked before Linux will even
> run in a reasonable way.

Please send testcases.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:44   ` Linus Torvalds
  2007-03-02  1:52     ` Balbir Singh
@ 2007-03-02 16:58     ` Mel Gorman
  2007-03-02 17:05     ` Joel Schopp
  2 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2007-03-02 16:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, npiggin, clameter, mingo, jschopp, arjan, mbligh,
	linux-mm, linux-kernel

On (01/03/07 16:44), Linus Torvalds didst pronounce:
> 
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> > 
> > So some urgent questions are: how are we going to do mem hotunplug and
> > per-container RSS?
> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.
> 

I sent a mail out with a fairly detailed treatment of how RSS could be done.
Essentially, I feel that containers should simply limit the number of
pages used by the container, and not try and do anything magic with a
poorly defined concept like RSS. It would do this by creating a
"software zone" and taking pages from a "hardware zone" at creation
time. It has a similar effect to RSS limits, except it's better defined.

In that setup, a virtualized environment would create its own software
zone. It would hand that over to the guest OS and the guest OS could do
whatever it liked. It would be responsible for its own reclaim and so on
and not have to worry about other containers (or virtualized environments
for that matter) or kswapd interfering with it.
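
As a purely illustrative sketch (these structures are hypothetical, not
from any posted patch), a software zone would record the hardware zone it
came from and a page limit fixed at creation time:

    #include <linux/list.h>

    struct zone;                        /* the normal hardware zone */

    /* Hypothetical container zone; illustrative only. */
    struct software_zone {
            struct zone *parent;        /* hardware zone pages came from */
            unsigned long nr_pages;     /* limit fixed at creation time */
            struct list_head lru;       /* the container reclaims only
                                         * here; kswapd never touches it */
    };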

> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.
> 
> 		Linus

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:15               ` Paul Mundt
@ 2007-03-02 17:01                 ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2007-03-02 17:01 UTC (permalink / raw)
  To: Paul Mundt, KAMEZAWA Hiroyuki, Linus Torvalds, akpm, balbir,
	npiggin, clameter, mingo, jschopp, arjan, mbligh, linux-mm,
	linux-kernel

On (02/03/07 15:15), Paul Mundt didst pronounce:
> On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > > EXIST today, and probably won't for several years. And the real fix is 
> > > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > > is against the whole point of FBDIMM in the first place, but that's what 
> > > you get when you ignore power in the first version!).
> > > 
> > 
> > Note:
> > I heard embeded people often designs their own memory-power-off control on
> > embeded Linux. (but it never seems to be posted to the list.) But I don't know
> > they are interested in generic memory hotremove or not.
> > 
Yes, this is not that uncommon a thing. People tend to do this in a
couple of different ways. In some cases the system is too loaded to ever
make doing such a thing at run-time worthwhile, and in those cases these
sorts of things tend to be munged in with the suspend code. Unfortunately,
it tends to be quite difficult in practice to keep pages in one place,
so people instead rely on lame chip-select hacks and on limiting the
amount of memory that the kernel treats as RAM, so it never ends up being
an issue. Having some sort of a balance would certainly be nice, though.

If the range of memory you want to offline is MAX_ORDER_NR_PAGES,
anti-fragmentation should group the pages you can reclaim into chunks of
that size. It might reduce the number of hacks you have to perform to
limit where the kernel uses memory.
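
For reference, the size of such a chunk with the common defaults
(assuming MAX_ORDER of 11 and 4K pages; both are configurable):

    #define MAX_ORDER           11
    #define MAX_ORDER_NR_PAGES  (1UL << (MAX_ORDER - 1))  /* 1024 pages */
    #define PAGE_SIZE           4096UL

    /* 1024 pages * 4K = 4MB per MAX_ORDER block */
    unsigned long chunk_bytes = MAX_ORDER_NR_PAGES * PAGE_SIZE;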

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  0:44   ` Linus Torvalds
  2007-03-02  1:52     ` Balbir Singh
  2007-03-02 16:58     ` Mel Gorman
@ 2007-03-02 17:05     ` Joel Schopp
  2007-03-05  3:21       ` Nick Piggin
  2 siblings, 1 reply; 99+ messages in thread
From: Joel Schopp @ 2007-03-02 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, arjan,
	mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
>> So some urgent questions are: how are we going to do mem hotunplug and
>> per-container RSS?

The people who were trying to do memory hot-unplug basically all stopped, waiting for 
these patches, or something similar, to solve the fragmentation problem.  Our last 
working set of patches built on top of an earlier version of Mel's list-based solution.

> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.

Yes, we are.  And we are very much in favor of these patches.  At last year's OLS, 
developers from IBM, HP, and Xen coauthored a paper titled "Resizing Memory with 
Balloons and Hotplug".  http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf  Our 
conclusion was that ballooning is simply not good enough and we need memory 
hot-unplug.  Here is a quote from the article I find relevant to today's discussion:

"Memory Hotplug remove is not in mainline.
Patches exist, released under the GPL, but are
only occasionally rebased. To be worthwhile
the existing patches would need either a remappable
kernel, which remains highly doubtful, or
a fragmentation avoidance strategy to keep migrateable
and non-migrateable pages clumped
together nicely."

At IBM all of our Power4, Power5, and future hardware supports a lot of 
virtualization features.  This hardware took "Best Virtualization Solution" at 
LinuxWorld Expo, so we aren't talking research projects here. 
http://www-03.ibm.com/press/us/en/pressrelease/20138.wss

> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.

I believe you are talking about the zSeries (aka mainframe) because the rest of IBM 
needs these patches.  zSeries built their whole processor instruction set, memory 
model, etc around their form of virtualization, and I doubt the rest of us are going 
to change our processor instruction set that drastically.  I've had a lot of talks 
with Martin Schwidefsky (the maintainer of Linux on zSeries) about how we could do 
more of what they do and the basic answer is we can't because what they do is so 
fundamentally incompatible.

While I appreciate that we should all dump our current hardware and buy mainframes, it 
seems to me that an easier solution is to take a few patches from Mel and work with 
the hardware we already have.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:20             ` Mark Gross
@ 2007-03-02 17:07               ` Andrew Morton
  2007-03-02 17:35                 ` Mark Gross
  2007-03-02 17:16               ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 17:07 UTC (permalink / raw)
  To: mgross
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 08:20:23 -0800 Mark Gross <mgross@linux.intel.com> wrote:

> > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > EXIST today, and probably won't for several years. And the real fix is 
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > is against the whole point of FBDIMM in the first place, but that's what 
> > you get when you ignore power in the first version!).
> >
> 
> Hardware support for some of this is coming this year in the ATCA space
> on the MPCBL0050.  The feature is a bit experimental, and
> power/performance benefits will be workload and configuration
> dependent.  Its not a bed time story.

What is the plan for software support?

Will it be possible to just power the DIMMs off?  I don't see much point in
some half-power non-destructive mode.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:38                                     ` Nick Piggin
@ 2007-03-02 17:09                                       ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
> > allocs and frees and you will see it too. F.e. try Linus' tree and mlock 
> > a large portion of the memory and then see the fun starting. See also 
> > Rik's list of pathological cases on this.
> 
> Ah, so your problem is lots of unreclaimable pages. There are heaps
> of things we can try to reduce the rate at which we scan those.

Well, this is one possible symptom of the basic issue of having too many 
page structs. I wonder how long we can patch things up.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:58     ` Andrew Morton
@ 2007-03-02 17:09       ` Mel Gorman
  2007-03-02 17:23       ` Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2007-03-02 17:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On (02/03/07 08:58), Andrew Morton didst pronounce:
> On Fri, 02 Mar 2007 10:29:58 -0500 Rik van Riel <riel@redhat.com> wrote:
> 
> > Andrew Morton wrote:
> > 
> > > And I'd judge that per-container RSS limits are of considerably more value
> > > than antifrag (in fact per-container RSS might be a superset of antifrag,
> > > in the sense that per-container RSS and containers could be abused to fix
> > > the i-cant-get-any-hugepages problem, dunno).
> > 
> > The RSS bits really worry me, since it looks like they could
> > exacerbate the scalability problems that we are already running
> > into on very large memory systems.
> 
> Using a zone-per-container or N-64MB-zones-per-container should actually
> move us in the direction of *fixing* any such problems.  Because, to a
> first-order, the scanning of such a zone has the same behaviour as a 64MB
> machine.
> 

Quite possibly. Taking software zones from the other large mail I sent,
one could get the 64MB effect by increasing MAX_ORDER_NR_PAGES to be 64MB
worth of pages. To avoid external fragmentation issues, I'd of course
prefer if these container zones consisted mainly of contiguous memory;
with anti-fragmentation, that would be possible.
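
The arithmetic for that, assuming 4K pages:

    /* 64MB / 4K = 16384 pages = 1 << 14, i.e. a MAX_ORDER_NR_PAGES of
     * 16384 and therefore a MAX_ORDER of 15 instead of the default 11. */
    unsigned long pages_per_chunk = (64UL << 20) / 4096;   /* 16384 */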

> (We'd run into a few other problems, some related to the globalness of the
> dirty-memory management, but that's fixable).
> 

It would be fixable, especially if containers do their own reclaim on their
container zones rather than leaving it to kswapd. Writing dirty data back
periodically would still need to be global in nature, but that's no
different from today.

> > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > the swappiness setting *needs* to be tweaked before Linux will even
> > run in a reasonable way.
> 
> Please send testcases.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:20             ` Mark Gross
  2007-03-02 17:07               ` Andrew Morton
@ 2007-03-02 17:16               ` Linus Torvalds
  2007-03-02 18:45                 ` Mark Gross
  2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2007-03-02 17:16 UTC (permalink / raw)
  To: Mark Gross
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel


On Fri, 2 Mar 2007, Mark Gross wrote:
> > 
> > Yes, the same issues exist for other DRAM forms too, but to a *much* 
> > smaller degree.
> 
> DDR3-1333 may be better than FBDIMM's but don't count on it being much
> better.

Hey, fair enough. But it's not a problem (and it doesn't have a solution) 
today. I'm not sure it's going to have a solution tomorrow either.

> > Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> > everybody wants bank interleaving, because it's a huge performance win on 
> > many loads. That, in turn, means that your memory will be spread out over 
> > multiple DIMM's even for a single page, much less any bigger area.
> 
> 4-way interleave across banks on systems may not be as common as you may
> think for future chip sets.  2-way interleave across DIMMs within a bank
> will stay.

.. and think about a realistic future.

EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
today, but in the one- to two-year timeframe even Intel will.

What does that mean? It means that in bigger systems, you will no longer 
even *have* 8 or 16 banks where turning off a few banks makes sense. 
You'll quite often have just a few DIMM's per die, because that's what you 
want for latency. Then you'll have CSI or HT or another interconnect.

And with a few DIMM's per die, you're back where even just 2-way 
interleaving basically means that in order to turn off your DIMM, you 
probably need to remove HALF the memory for that CPU.

In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
CHILDREN.

There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
nobody actually does it in practice, and nobody even knows if it's going 
to be viable (yes, DRAM takes energy, but trying to keep memory free will 
likely waste power *too*, and I doubt anybody has any real idea of how 
much any of this would actually help in practice).

And I don't think that will change. See above. The future is *not* moving 
towards more and more DIMMS. Quite the reverse. On workstations, we are 
currently in the "one or two DIMM's per die". Do you really think that 
will change? Hell no. And in big servers, pretty much everybody agrees 
that we will move towards that, rather than away from it.

So:
 - forget about turning DIMM's off. There is *no* actual data supporting 
   the notion that it's a good idea today, and I seriously doubt you can 
   really argue that it will be a good idea in five or ten years. It's a 
   hardware hack for a hardware problem, and the problems are way too 
   complex for us to solve in time for the solution to be relevant.

 - aim for NUMA memory allocation and turning off whole *nodes*. That's 
   much more likely to be productive in the longer timeframe. And yes, we 
   may well want to do memory compaction for that too, but I suspect that 
   the issues are going to be different (ie the way to do it is to simply 
   prefer certain nodes for certain allocations, and then try to keep the 
   jobs that you know can be idle on other nodes)

Do you actually have real data supporting the notion that turning DIMM's 
off will be reasonable and worthwhile? 

			Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:58     ` Andrew Morton
  2007-03-02 17:09       ` Mel Gorman
@ 2007-03-02 17:23       ` Christoph Lameter
  2007-03-02 17:35         ` Andrew Morton
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Andrew Morton wrote:

> > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > the swappiness setting *needs* to be tweaked before Linux will even
> > run in a reasonable way.
> 
> Please send testcases.

It is not happy if you put 256GB into one zone. We are fine with 1k nodes 
of 8GB each and a 16k page size (which reduces the number of 
page_structs to manage to a fourth). So the total memory is 8TB, which is 
significantly larger than 256GB.
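
The struct page arithmetic behind that, as a worked example:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long gb = 1ULL << 30;

            /* one flat 256GB zone with 4K pages */
            printf("%llu\n", 256 * gb / 4096);   /* 67108864 page structs */
            /* one 8GB node with 16K pages */
            printf("%llu\n", 8 * gb / 16384);    /* 524288 page structs */
            return 0;
    }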

If we do this node/zone merging and reassign MAX_ORDER blocks to virtual 
node/zones for containers (with their own LRU etc) then this would also 
reduce the number of page_structs on the list and may make things a bit 
easier.

We would then produce the same effect as the partitioning via NUMA nodes 
on our 8TB boxes. However, you still have a bandwidth issue, since 
your 256GB box likely only has a single bus and all memory traffic for the 
node/zones has to go through this single bottleneck. That bottleneck does 
not exist on NUMA machines.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:23       ` Christoph Lameter
@ 2007-03-02 17:35         ` Andrew Morton
  2007-03-02 17:43           ` Rik van Riel
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 17:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
> > > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > > the swappiness setting *needs* to be tweaked before Linux will even
> > > run in a reasonable way.
> > 
> > Please send testcases.
> 
> It is not happy if you put 256GB into one zone.

Oh come on.  What's the workload?  What happens?  system time?  user time?
kernel profiles?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:07               ` Andrew Morton
@ 2007-03-02 17:35                 ` Mark Gross
  2007-03-02 18:02                   ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Mark Gross @ 2007-03-02 17:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 09:07:53AM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 08:20:23 -0800 Mark Gross <mgross@linux.intel.com> wrote:
> 
> > > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > > EXIST today, and probably won't for several years. And the real fix is 
> > > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > > is against the whole point of FBDIMM in the first place, but that's what 
> > > you get when you ignore power in the first version!).
> > >
> > 
> > Hardware support for some of this is coming this year in the ATCA space
> > on the MPCBL0050.  The feature is a bit experimental, and
> > power/performance benefits will be workload and configuration
> > dependent.  Its not a bed time story.
> 
> What is the plan for software support?

The plan is the typical layered approach to enabling.  Post the basic
enabling patch, followed by a patch or software to actually exercise the
feature.

The code to exercise the feature is complicated by the fact that the
memory will need re-training as it comes out of low power state.  The
code doing this is still a bit confidential.

I have the base enabling patch ready for RFC review.
I'm working on the RFC now.

> 
> Will it be possible to just power the DIMMs off?  I don't see much point in
> some half-power non-destructive mode.

I think so, but need to double check with the HW folks.

Technically, the DIMMs could be powered off, or put into two different
low-power non-destructive states (standby and suspend). Putting them
in a low-power non-destructive mode has much less latency and provides
good bang for the buck relative to the lines of code needed to make it
work.

Which lower-power mode an application chooses will depend on the latency
tolerances of the app.  For the POC activities we are looking at, we are
targeting the lower-latency option, but that doesn't lock out folks from
trying to do something with the other options.

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:35         ` Andrew Morton
@ 2007-03-02 17:43           ` Rik van Riel
  2007-03-02 18:06             ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 17:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:
> 
>> On Fri, 2 Mar 2007, Andrew Morton wrote:
>>
>>>> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
>>>> the swappiness setting *needs* to be tweaked before Linux will even
>>>> run in a reasonable way.
>>> Please send testcases.
>> It is not happy if you put 256GB into one zone.
> 
> Oh come on.  What's the workload?  What happens?  system time?  user time?
> kernel profiles?

I can't share all the details, since a lot of the problems are customer
workloads.

One particular case is a 32GB system with a database that takes most
of memory.  The amount of actually freeable page cache memory is in
the hundreds of MB.   With swappiness at the default level of 60, kswapd
ends up eating most of a CPU, and other tasks also dive into the pageout
code.  Even with swappiness as high as 98, that system still has
problems with the CPU use in the pageout code!

Another typical problem is that people want to back up their database
servers.  During the backup, parts of the working set get evicted from
the VM and performance is horrible.

A third scenario is where a system has way more RAM than swap, and not
a whole lot of freeable page cache.  In this case, the VM ends up
spending WAY too much CPU time scanning and shuffling around essentially
unswappable anonymous memory and tmpfs files.

I have briefly characterized some of these working sets on:

http://linux-mm.org/ProblemWorkloads

One thing I do not yet have is easily runnable test cases.  I know
about the problems that happen because customers run into them, but they
are not as easy to reproduce on test systems...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:35                 ` Mark Gross
@ 2007-03-02 18:02                   ` Andrew Morton
  2007-03-02 19:02                     ` Mark Gross
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 18:02 UTC (permalink / raw)
  To: mgross
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 09:35:27 -0800
Mark Gross <mgross@linux.intel.com> wrote:

> > 
> > Will it be possible to just power the DIMMs off?  I don't see much point in
> > some half-power non-destructive mode.
> 
> I think so, but need to double check with the HW folks.
> 
> Technically, the dims could be powered off, and put into 2 different low
> power non-destructive states.  (standby and suspend), but putting them
> in a low power non-destructive mode has much less latency and provides
> good bang for the buck or LOC change needed to make work.
> 
> Which lower power mode an application chooses will depend on latency
> tolerances of the app.  For the POC activities we are looking at we are
> targeting the lower latency option, but that doesn't lock out folks from
> trying to do something with the other options.
> 

If we don't evacuate all live data from all of the DIMM, we'll never be
able to power the thing down in many situations.

Given that we _have_ emptied the DIMM, we can just turn it off.  And
refilling it will be slow - often just disk speed.

So I don't see a useful use-case for non-destructive states.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:43           ` Rik van Riel
@ 2007-03-02 18:06             ` Andrew Morton
  2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 20:59               ` Bill Irwin
  0 siblings, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 18:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 12:43:42 -0500
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:
> > 
> >> On Fri, 2 Mar 2007, Andrew Morton wrote:
> >>
> >>>> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> >>>> the swappiness setting *needs* to be tweaked before Linux will even
> >>>> run in a reasonable way.
> >>> Please send testcases.
> >> It is not happy if you put 256GB into one zone.
> > 
> > Oh come on.  What's the workload?  What happens?  system time?  user time?
> > kernel profiles?
> 
> I can't share all the details, since a lot of the problems are customer
> workloads.
> 
> One particular case is a 32GB system with a database that takes most
> of memory.  The amount of actually freeable page cache memory is in
> the hundreds of MB.

Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

>   With swappiness at the default level of 60, kswapd
> ends up eating most of a CPU, and other tasks also dive into the pageout
> code.  Even with swappiness as high as 98, that system still has
> problems with the CPU use in the pageout code!
> 
> Another typical problem is that people want to back up their database
> servers.  During the backup, parts of the working set get evicted from
> the VM and performance is horrible.

userspace fixes for this are far, far better than any magic goo the kernel
can implement.  We really need to get off our butts and start educating
people.

> A third scenario is where a system has way more RAM than swap, and not
> a whole lot of freeable page cache.  In this case, the VM ends up
> spending WAY too much CPU time scanning and shuffling around essentially
> unswappable anonymous memory and tmpfs files.

Well we've allegedly fixed that, but it isn't going anywhere without
testing.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:06             ` Andrew Morton
@ 2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 18:23                 ` Andrew Morton
  2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 20:59               ` Bill Irwin
  1 sibling, 2 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02 18:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Andrew Morton wrote:

> > One particular case is a 32GB system with a database that takes most
> > of memory.  The amount of actually freeable page cache memory is in
> > the hundreds of MB.
> 
> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

The memory is likely in use, but there is enough free memory in unmapped 
clean pagecache pages that we are occasionally able to free pages. Then 
the app reads more from disk, replenishing that ...
Thus we are forever cycling through the LRU lists, moving pages between 
the lists, aging, etc. This can lead to a livelock.

> > A third scenario is where a system has way more RAM than swap, and not
> > a whole lot of freeable page cache.  In this case, the VM ends up
> > spending WAY too much CPU time scanning and shuffling around essentially
> > unswappable anonymous memory and tmpfs files.
> 
> Well we've allegedly fixed that, but it isn't going anywhere without
> testing.

We have fixed the case in which we compile the kernel without swap. Then 
anonymous pages behave like mlocked pages. Did we do more than that?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:15               ` Christoph Lameter
@ 2007-03-02 18:23                 ` Andrew Morton
  2007-03-02 18:23                 ` Rik van Riel
  1 sibling, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 10:15:36 -0800 (PST)
Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
> > > One particular case is a 32GB system with a database that takes most
> > > of memory.  The amount of actually freeable page cache memory is in
> > > the hundreds of MB.
> > 
> > Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?
> 
> The memory is likely in use, but there is enough free memory in unmapped 
> clean pagecache pages that we are occasionally able to free pages. Then 
> the app reads more from disk, replenishing that ...
> Thus we are forever cycling through the LRU lists, moving pages between 
> the lists, aging, etc. This can lead to a livelock.

Guys, with this level of detail these problems will never be fixed.

> > > A third scenario is where a system has way more RAM than swap, and not
> > > a whole lot of freeable page cache.  In this case, the VM ends up
> > > spending WAY too much CPU time scanning and shuffling around essentially
> > > unswappable anonymous memory and tmpfs files.
> > 
> > Well we've allegedly fixed that, but it isn't going anywhere without
> > testing.
> 
> We have fixed the case in which we compile the kernel without swap. Then 
> anonymous pages behave like mlocked pages. Did we do more than that?

oh yeah, we took the ran-out-of-swapcache code out.  But if we're going to
do this thing, we should find some way to bring it back.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 18:23                 ` Andrew Morton
@ 2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 19:31                   ` Christoph Lameter
  2007-03-02 21:12                   ` Bill Irwin
  1 sibling, 2 replies; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
>>> One particular case is a 32GB system with a database that takes most
>>> of memory.  The amount of actually freeable page cache memory is in
>>> the hundreds of MB.
>> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?
> 
> The memory is likely in use, but there is enough free memory in unmapped 
> clean pagecache pages that we are occasionally able to free pages. Then 
> the app reads more from disk, replenishing that ...
> Thus we are forever cycling through the LRU lists, moving pages between 
> the lists, aging, etc. This can lead to a livelock.

In this particular case, the system even has swap free.

The kernel just chooses not to use it until it has scanned
some memory, due to the way the swappiness algorithm works.

With 32 CPUs diving into the page reclaim simultaneously,
each trying to scan a fraction of memory, this is disastrous
for performance.  A 256GB system should be even worse.

>>> A third scenario is where a system has way more RAM than swap, and not
>>> a whole lot of freeable page cache.  In this case, the VM ends up
>>> spending WAY too much CPU time scanning and shuffling around essentially
>>> unswappable anonymous memory and tmpfs files.
>> Well we've allegedly fixed that, but it isn't going anywhere without
>> testing.
> 
> We have fixed the case in which we compile the kernel without swap. Then 
> anonymous pages behave like mlocked pages. Did we do more than that?

Not AFAIK.

I would like to see separate pageout selection queues
for anonymous/tmpfs and page cache backed pages.  That
way we can simply scan only what we want to scan.

There are several ways available to balance pressure
between both sets of lists.

Splitting them out will also make it possible to do
proper use-once replacement for the page cache pages.
Ie. leaving the really active page cache pages on the
page cache active list, instead of deactivating them
because they're lower priority than anonymous pages.

That way we can do a backup without losing the page
cache working set.
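
A purely illustrative sketch of that split (hypothetical structures, not
an actual patch):

    #include <linux/list.h>

    /* Separate reclaim queues per page type, so that streaming page
     * cache I/O (say, a backup) never forces anonymous pages to be
     * scanned, and vice versa. */
    struct split_lru {
            struct list_head anon_active;
            struct list_head anon_inactive;
            struct list_head file_active;    /* page cache backed */
            struct list_head file_inactive;
    };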

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:16               ` Linus Torvalds
@ 2007-03-02 18:45                 ` Mark Gross
  2007-03-02 19:03                   ` Linus Torvalds
  2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 1 reply; 99+ messages in thread
From: Mark Gross @ 2007-03-02 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 09:16:17AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Mar 2007, Mark Gross wrote:
> > > 
> > > Yes, the same issues exist for other DRAM forms too, but to a *much* 
> > > smaller degree.
> > 
> > DDR3-1333 may be better than FBDIMM's but don't count on it being much
> > better.
> 
> Hey, fair enough. But it's not a problem (and it doesn't have a solution) 
> today. I'm not sure it's going to have a solution tomorrow either.
> 
> > > Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> > > everybody wants bank interleaving, because it's a huge performance win on 
> > > many loads. That, in turn, means that your memory will be spread out over 
> > > multiple DIMM's even for a single page, much less any bigger area.
> > 
> > 4-way interleave across banks on systems may not be as common as you may
> > think for future chip sets.  2-way interleave across DIMMs within a bank
> > will stay.
> 
> .. and think about a realistic future.
> 
> EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
> today, but in the one- to two-year timeframe even Intel will.

True.

> 
> What does that mean? It means that in bigger systems, you will no longer 
> even *have* 8 or 16 banks where turning off a few banks makes sense. 
> You'll quite often have just a few DIMM's per die, because that's what you 
> want for latency. Then you'll have CSI or HT or another interconnect.
> 
> And with a few DIMM's per die, you're back where even just 2-way 
> interleaving basically means that in order to turn off your DIMM, you 
> probably need to remove HALF the memory for that CPU.

I think there will be more than just 2 DIMMs per CPU socket on systems
that care about this type of capability.

> 
> In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
> CHILDREN.


It's very true that taking advantage of the first incarnations of this
type of thing will be limited to specific workloads you personally don't
care about, but it's got applications and customers.

BTW I hope we aren't talking past each other; there are low-power states
where the RAM contents are preserved.

> 
> There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
> nobody actually does it in practice, and nobody even knows if it's going 
> to be viable (yes, DRAM takes energy, but trying to keep memory free will 
> likely waste power *too*, and I doubt anybody has any real idea of how 
> much any of this would actually help in practice).
> 
> And I don't think that will change. See above. The future is *not* moving 
> towards more and more DIMMS. Quite the reverse. On workstations, we are 
> currently in the "one or two DIMM's per die". Do you really think that 
> will change? Hell no. And in big servers, pretty much everybody agrees 
> that we will move towards that, rather than away from it.
> 
> So:
>  - forget about turning DIMM's off. There is *no* actual data supporting 
>    the notion that it's a good idea today, and I seriously doubt you can 
>    really argue that it will be a good idea in five or ten years. It's a 
>    hardware hack for a hardware problem, and the problems are way too 
>    complex for us to solve in time for the solution to be relevant.
> 
>  - aim for NUMA memory allocation and turning off whole *nodes*. That's 
>    much more likely to be productive in the longer timeframe. And yes, we 
>    may well want to do memory compaction for that too, but I suspect that 
>    the issues are going to be different (ie the way to do it is to simply 
>    prefer certain nodes for certain allocations, and then try to keep the 
>    jobs that you know can be idle on other nodes)

We're doing the NUMA approach.

> 
> Do you actually have real data supporting the notion that turning DIMM's 
> off will be reasonable and worthwhile? 
> 

Yes, we have data from our internal and external customers showing that
this stuff is worthwhile for specific workloads that some people care
about.  However, you need to understand that it's by definition marketing data.

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:02                   ` Andrew Morton
@ 2007-03-02 19:02                     ` Mark Gross
  0 siblings, 0 replies; 99+ messages in thread
From: Mark Gross @ 2007-03-02 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 10:02:57AM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 09:35:27 -0800
> Mark Gross <mgross@linux.intel.com> wrote:
> 
> > > 
> > > Will it be possible to just power the DIMMs off?  I don't see much point in
> > > some half-power non-destructive mode.
> > 
> > I think so, but need to double check with the HW folks.
> > 
> > Technically, the DIMMs could be powered off, or put into 2 different
> > low-power non-destructive states (standby and suspend), but putting them
> > in a low-power non-destructive mode has much less latency and provides
> > good bang for the buck given the LOC change needed to make it work.
> > 
> > Which lower power mode an application chooses will depend on latency
> > tolerances of the app.  For the POC activities we are looking at we are
> > targeting the lower latency option, but that doesn't lock out folks from
> > trying to do something with the other options.
> > 
> 
> If we don't evacuate all live data from all of the DIMM, we'll never be
> able to power the thing down in many situations.
> 
> Given that we _have_ emptied the DIMM, we can just turn it off.  And
> refilling it will be slow - often just disk speed.
> 
> So I don't see a useful use-case for non-destructive states.

I'll post the RFC very soon to provide a better thread context for this
line of discussion, but to answer your question:

There are 2 power management policies we are looking at.  The first one
is allocation-based PM, and the other is access-based PM.  The access-based
PM needs chipset support, which is coming at a TBD date.
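
To sketch what allocation-based PM could look like (all helper names
here are hypothetical; nothing like this exists in the kernel today):

	/* evacuate a memory region, then drop it to a low-power state */
	static int power_down_region(struct mem_region *r)
	{
		mark_region_unallocatable(r);	/* stop new allocations  */
		if (migrate_pages_away(r))	/* move movable pages off */
			return -EBUSY;		/* pinned pages remain   */
		return set_region_power_state(r, MEM_PM_STANDBY);
	}

Access-based PM would instead have the chipset watch references and
power down ranks that have been idle, which is why it needs the
hardware support mentioned above.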

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:45                 ` Mark Gross
@ 2007-03-02 19:03                   ` Linus Torvalds
  0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2007-03-02 19:03 UTC (permalink / raw)
  To: Mark Gross
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel


On Fri, 2 Mar 2007, Mark Gross wrote:
> 
> I think there will be more than just 2 DIMMs per CPU socket on systems
> that care about this type of capability.

I agree. I think you'll have a nice mix of 2 and 4, although not likely a 
lot more. You want to have independent channels, and then within a channel 
you want to have as close to point-to-point as possible. 

But the reason that I think you're better off looking at a "node level" is 
that 

 (a) describing the DIMM setup is a total disaster. The interleaving is 
     part of it, but even in the absence of interleaving, we have so far 
     seen that describing DIMM mapping simply isn't a realistic thing to 
     be widely deployed, judging by the fact that we cannot even get a 
     first-order approximate mapping for the ECC error events.

     Going node-level means that we just piggy-back on the existing node 
     mapping, which is a lot more likely to actually be correct and 
     available (ie you may not know which bank is bank0 and how the 
     interleaving works, but you usually *do* know which bank is connected 
     to which CPU package)

     (Btw, I shouldn't have used the word "die", since it's really about 
     package - Intel obviously has a penchant for putting two dies per 
     package)

 (b) especially if you can actually shut down the memory, going node-wide 
     may mean that you can shut down the CPU's too (ie per-package sleep). 
     I bet the people who care enough to care about DIMM's would want to 
     have that *anyway*, so tying them together simplifies the problem.

> BTW I hope we aren't talking past each other; there are low-power states
> where the RAM contents are preserved.

Yes. They are almost as hard to handle, but the advantage is that if we 
get things wrong, it can still work most of the time (ie we don't have to 
migrate everything off, we just need to try to migrate the stuff that gets 
*used* off a DIMM, and hardware will hopefully end up quiescing the right 
memory controller channel totally automatically, without us having to know 
the exact mapping or even having to 100% always get it 100% right).

With FBDIMM in particular, I guess the biggest power cost isn't actually 
the DRAM content, but just the controllers.

Of course, I wonder how much actual point there is to FBDIMM's once you 
have on-die memory controllers and thus the reason for deep queueing is 
basically gone (since you'd spread out the memory rather than having it 
behind a few central controllers).

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:23                 ` Rik van Riel
@ 2007-03-02 19:31                   ` Christoph Lameter
  2007-03-02 19:40                     ` Rik van Riel
  2007-03-02 21:12                   ` Bill Irwin
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2007-03-02 19:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Rik van Riel wrote:

> I would like to see separate pageout selection queues
> for anonymous/tmpfs and page cache backed pages.  That
> way we can simply scan only that what we want to scan.
> 
> There are several ways available to balance pressure
> between both sets of lists.
> 
> Splitting them out will also make it possible to do
> proper use-once replacement for the page cache pages.
> Ie. leaving the really active page cache pages on the
> page cache active list, instead of deactivating them
> because they're lower priority than anonymous pages.

Well I would expect this to have marginal improvements and delay the 
inevitable for a while until we have even bigger memory. If the app uses 
mmapped data areas then the problem is still there. And such tinkering 
does not solve the issue of large scale I/O requiring the handling of 
gazillions of page structs. I do not think that there is a way around 
somehow handling larger chunks of memory in an easier way. We already do 
handle larger page sizes for some limited purposes and with huge pages we 
already have a larger page size. Mel's defrag/anti-frag patches are 
necessary to allow us to deal with the resulting fragmentation problems.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 19:31                   ` Christoph Lameter
@ 2007-03-02 19:40                     ` Rik van Riel
  0 siblings, 0 replies; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Rik van Riel wrote:
> 
>> I would like to see separate pageout selection queues
>> for anonymous/tmpfs and page cache backed pages.  That
>> way we can simply scan only that what we want to scan.
>>
>> There are several ways available to balance pressure
>> between both sets of lists.
>>
>> Splitting them out will also make it possible to do
>> proper use-once replacement for the page cache pages.
>> Ie. leaving the really active page cache pages on the
>> page cache active list, instead of deactivating them
>> because they're lower priority than anonymous pages.
> 
> Well I would expect this to have marginal improvements and delay the 
> inevitable for a while until we have even bigger memory. If the app uses 
> mmapped data areas then the problem is still there.

I suspect we would not need to treat mapped file backed memory any
different from page cache that's not mapped.  After all, if we do
proper use-once accounting, the working set will be on the active
list and other cache will be flushed out the inactive list quickly.

Also, the IO cost for mmapped data areas is the same as the IO
cost for unmapped files, so there's no IO reason to treat them
differently, either.


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:06             ` Andrew Morton
  2007-03-02 18:15               ` Christoph Lameter
@ 2007-03-02 20:59               ` Bill Irwin
  1 sibling, 0 replies; 99+ messages in thread
From: Bill Irwin @ 2007-03-02 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 12:43:42 -0500 Rik van Riel <riel@redhat.com> wrote:
>> I can't share all the details, since a lot of the problems are customer
>> workloads.
>> One particular case is a 32GB system with a database that takes most
>> of memory.  The amount of actually freeable page cache memory is in
>> the hundreds of MB.

On Fri, Mar 02, 2007 at 10:06:19AM -0800, Andrew Morton wrote:
> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

I know of one sounding similar to this where unreclaimable pages are
pinned by refcounts held by bio's spread across about 850 spindles.
It's mostly read traffic. Several different tunables could be used
to work around it, nr_requests in particular, but also clamping down
on dirty limits to preposterously low levels and setting preposterously
large values of min_free_kbytes. Their kernel is, of course,
substantially downrev (2.6.9-based IIRC), so douse things heavily with
grains of salt.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 19:31                   ` Christoph Lameter
@ 2007-03-02 21:12                   ` Bill Irwin
  2007-03-02 21:19                     ` Rik van Riel
  1 sibling, 1 reply; 99+ messages in thread
From: Bill Irwin @ 2007-03-02 21:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> With 32 CPUs diving into the page reclaim simultaneously,
> each trying to scan a fraction of memory, this is disastrous
> for performance.  A 256GB system should be even worse.

Thundering herds of a sort pounding the LRU locks from direct reclaim
have set off the NMI oopser for users here.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:12                   ` Bill Irwin
@ 2007-03-02 21:19                     ` Rik van Riel
  2007-03-02 21:52                       ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 21:19 UTC (permalink / raw)
  To: Bill Irwin, Rik van Riel, Christoph Lameter, Andrew Morton,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

Bill Irwin wrote:
> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>> With 32 CPUs diving into the page reclaim simultaneously,
>> each trying to scan a fraction of memory, this is disastrous
>> for performance.  A 256GB system should be even worse.
> 
> Thundering herds of a sort pounding the LRU locks from direct reclaim
> have set off the NMI oopser for users here.

Ditto here.

The main reason they end up pounding the LRU locks is the
swappiness heuristic.  They scan too much before deciding
that it would be a good idea to actually swap something
out, and with 32 CPUs doing such scanning simultaneously...
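
For reference, the heuristic looks roughly like this in 2.6.18-era
mm/vmscan.c (shrink_active_list; quoted from memory, so approximate):

	/* distress only grows after repeated passes at lower priority */
	distress = 100 >> min(zone->prev_priority, priority);
	mapped_ratio = (global_page_state(NR_FILE_MAPPED) +
			global_page_state(NR_ANON_PAGES)) * 100 / vm_total_pages;
	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
	if (swap_tendency >= 100)
		reclaim_mapped = 1;	/* only now are mapped pages fair game */

Until swap_tendency crosses 100, mapped and anonymous pages are skipped
entirely, so every CPU in direct reclaim keeps re-scanning them under
the LRU lock without making progress.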

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:31     ` Joel Schopp
@ 2007-03-02 21:37       ` Bill Irwin
  0 siblings, 0 replies; 99+ messages in thread
From: Bill Irwin @ 2007-03-02 21:37 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Mel Gorman, Bill Irwin, akpm, npiggin, clameter, mingo, arjan,
	torvalds, mbligh, Linux Memory Management List,
	Linux Kernel Mailing List

At some point in the past, Mel Gorman wrote:
>> I can't think of a workload that totally makes a mess out of list-based. 
>> However, list-based makes no guarantees on availability. If a system 
>> administrator knows they need between 10,000 and 100,000 huge pages and 
>> doesn't want to waste memory pinning too many huge pages at boot-time, 
>> the zone-based mechanism would be what he wanted.

On Fri, Mar 02, 2007 at 10:31:39AM -0600, Joel Schopp wrote:
> From our testing with earlier versions of list based for memory hot-unplug 
> on pSeries machines we were able to hot-unplug huge amounts of memory after 
> running the nastiest workloads we could find for over a week.  Without the 
> patches we were unable to hot-unplug anything within minutes of running the 
> same workloads.
> If something works for 99.999% of people (list based) and there is an easy 
> way to configure it for the other 0.001% of the people ("zone" based) I 
> call that a great solution.  I really don't understand what the resistance 
> is to these patches.

Sorry if I was unclear; I was anticipating others' objections and
offering to assist in responding to them. I myself have no concerns
about the above strategy, apart from generally wanting to recover the
list-based patch's hugepage availability without demanding it as a
merging criterion.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:19                     ` Rik van Riel
@ 2007-03-02 21:52                       ` Andrew Morton
  2007-03-02 22:03                         ` Rik van Riel
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 21:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 16:19:19 -0500
Rik van Riel <riel@redhat.com> wrote:

> Bill Irwin wrote:
> > On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> >> With 32 CPUs diving into the page reclaim simultaneously,
> >> each trying to scan a fraction of memory, this is disastrous
> >> for performance.  A 256GB system should be even worse.
> > 
> > Thundering herds of a sort pounding the LRU locks from direct reclaim
> > have set off the NMI oopser for users here.
> 
> Ditto here.

Opterons?

> The main reason they end up pounding the LRU locks is the
> swappiness heuristic.  They scan too much before deciding
> that it would be a good idea to actually swap something
> out, and with 32 CPUs doing such scanning simultaneously...

What kernel version?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:52                       ` Andrew Morton
@ 2007-03-02 22:03                         ` Rik van Riel
  2007-03-02 22:22                           ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 22:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 02 Mar 2007 16:19:19 -0500
> Rik van Riel <riel@redhat.com> wrote:
>> Bill Irwin wrote:
>>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>>>> With 32 CPUs diving into the page reclaim simultaneously,
>>>> each trying to scan a fraction of memory, this is disastrous
>>>> for performance.  A 256GB system should be even worse.
>>> Thundering herds of a sort pounding the LRU locks from direct reclaim
>>> have set off the NMI oopser for users here.
>> Ditto here.
> 
> Opterons?

It's happened on IA64, too.  Probably on Intel x86-64 as well.

>> The main reason they end up pounding the LRU locks is the
>> swappiness heuristic.  They scan too much before deciding
>> that it would be a good idea to actually swap something
>> out, and with 32 CPUs doing such scanning simultaneously...
> 
> What kernel version?

Customers are on the 2.6.9 based RHEL4 kernel, but I believe
we have reproduced the problem on 2.6.18 too during stress
tests.

I have no reason to believe we should stick our heads in the
sand and pretend it no longer exists on 2.6.21.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:03                         ` Rik van Riel
@ 2007-03-02 22:22                           ` Andrew Morton
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-03  0:33                             ` William Lee Irwin III
  0 siblings, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 22:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 17:03:10 -0500
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Fri, 02 Mar 2007 16:19:19 -0500
> > Rik van Riel <riel@redhat.com> wrote:
> >> Bill Irwin wrote:
> >>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> >>>> With 32 CPUs diving into the page reclaim simultaneously,
> >>>> each trying to scan a fraction of memory, this is disastrous
> >>>> for performance.  A 256GB system should be even worse.
> >>> Thundering herds of a sort pounding the LRU locks from direct reclaim
> >>> have set off the NMI oopser for users here.
> >> Ditto here.
> > 
> > Opterons?
> 
> It's happened on IA64, too.  Probably on Intel x86-64 as well.

Opterons seem to be particularly prone to lock starvation where a cacheline
gets captured in a single package for ever.

> >> The main reason they end up pounding the LRU locks is the
> >> swappiness heuristic.  They scan too much before deciding
> >> that it would be a good idea to actually swap something
> >> out, and with 32 CPUs doing such scanning simultaneously...
> > 
> > What kernel version?
> 
> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
> we have reproduced the problem on 2.6.18 too during stress
> tests.

The prev_priority fixes were post-2.6.18

> I have no reason to believe we should stick our heads in the
> sand and pretend it no longer exists on 2.6.21.

I have no reason to believe anything.  All I see is handwaviness,
speculation and grand plans to rewrite vast amounts of stuff without even a
testcase to demonstrate that said rewrite improved anything.

None of this is going anywhere, is it?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:22                           ` Andrew Morton
@ 2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
                                                 ` (2 more replies)
  2007-03-03  0:33                             ` William Lee Irwin III
  1 sibling, 3 replies; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 02 Mar 2007 17:03:10 -0500
> Rik van Riel <riel@redhat.com> wrote:
> 
>> Andrew Morton wrote:
>>> On Fri, 02 Mar 2007 16:19:19 -0500
>>> Rik van Riel <riel@redhat.com> wrote:
>>>> Bill Irwin wrote:
>>>>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>>>>>> With 32 CPUs diving into the page reclaim simultaneously,
>>>>>> each trying to scan a fraction of memory, this is disastrous
>>>>>> for performance.  A 256GB system should be even worse.
>>>>> Thundering herds of a sort pounding the LRU locks from direct reclaim
>>>>> have set off the NMI oopser for users here.
>>>> Ditto here.
>>> Opterons?
>> It's happened on IA64, too.  Probably on Intel x86-64 as well.
> 
> Opterons seem to be particularly prone to lock starvation where a cacheline
> gets captured in a single package for ever.
> 
>>>> The main reason they end up pounding the LRU locks is the
>>>> swappiness heuristic.  They scan too much before deciding
>>>> that it would be a good idea to actually swap something
>>>> out, and with 32 CPUs doing such scanning simultaneously...
>>> What kernel version?
>> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
>> we have reproduced the problem on 2.6.18 too during stress
>> tests.
> 
> The prev_priority fixes were post-2.6.18

We tested them.  They only alleviate the problem slightly in
good situations, but things still fall apart badly with less
friendly workloads.

>> I have no reason to believe we should stick our heads in the
>> sand and pretend it no longer exists on 2.6.21.
> 
> I have no reason to believe anything.  All I see is handwaviness,
> speculation and grand plans to rewrite vast amounts of stuff without even a
> testcase to demonstrate that said rewrite improved anything.

Your attitude is exactly why the VM keeps falling apart over
and over again.

Fixing "a testcase" in the VM tends to introduce problems for
other test cases, ad infinitum. There's a reason we end up
fixing the same bugs over and over again.

I have been looking through a few hundred VM related bugzillas
and have found the same bugs persist over many different
versions of Linux, sometimes temporarily fixed, but they seem
to always come back eventually...

> None of this is going anywhere, is it?

I will test my changes before I send them to you, but I cannot
promise you that you'll have the computers or software needed
to reproduce the problems.  I doubt I'll have full time access
to such systems myself, either.

32GB is pretty much the minimum size to reproduce some of these
problems. Some workloads may need larger systems to easily trigger
them.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
@ 2007-03-02 22:51                               ` Martin Bligh
  2007-03-02 22:54                                 ` Rik van Riel
  2007-03-02 22:52                               ` Chuck Ebbert
  2007-03-02 22:59                               ` Andrew Morton
  2 siblings, 1 reply; 99+ messages in thread
From: Martin Bligh @ 2007-03-02 22:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

>> None of this is going anywhere, is it?
> 
> I will test my changes before I send them to you, but I cannot
> promise you that you'll have the computers or software needed
> to reproduce the problems.  I doubt I'll have full time access
> to such systems myself, either.
> 
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger
> them.

We can find a 32GB system here pretty easily to test things on if
need be.  Setting up large commercial databases is much harder.

I don't have such a machine in the public set of machines we're going
to push to test.kernel.org from at the moment, but will see if I can
arrange it in the future if it's important.


M.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
@ 2007-03-02 22:52                               ` Chuck Ebbert
  2007-03-02 22:59                               ` Andrew Morton
  2 siblings, 0 replies; 99+ messages in thread
From: Chuck Ebbert @ 2007-03-02 22:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, mbligh, linux-mm,
	linux-kernel

Rik van Riel wrote:
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger
> them.
> 

Hundreds of disks all doing IO at once may also be needed, as
wli points out. Such systems are not readily available for testing.




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:51                               ` Martin Bligh
@ 2007-03-02 22:54                                 ` Rik van Riel
  2007-03-02 23:28                                   ` Martin J. Bligh
  0 siblings, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 22:54 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

Martin Bligh wrote:
>>> None of this is going anywhere, is it?
>>
>> I will test my changes before I send them to you, but I cannot
>> promise you that you'll have the computers or software needed
>> to reproduce the problems.  I doubt I'll have full time access
>> to such systems myself, either.
>>
>> 32GB is pretty much the minimum size to reproduce some of these
>> problems. Some workloads may need larger systems to easily trigger
>> them.
> 
> We can find a 32GB system here pretty easily to test things on if
> need be.  Setting up large commercial databases is much harder.

That's my problem, too.

There does not seem to exist any single set of test cases that
accurately predicts how the VM will behave with customer
workloads.

The one thing I can do relatively easily is go through a few
hundred bugzillas and figure out what kinds of problems have
been plaguing the VM consistently over the last few years.
I just finished doing that, and am trying to come up with
fixes for the problems that just don't seem to be easily
fixable with bandaids...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
  2007-03-02 22:52                               ` Chuck Ebbert
@ 2007-03-02 22:59                               ` Andrew Morton
  2007-03-02 23:20                                 ` Rik van Riel
  2007-03-03  1:40                                 ` William Lee Irwin III
  2 siblings, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-02 22:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 17:34:31 -0500
Rik van Riel <riel@redhat.com> wrote:

> >>>> The main reason they end up pounding the LRU locks is the
> >>>> swappiness heuristic.  They scan too much before deciding
> >>>> that it would be a good idea to actually swap something
> >>>> out, and with 32 CPUs doing such scanning simultaneously...
> >>> What kernel version?
> >> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
> >> we have reproduced the problem on 2.6.18 too during stress
> >> tests.
> > 
> > The prev_priority fixes were post-2.6.18
> 
> We tested them.  They only alleviate the problem slightly in
> good situations, but things still fall apart badly with less
> friendly workloads.

What is it with vendors finding MM problems and either not fixing them or
kludging around them and not telling the upstream maintainers about *any*
of it?

> >> I have no reason to believe we should stick our heads in the
> >> sand and pretend it no longer exists on 2.6.21.
> > 
> > I have no reason to believe anything.  All I see is handwaviness,
> > speculation and grand plans to rewrite vast amounts of stuff without even a
> > testcase to demonstrate that said rewrite improved anything.
> 
> Your attitude is exactly why the VM keeps falling apart over
> and over again.
> 
> Fixing "a testcase" in the VM tends to introduce problems for
> other test cases, ad infinitum.

In that case it was a bad fix.  The aim is to fix known problems without
introducing regressions in other areas.  A perfectly legitimate approach.

You seem to be saying that we'd be worse off if we actually had a testcase.

> There's a reason we end up
> fixing the same bugs over and over again.

No we don't.

> I have been looking through a few hundred VM related bugzillas
> and have found the same bugs persist over many different
> versions of Linux, sometimes temporarily fixed, but they seem
> to always come back eventually...
> 
> > None of this is going anywhere, is it?
> 
> I will test my changes before I send them to you, but I cannot
> promise you that you'll have the computers or software needed
> to reproduce the problems.  I doubt I'll have full time access
> to such systems myself, either.
> 
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger

32GB isn't particularly large.

Somehow I don't believe that a person or organisation which is incapable of
preparing even a simple testcase will be capable of fixing problems such as
this without breaking things.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:59                               ` Andrew Morton
@ 2007-03-02 23:20                                 ` Rik van Riel
  2007-03-03  1:40                                 ` William Lee Irwin III
  1 sibling, 0 replies; 99+ messages in thread
From: Rik van Riel @ 2007-03-02 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:

> Somehow I don't believe that a person or organisation which is incapable of
> preparing even a simple testcase will be capable of fixing problems such as
> this without breaking things.

I don't believe anybody who relies on one simple test case will
ever be capable of evaluating a patch without breaking things.

Test cases can show problems, but fixing a test case is no
guarantee at all that your VM will behave ok with real world
workloads.  Test cases for the VM can *never* be relied on
to show that a problem went away.

I'll do my best, but I can't promise a simple test case
for every single problem that's plaguing the VM.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:54                                 ` Rik van Riel
@ 2007-03-02 23:28                                   ` Martin J. Bligh
  2007-03-03  0:24                                     ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Martin J. Bligh @ 2007-03-02 23:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

>>> 32GB is pretty much the minimum size to reproduce some of these
>>> problems. Some workloads may need larger systems to easily trigger
>>> them.
>>
>> We can find a 32GB system here pretty easily to test things on if
>> need be.  Setting up large commercial databases is much harder.
> 
> That's my problem, too.
> 
> There does not seem to exist any single set of test cases that
> accurately predicts how the VM will behave with customer
> workloads.

Tracing might help? Showing Andrew traces of what happened in
production for the prev_priority change made it much easier to
demonstrate and explain the real problem ...

M.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:16               ` Linus Torvalds
  2007-03-02 18:45                 ` Mark Gross
@ 2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 0 replies; 99+ messages in thread
From: Martin J. Bligh @ 2007-03-02 23:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Gross, Andrew Morton, Balbir Singh, Mel Gorman, npiggin,
	clameter, mingo, jschopp, arjan, linux-mm, linux-kernel

> .. and think about a realistic future.
> 
> EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
> today, but in the one- to two-year timeframe even Intel will.
> 
> What does that mean? It means that in bigger systems, you will no longer 
> even *have* 8 or 16 banks where turning off a few banks makes sense. 
> You'll quite often have just a few DIMM's per die, because that's what you 
> want for latency. Then you'll have CSI or HT or another interconnect.
> 
> And with a few DIMM's per die, you're back where even just 2-way 
> interleaving basically means that in order to turn off your DIMM, you 
> probably need to remove HALF the memory for that CPU.
> 
> In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
> CHILDREN.

Even with only 4 banks per CPU, and 2-way interleaving, we could still
power off half the DIMMs in the system. That's a huge impact on the
power budget for a large cluster.

No, it's not ideal, but what was that quote again ... "perfect is the
enemy of good"? Something like that ;-)
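
Back-of-envelope, with purely illustrative numbers: at roughly 10W per
FBDIMM (AMB included), a 1000-node cluster with 8 DIMMs per node that
can idle half of them saves on the order of

	1000 nodes * 4 DIMMs * 10W = 40kW

before you even count the cooling overhead that power implies.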

> There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
> nobody actually does it in practice, and nobody even knows if it's going 
> to be viable (yes, DRAM takes energy, but trying to keep memory free will 
> likely waste power *too*, and I doubt anybody has any real idea of how 
> much any of this would actually help in practice).

Batch jobs across clusters have spikes at different times of the day,
etc that are fairly predictable in many cases.

M.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 23:28                                   ` Martin J. Bligh
@ 2007-03-03  0:24                                     ` Andrew Morton
  0 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-03  0:24 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

On Fri, 02 Mar 2007 15:28:43 -0800
"Martin J. Bligh" <mbligh@mbligh.org> wrote:

> >>> 32GB is pretty much the minimum size to reproduce some of these
> >>> problems. Some workloads may need larger systems to easily trigger
> >>> them.
> >>
> >> We can find a 32GB system here pretty easily to test things on if
> >> need be.  Setting up large commercial databases is much harder.
> > 
> > That's my problem, too.
> > 
> > There does not seem to exist any single set of test cases that
> > accurately predicts how the VM will behave with customer
> > workloads.
> 
> Tracing might help? Showing Andrew traces of what happened in
> production for the prev_priority change made it much easier to
> demonstrate and explain the real problem ...
> 

Tracing is one way.

The other way is the old scientific method:

- develop a theory
- add sufficient instrumentation to prove or disprove that theory
- run workload, crunch on numbers
- repeat

Of course, multiple theories can be proven/disproven in a single pass.

Practically, this means adding one new /proc/vmstat entry for each `goto
keep*' in shrink_page_list().  And more instrumentation in
shrink_active_list() to determine the behaviour of swap_tendency.

Once that process is finished, we should have a thorough understanding of
what the problem is.  We can then construct a testcase (it'll be a couple
hundred lines only) and use that testcase to determine what implementation
changes are needed, and whether it actually worked.

Then go back to the real workload, verify that it's still fixed.

Then do whitebox testing of other workloads to check that they haven't
regressed.
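
As a sketch of that first instrumentation step (the counter names are
invented; the shrink_page_list() exits are paraphrased from memory of
the 2.6.18 code):

	if (TestSetPageLocked(page)) {
		count_vm_event(PGKEEP_LOCKED);		/* new /proc/vmstat field */
		goto keep;
	}
	if (PageWriteback(page)) {
		count_vm_event(PGKEEP_WRITEBACK);	/* new /proc/vmstat field */
		goto keep_locked;
	}

Reading the counters before and after a run shows which exit dominates,
which is exactly the data the theory stands or falls on.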


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:22                           ` Andrew Morton
  2007-03-02 22:34                             ` Rik van Riel
@ 2007-03-03  0:33                             ` William Lee Irwin III
  2007-03-03  0:54                               ` Andrew Morton
  2007-03-03  3:15                               ` Christoph Lameter
  1 sibling, 2 replies; 99+ messages in thread
From: William Lee Irwin III @ 2007-03-03  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> Opterons seem to be particularly prone to lock starvation where a cacheline
> gets captured in a single package for ever.

AIUI that phenomenon is universal to NUMA. Maybe it's time we
reexamined our locking algorithms in the light of fairness
considerations.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  0:33                             ` William Lee Irwin III
@ 2007-03-03  0:54                               ` Andrew Morton
  2007-03-03  3:15                               ` Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2007-03-03  0:54 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 16:33:19 -0800
William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> > Opterons seem to be particularly prone to lock starvation where a cacheline
> > gets captured in a single package for ever.
> 
> AIUI that phenomenon is universal to NUMA. Maybe it's time we
> reexamined our locking algorithms in the light of fairness
> considerations.
> 

It's also a multicore thing.  iirc Kiran was seeing it on Intel CPUs.

I expect the phenomenon would be observable on a number of locks in the
kernel, given the appropriate workload.  We just hit it first on lru_lock.

I'd have thought that increasing SWAP_CLUSTER_MAX by two or four orders of
magnitude would plug it, simply by decreasing the acquisition frequency, but
I think Kiran fiddled with that to no effect.
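
(For context, paraphrasing 2.6.18-era shrink_inactive_list() from
memory: the lock is taken once per batch of sc->swap_cluster_max pages,
so the batch size directly sets the acquisition frequency.)

	spin_lock_irq(&zone->lru_lock);
	nr_taken = isolate_lru_pages(sc->swap_cluster_max,
				     &zone->inactive_list,
				     &page_list, &nr_scan);
	spin_unlock_irq(&zone->lru_lock);

	nr_freed = shrink_page_list(&page_list, sc);	/* lock not held here */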


See below for Linus's thoughts, forwarded without permission..





Begin forwarded message:

Date: Mon, 22 Jan 2007 13:49:02 -0800 (PST)
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Ravikiran G Thirumalai <kiran@scalex86.org>
Subject: Re: High lock spin time for zone->lru_lock under extreme conditions



On Mon, 22 Jan 2007, Andrew Morton wrote:
> 
> Please review the whole thread sometime.  I think we're pretty screwed, and
> the problem will only become worse as more cores get rolled out and I don't
> know what to do about it apart from whining to Intel, but that won't fix
> anything.

I think people need to realize that spinlocks are always going to be 
unfair, and *extremely* so under some conditions. And yes, multi-core 
brought those conditions home to roost for some people (two or more cores 
much closer to each other than others, and able to basically ping-pong the 
spinlock to each other, with nobody else ever able to get it).

There's only a few possible solutions:

 - use the much slower semaphores, which actually try to do fairness. 

 - if you cannot sleep, introduce a separate "fair spinlock" type. It's 
   going to be appreciably slower (and will possibly have a bigger memory 
   footprint) than a regular spinlock, though. But it's certainly a 
   possible thing to do.

 - make sure no lock that you care about ever has high enough contention 
   to matter. NOTE! back-off etc simply will not help. This is not a 
   back-off issue. Back-off helps keep down coherency traffic, but it 
   doesn't help fairness.

If somebody wants to play with fair spinlocks, go wild. I looked at it at 
one point, and it was not wonderful. It's pretty complicated to do, and 
the best way I could come up with was literally a list of waiting CPU's 
(but you only need one static list entry per CPU). I didn't bother to 
think a whole lot about it.

The "never enough contention" is the real solution. For example, anything 
that drops and just re-takes the lock again (which some paths do for 
latency reduction) won't do squat. The same CPU that dropped the lock will 
basically always be able to retake it (and multi-core just means that is 
even more true, with the lock staying within one die even if some other 
core can get it).

Of course, "never enough contention" may not be possible for all locks. 
Which is why a "fair spinlock" may be the solution - use it for the few 
locks that care (and the VM locks could easily be it).

What CANNOT work: timeouts. A watchdog won't work. If you have workloads 
with enough contention, once you have enough CPU's, there's no upper bound 
on one of the cores not being able to get the lock.

On the other hand, what CAN work is: not caring. If it's ok to not be 
fair, and it only happens under extreme load, then "we don't care" is a 
perfectly fine option. 

In the "it could work" corner, I used to hope that cache coherency 
protocols in hw would do some kind of fairness thing, but I've come to the 
conclusion that it's just too hard. It's hard enough for software, it's 
likely really painful for hw too. So not only does hw generally not do it 
today (although certain platforms are better at it than others), I don't 
really expect this to change.

If anything, we'll see more of it, since multicore is one thing that makes 
things worse (as does multiple levels of caching - NUMA machines tend to 
have this problem even without multi-core, simply because they don't have 
a shared bus, which happens to hide many cases).

I'm personally in the "don't care" camp, until somebody shows a real-life 
workload. I'd often prefer to disable a watchdog if that's the biggest 
problem, for example. But if there's a real load that shows this as a real 
problem...

			Linus
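
For reference, the simplest "fair spinlock" of the kind described above
is a ticket lock; an untested sketch, not proposed code:

	struct ticket_lock {
		atomic_t next;		/* next ticket to hand out */
		atomic_t owner;		/* ticket now being served */
	};

	static inline void ticket_lock(struct ticket_lock *l)
	{
		int me = atomic_inc_return(&l->next) - 1;	/* take a ticket */

		while (atomic_read(&l->owner) != me)
			cpu_relax();				/* wait in FIFO order */
	}

	static inline void ticket_unlock(struct ticket_lock *l)
	{
		atomic_inc(&l->owner);				/* serve the next waiter */
	}

This gives strict FIFO ordering, but every waiter polls the same
cacheline; the per-CPU list of waiters mentioned above avoids that at
the cost of complexity and memory.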


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:59                               ` Andrew Morton
  2007-03-02 23:20                                 ` Rik van Riel
@ 2007-03-03  1:40                                 ` William Lee Irwin III
  2007-03-03  1:58                                   ` Andrew Morton
  1 sibling, 1 reply; 99+ messages in thread
From: William Lee Irwin III @ 2007-03-03  1:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> What is it with vendors finding MM problems and either not fixing them or
> kludging around them and not telling the upstream maintainers about *any*
> of it?

I'm not in the business of defending vendors, but a lot of times the
base is so far downrev it's difficult to relate it to much of anything
current. It may be best not to say precisely how far downrev things can
get, since some of these things are so old even distro vendors won't
touch them.


On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> Somehow I don't believe that a person or organisation which is incapable of
> preparing even a simple testcase will be capable of fixing problems such as
> this without breaking things.

My gut feeling is to agree, but I get nagging doubts when I try to
think of how to boil things like [major benchmarks whose names are
trademarked/copyrighted/etc. censored] down to simple testcases. Some
other things are obvious but require vast resources, like zillions of
disks fooling throttling/etc. heuristics of ancient downrev kernels.
I guess for those sorts of things the voodoo incantations, chicken
blood, and carcasses of freshly slaughtered goats come out. Might as
well throw in a Tarot reading and some tea leaves while I'm at it.

My tack on basic stability was usually testbooting on several arches,
which various people have an active disinterest in (suggesting, for
example, that I throw out all of my sparc32 systems and replace them
with Opterons, or that anything that goes wrong on ia64 is not only
irrelevant but also that neither I nor anyone else should ever fix them;
you know who you are). It's become clear to me that this is insufficient,
and that I'll need to start using some sort of suite of regression tests,
at the very least to save myself the embarrassment of acking a patch that
oopses when exercised, but also to elevate the standard.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  1:40                                 ` William Lee Irwin III
@ 2007-03-03  1:58                                   ` Andrew Morton
  2007-03-03  3:55                                     ` William Lee Irwin III
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-03  1:58 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 17:40:04 -0800
William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> > Somehow I don't believe that a person or organisation which is incapable of
> > preparing even a simple testcase will be capable of fixing problems such as
> > this without breaking things.
> 
> My gut feeling is to agree, but I get nagging doubts when I try to
> think of how to boil things like [major benchmarks whose names are
> trademarked/copyrighted/etc. censored] down to simple testcases. Some
> other things are obvious but require vast resources, like zillions of
> disks fooling throttling/etc. heuristics of ancient downrev kernels.

noooooooooo.  You're approaching it from the wrong direction.

Step 1 is to understand what is happening on the affected production
system.  Completely.  Once that is fully understood then it is a relatively
simple matter to concoct a test case which triggers the same failure mode.

It is very hard to go the other way: to poke around with various stress
tests which you think are doing something similar to what you think the
application does in the hope that similar symptoms will trigger so you can
then work out what the kernel is doing.  yuk.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  0:33                             ` William Lee Irwin III
  2007-03-03  0:54                               ` Andrew Morton
@ 2007-03-03  3:15                               ` Christoph Lameter
  2007-03-03  4:19                                 ` William Lee Irwin III
  2007-03-03 17:16                                 ` Martin J. Bligh
  1 sibling, 2 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-03  3:15 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Rik van Riel, Bill Irwin, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, William Lee Irwin III wrote:

> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> > Opterons seem to be particularly prone to lock starvation where a cacheline
> > gets captured in a single package for ever.
> 
> AIUI that phenomenon is universal to NUMA. Maybe it's time we
> reexamined our locking algorithms in the light of fairness
> considerations.

This is a phenomenon that is usually addressed at the cache logic
level. It's a hardware maturation issue: a given package should not be
allowed to hold onto a cacheline forever, and other packages must be
guaranteed a minimum window in which they can operate on that
cacheline.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  1:58                                   ` Andrew Morton
@ 2007-03-03  3:55                                     ` William Lee Irwin III
  0 siblings, 0 replies; 99+ messages in thread
From: William Lee Irwin III @ 2007-03-03  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 17:40:04 -0800 William Lee Irwin III <wli@holomorphy.com> wrote:
>> My gut feeling is to agree, but I get nagging doubts when I try to
>> think of how to boil things like [major benchmarks whose names are
>> trademarked/copyrighted/etc. censored] down to simple testcases. Some
>> other things are obvious but require vast resources, like zillions of
>> disks fooling throttling/etc. heuristics of ancient downrev kernels.

On Fri, Mar 02, 2007 at 05:58:56PM -0800, Andrew Morton wrote:
> noooooooooo.  You're approaching it from the wrong direction.
> Step 1 is to understand what is happening on the affected production
> system.  Completely.  Once that is fully understood then it is a relatively
> simple matter to concoct a test case which triggers the same failure mode.
> It is very hard to go the other way: to poke around with various stress
> tests which you think are doing something similar to what you think the
> application does in the hope that similar symptoms will trigger so you can
> then work out what the kernel is doing.  yuk.

Yeah, it's really great when it's possible to get debug info out of
people, e.g. when they're willing to boot into a kernel instrumented
with the appropriate printks etc. Most of the time it's all guesswork.
People who post to lkml are much better about all this, on average.

I never truly understood the point of kprobes/jprobes/dprobes (or
whatever the probing letter is), crash dumps, and so on until I ran
into this, not that I personally use them (though I may yet start).
Most of the time I just read the code instead and smoke out what could
be going on by something like the process of devising counterexamples.
For instance, I told that colouroff patch guy about the possibility of
getting the wrong page for the start of the buffer from virt_to_page()
on a cache-colored buffer pointer (clearly cache->gfporder >= 4 in such
a case). Deriving the head page without __GFP_COMP might be considered
ugly-looking, though.
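
For the curious, that derivation might look like the following
(hypothetical helper; it assumes the slab's pages come from a naturally
aligned buddy allocation of 2^gfporder pages in the direct mapping):

/* virt_to_page() on a colored object pointer yields the page containing
 * the pointer, not the slab's first page; without __GFP_COMP there is no
 * compound-page metadata to follow back, so mask the address down to the
 * allocation boundary instead. */
static inline void *slab_head(void *obj, unsigned int gfporder)
{
	unsigned long mask = ~((PAGE_SIZE << gfporder) - 1);

	return (void *)((unsigned long)obj & mask);
}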


-- wli


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  3:15                               ` Christoph Lameter
@ 2007-03-03  4:19                                 ` William Lee Irwin III
  2007-03-03 17:16                                 ` Martin J. Bligh
  1 sibling, 0 replies; 99+ messages in thread
From: William Lee Irwin III @ 2007-03-03  4:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Rik van Riel, Bill Irwin, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, William Lee Irwin III wrote:
>> AIUI that phenomenon is universal to NUMA. Maybe it's time we
>> reexamined our locking algorithms in the light of fairness
>> considerations.

On Fri, Mar 02, 2007 at 07:15:38PM -0800, Christoph Lameter wrote:
> This is a phenomenon that is usually addressed at the cache logic
> level. It's a hardware maturation issue: a given package should not be
> allowed to hold onto a cacheline forever, and other packages must be
> guaranteed a minimum window in which they can operate on that
> cacheline.

I think when I last asked about that I was told "cache directories are
too expensive" or something along those lines, if I'm not botching that
too. In any event, the above shows a gross inaccuracy in my statement.


-- wli


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  3:15                               ` Christoph Lameter
  2007-03-03  4:19                                 ` William Lee Irwin III
@ 2007-03-03 17:16                                 ` Martin J. Bligh
  2007-03-03 17:50                                   ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Martin J. Bligh @ 2007-03-03 17:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, Andrew Morton, Rik van Riel, Bill Irwin,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, linux-mm,
	linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, William Lee Irwin III wrote:
> 
>> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
>>> Opterons seem to be particularly prone to lock starvation where a cacheline
>>> gets captured in a single package for ever.
>> AIUI that phenomenon is universal to NUMA. Maybe it's time we
>> reexamined our locking algorithms in the light of fairness
>> considerations.
> 
> This is a phenomenon that is usually addressed at the cache logic
> level. It's a hardware maturation issue: a given package should not be
> allowed to hold onto a cacheline forever, and other packages must be
> guaranteed a minimum window in which they can operate on that
> cacheline.

That'd be nice. Unfortunately we're stuck in the real world with
real hardware, and the situation is likely to remain thus for
quite some time ...

M.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03 17:16                                 ` Martin J. Bligh
@ 2007-03-03 17:50                                   ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2007-03-03 17:50 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Andrew Morton, Rik van Riel, Bill Irwin,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, linux-mm,
	linux-kernel

On Sat, 3 Mar 2007, Martin J. Bligh wrote:

> That'd be nice. Unfortunately we're stuck in the real world with
> real hardware, and the situation is likely to remain thus for
> quite some time ...

Our real hardware does behave as described and therefore does not suffer 
from the problem.

If you want a software solution then you may want to look at Zoran
Radovic's work on Hierarchical Backoff locks. I had a draft of a patch
a couple of years back that showed some promise in reducing lock
contention. HBO locks can solve starvation issues by throttling local
lock takers.

See Zoran Radovic, "Software Techniques for Distributed Shared Memory",
Uppsala Universitet, 2005, ISBN 91-554-6385-1.

http://www.gelato.org/pdf/may2005/gelato_may2005_numa_lameter_sgi.pdf

http://www.gelato.unsw.edu.au/archives/linux-ia64/0506/14368.html
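
For illustration, a minimal user-space sketch of the idea (invented
names and thresholds, not Radovic's actual algorithm): remote waiters
back off harder to cut coherence traffic, and a remote node that has
waited too long throttles local lock takers:

#include <stdatomic.h>
#include <sched.h>

#define HBO_UNLOCKED		(-1)
#define HBO_STARVE_SPINS	1000

struct hbo_lock {
	_Atomic int owner_node;		/* node id of holder, or HBO_UNLOCKED */
	_Atomic int remote_starving;	/* a remote node has waited too long */
};

static void backoff(int iterations)
{
	for (int i = 0; i < iterations; i++)
		sched_yield();		/* stand-in for cpu_relax() */
}

void hbo_lock(struct hbo_lock *l, int my_node)
{
	int remote_spins = 0;

	for (;;) {
		int expected = HBO_UNLOCKED;

		if (atomic_compare_exchange_weak(&l->owner_node,
						 &expected, my_node))
			break;

		if (expected == my_node) {
			/* Holder is on my node: spin cheaply, but stand
			 * aside for a starving remote node. */
			backoff(atomic_load(&l->remote_starving) ? 64 : 1);
		} else {
			/* Remote holder: back off harder, and raise the
			 * starvation flag if this drags on. */
			backoff(8);
			if (++remote_spins > HBO_STARVE_SPINS)
				atomic_store(&l->remote_starving, 1);
		}
	}
	atomic_store(&l->remote_starving, 0);
}

void hbo_unlock(struct hbo_lock *l)
{
	atomic_store(&l->owner_node, HBO_UNLOCKED);
}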


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:12                                 ` Nick Piggin
  2007-03-02  8:21                                   ` Christoph Lameter
@ 2007-03-04  1:26                                   ` Rik van Riel
  2007-03-04  1:51                                     ` Andrew Morton
  1 sibling, 1 reply; 99+ messages in thread
From: Rik van Riel @ 2007-03-04  1:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

Nick Piggin wrote:

> Different issue, isn't it? Rik wants to be smarter in figuring out which
> pages to throw away. More work per page == worse for you.

Being smarter about figuring out which pages to evict does not equate
to doing more work per page.  One big component is sorting the pages
beforehand, so that we do not end up scanning through (and randomizing
the LRU order of) anonymous pages when we do not want to, or cannot,
evict them anyway.
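
As a minimal sketch of that sorting (illustrative names, not what the
kernel did at the time): keep anonymous and file-backed pages on
separate LRU lists, so a pass that cannot or should not swap never
walks the anon list at all:

#include <linux/list.h>

/* Hypothetical helper: reclaim from a single list. */
extern void shrink_list_of_pages(struct list_head *pages);

struct split_lru {
	struct list_head anon;	/* anonymous/shm pages */
	struct list_head file;	/* file-backed page cache */
};

/* Because the pages are pre-sorted by type, reclaim without swap skips
 * the anon list entirely instead of scanning past those pages and
 * scrambling their LRU order. */
static void shrink_split_lru(struct split_lru *lru, int may_swap)
{
	if (may_swap)
		shrink_list_of_pages(&lru->anon);
	shrink_list_of_pages(&lru->file);
}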

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-04  1:26                                   ` Rik van Riel
@ 2007-03-04  1:51                                     ` Andrew Morton
  2007-03-04  1:58                                       ` Rik van Riel
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2007-03-04  1:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Christoph Lameter, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

On Sat, 03 Mar 2007 20:26:15 -0500 Rik van Riel <riel@redhat.com> wrote:

> Nick Piggin wrote:
> 
> > Different issue, isn't it? Rik wants to be smarter in figuring out which
> > pages to throw away. More work per page == worse for you.
> 
> Being smarter about figuring out which pages to evict does not equate
> to doing more work per page.  One big component is sorting the pages
> beforehand, so that we do not end up scanning through (and randomizing
> the LRU order of) anonymous pages when we do not want to, or cannot,
> evict them anyway.
> 

My gut feel is that we could afford to expend a lot more cycles-per-page
doing stuff to avoid IO than we presently do.

At least, reclaim normally just doesn't figure in system CPU time, except
for when it's gone completely stupid.

It could well be that we sleep too much in there though.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-04  1:51                                     ` Andrew Morton
@ 2007-03-04  1:58                                       ` Rik van Riel
  0 siblings, 0 replies; 99+ messages in thread
From: Rik van Riel @ 2007-03-04  1:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Christoph Lameter, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Sat, 03 Mar 2007 20:26:15 -0500 Rik van Riel <riel@redhat.com> wrote:
>> Nick Piggin wrote:
>>
>>> Different issue, isn't it? Rik wants to be smarter in figuring out which
>>> pages to throw away. More work per page == worse for you.
>> Being smarter about figuring out which pages to evict does not equate
>> to doing more work per page.  One big component is sorting the pages
>> beforehand, so that we do not end up scanning through (and randomizing
>> the LRU order of) anonymous pages when we do not want to, or cannot,
>> evict them anyway.
>>
> 
> My gut feel is that we could afford to expend a lot more cycles-per-page
> doing stuff to avoid IO than we presently do.

In general, yes.

In the specific "128GB RAM, 90GB anon/shm/... and 2GB swap" case, no :)

> At least, reclaim normally just doesn't figure in system CPU time, except
> for when it's gone completely stupid.
> 
> It could well be that we sleep too much in there though.

It's all about minimizing IO, I suspect.

Not just the total amount of IO, though, but also the amount of pageout
IO that's in flight at once, so that we do not introduce stupidly high
latencies.
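
As a sketch of that last point (invented names, plain C): cap the
pageout IO in flight with a simple counter, so reclaim defers writes
rather than queueing up unbounded latency:

#include <stdatomic.h>

#define MAX_PAGEOUT_IN_FLIGHT	32

static _Atomic int pageout_in_flight;

/* Returns 1 if a slot was reserved and the write may be submitted,
 * 0 if too much pageout IO is already in flight (skip/defer the page). */
static int try_start_pageout(void)
{
	int cur = atomic_load(&pageout_in_flight);

	while (cur < MAX_PAGEOUT_IN_FLIGHT) {
		if (atomic_compare_exchange_weak(&pageout_in_flight,
						 &cur, cur + 1))
			return 1;
	}
	return 0;
}

/* Call from the IO completion path. */
static void pageout_done(void)
{
	atomic_fetch_sub(&pageout_in_flight, 1);
}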

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:05     ` Joel Schopp
@ 2007-03-05  3:21       ` Nick Piggin
  2007-03-05 15:20         ` Joel Schopp
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-05  3:21 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 11:05:15AM -0600, Joel Schopp wrote:
> Linus Torvalds wrote:
> >
> >On Thu, 1 Mar 2007, Andrew Morton wrote:
> >>So some urgent questions are: how are we going to do mem hotunplug and
> >>per-container RSS?
> 
> The people who were trying to do memory hot-unplug basically all
> stopped, waiting for these patches, or something similar, to solve the
> fragmentation problem.  Our last working set of patches built on top
> of an earlier version of Mel's list-based solution.
> 
> >
> >Also: how are we going to do this in virtualized environments? Usually the 
> >people who care abotu memory hotunplug are exactly the same people who 
> >also care (or claim to care, or _will_ care) about virtualization.
> 
> Yes, we are.  And we are very much in favor of these patches.  At last
> year's OLS, developers from IBM, HP, and Xen coauthored a paper titled
> "Resizing Memory with Balloons and Hotplug".
> http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf  Our
> conclusion was that ballooning is simply not good enough and we need
> memory hot-unplug.  Here is a quote from the article I find relevant
> to today's discussion:

But if you don't require a lot of higher order allocations anyway, then
guest fragmentation caused by ballooning doesn't seem like much of a problem.

If you need higher order allocations, then ballooning is bad because of
fragmentation, so you need memory unplug, so you need higher order
allocations. Goto 1.

Ballooning probably does skew memory management stats and watermarks, but
that's just because it is implemented as a module. A couple of hooks
should be enough to allow things to be adjusted?
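
For illustration, here is a schematic sketch of the sort of hook meant
(not a real driver; the names and the totalram_pages adjustment are
just one plausible way to tell the MM what the balloon holds):

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/swap.h>
#include <linux/errno.h>

static LIST_HEAD(ballooned_pages);

/* Inflate: steal pages from the guest and (the hook) tell the MM they
 * are gone, so stats and watermark calculations stop counting them. */
static int balloon_inflate(unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NOWARN);

		if (!page)
			return -ENOMEM;
		list_add(&page->lru, &ballooned_pages);
		/* hand the pfn back to the hypervisor here */
		totalram_pages--;
	}
	return 0;
}

/* Deflate: reclaim pfns from the hypervisor and give them back. */
static void balloon_deflate(unsigned long nr_pages)
{
	while (nr_pages-- && !list_empty(&ballooned_pages)) {
		struct page *page = list_entry(ballooned_pages.next,
					       struct page, lru);

		list_del(&page->lru);
		totalram_pages++;
		__free_page(page);
	}
}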


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05  3:21       ` Nick Piggin
@ 2007-03-05 15:20         ` Joel Schopp
  2007-03-05 16:01           ` Nick Piggin
  2007-05-03  8:49           ` Andy Whitcroft
  0 siblings, 2 replies; 99+ messages in thread
From: Joel Schopp @ 2007-03-05 15:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

> But if you don't require a lot of higher order allocations anyway, then
> guest fragmentation caused by ballooning doesn't seem like much of a problem.

If you only need allocations of 1 page and smaller then no, it's not a
problem.  As soon as you go above that it will be.  You don't need to go all
the way up to MAX_ORDER size to see an impact; it just gets increasingly
severe as you move away from 1 page and towards MAX_ORDER.

> 
> If you need higher order allocations, then ballooning is bad because of
> fragmentation, so you need memory unplug, so you need higher order
> allocations. Goto 1.

Yes, it's a closed loop.  But hotplug isn't the only thing that needs
higher order allocations.  In fact it's pretty far down the list.  I
look at it like this: a lot of users need high order allocations for
better performance and for things like on-demand hugepages.  As a bonus
you get memory hot-remove.

> >Ballooning probably does skew memory management stats and watermarks, but
> that's just because it is implemented as a module. A couple of hooks
> should be enough to allow things to be adjusted?

That is a good idea independent of the current discussion.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 15:20         ` Joel Schopp
@ 2007-03-05 16:01           ` Nick Piggin
  2007-03-05 16:45             ` Joel Schopp
  2007-05-03  8:49           ` Andy Whitcroft
  1 sibling, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2007-03-05 16:01 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

On Mon, Mar 05, 2007 at 09:20:10AM -0600, Joel Schopp wrote:
> >But if you don't require a lot of higher order allocations anyway, then
> >guest fragmentation caused by ballooning doesn't seem like much of a problem.
> 
> If you only need allocations of 1 page and smaller then no, it's not a
> problem.  As soon as you go above that it will be.  You don't need to go all
> the way up to MAX_ORDER size to see an impact; it just gets increasingly
> severe as you move away from 1 page and towards MAX_ORDER.

We allocate order-1 and order-2 pages for stuff without too much trouble.

> >If you need higher order allocations, then ballooning is bad because of
> >fragmentation, so you need memory unplug, so you need higher order
> >allocations. Goto 1.
> 
> Yes, it's a closed loop.  But hotplug isn't the only one that needs higher 
> order allocations.  In fact it's pretty far down the list.  I look at it 
> like this, a lot of users need high order allocations for better 
> performance and things like on-demand hugepages.  As a bonus you get memory 
> hot-remove.

on-demand hugepages could be done better anyway by having the hypervisor
defrag physical memory and provide some way for the guest to ask for a
hugepage, no?

> >Ballooning probably does skew memory management stats and watermarks, but
> >that's just because it is implemented as a module. A couple of hooks
> >should be enough to allow things to be adjusted?
> 
> That is a good idea independent of the current discussion.

Well it shouldn't be too difficult. If you cc linux-mm and/or me with
any thoughts or requirements then I could try to help with it.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 16:01           ` Nick Piggin
@ 2007-03-05 16:45             ` Joel Schopp
  0 siblings, 0 replies; 99+ messages in thread
From: Joel Schopp @ 2007-03-05 16:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

>> If you only need allocations of 1 page and smaller then no, it's not a
>> problem.  As soon as you go above that it will be.  You don't need to go all
>> the way up to MAX_ORDER size to see an impact; it just gets increasingly
>> severe as you move away from 1 page and towards MAX_ORDER.
> 
> We allocate order-1 and order-2 pages for stuff without too much trouble.

The question I want answered is: where do you draw the line on what is
acceptable to allocate as a single contiguous block?

1 page?  8 pages?  256 pages?  4096 pages?  Obviously 1 page works
fine.  With a 4K page size and a 16MB MAX_ORDER block, a 4096-page
allocation (16MB / 4K) is theoretically supported, but doesn't work
under almost any circumstances (unless you use Mel's patches).

> on-demand hugepages could be done better anyway by having the hypervisor
> defrag physical memory and provide some way for the guest to ask for a
> hugepage, no?

Unless you break the 1:1 virt-phys mapping, it doesn't matter whether
the hypervisor can defrag this for you: the kernel will have the
physical address cached away somewhere and will expect the data not to
move.

I'm a big fan of making this somebody else's problem, and the
hypervisor would be a good place.  I just can't figure out how to
actually do it at that layer without changing Linux in a way that is
unacceptable to the community at large.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2007-03-02  5:13         ` Jeremy Fitzhardinge
@ 2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 99+ messages in thread
From: Paul Mackerras @ 2007-03-06  4:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Andrew Morton, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds writes:

> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.

We don't have that luxury on IBM System p machines, where the
hypervisor manages memory in much larger units than a page.  Typically
the size of memory block that the hypervisor uses to manage memory is
16MB or more -- which makes sense from the point of view that if the
hypervisor had to manage individual pages, it would end up adding a
lot more overhead than it does.

Paul.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 15:20         ` Joel Schopp
  2007-03-05 16:01           ` Nick Piggin
@ 2007-05-03  8:49           ` Andy Whitcroft
  1 sibling, 0 replies; 99+ messages in thread
From: Andy Whitcroft @ 2007-05-03  8:49 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Mel Gorman, clameter,
	mingo, arjan, mbligh, linux-mm, linux-kernel

Joel Schopp wrote:
>> But if you don't require a lot of higher order allocations anyway, then
>> guest fragmentation caused by ballooning doesn't seem like much of a problem.
> 
> If you only need allocations of 1 page and smaller then no, it's not a
> problem.  As soon as you go above that it will be.  You don't need to go all
> the way up to MAX_ORDER size to see an impact; it just gets increasingly
> severe as you move away from 1 page and towards MAX_ORDER.

Yep, the allocator treats anything below order-4 as "easy to obtain",
in that it is willing to wait indefinitely for one to appear; above
that, allocations are not expected to succeed.  With random placement
the chances of finding a free contiguous block tend to 0 pretty quickly
as order increases.  That was the motivation for the linear
reclaim/lumpy reclaim patch series, which do make it significantly more
possible to get higher orders.  However, very high orders such as we
see with huge pages are still almost impossible to obtain without
placement controls in place.
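
For reference, that order-4 threshold comes from the allocator's retry
decision; a paraphrased, self-contained sketch of it (illustrative flag
values, not a verbatim excerpt of mm/page_alloc.c):

#define __GFP_REPEAT	0x400u	/* caller asks to retry hard */
#define __GFP_NOFAIL	0x800u	/* caller says failure is not an option */
#define __GFP_NORETRY	0x1000u	/* caller wants to fail fast */

/* After direct reclaim fails: low orders loop back and try again
 * indefinitely; order >= 4 simply fails unless the caller insists. */
static int should_retry_alloc(unsigned int gfp_mask, unsigned int order)
{
	if (gfp_mask & __GFP_NORETRY)
		return 0;
	if (order <= 3)
		return 1;
	return (gfp_mask & (__GFP_REPEAT | __GFP_NOFAIL)) != 0;
}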

-apw


^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2007-05-03  8:49 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-01 10:12 The performance and behaviour of the anti-fragmentation related patches Mel Gorman
2007-03-02  0:09 ` Andrew Morton
2007-03-02  0:44   ` Linus Torvalds
2007-03-02  1:52     ` Balbir Singh
2007-03-02  3:44       ` Linus Torvalds
2007-03-02  3:59         ` Andrew Morton
2007-03-02  5:11           ` Linus Torvalds
2007-03-02  5:50             ` KAMEZAWA Hiroyuki
2007-03-02  6:15               ` Paul Mundt
2007-03-02 17:01                 ` Mel Gorman
2007-03-02 16:20             ` Mark Gross
2007-03-02 17:07               ` Andrew Morton
2007-03-02 17:35                 ` Mark Gross
2007-03-02 18:02                   ` Andrew Morton
2007-03-02 19:02                     ` Mark Gross
2007-03-02 17:16               ` Linus Torvalds
2007-03-02 18:45                 ` Mark Gross
2007-03-02 19:03                   ` Linus Torvalds
2007-03-02 23:58                 ` Martin J. Bligh
2007-03-02  4:18         ` Balbir Singh
2007-03-02  5:13         ` Jeremy Fitzhardinge
2007-03-06  4:16         ` Paul Mackerras
2007-03-02 16:58     ` Mel Gorman
2007-03-02 17:05     ` Joel Schopp
2007-03-05  3:21       ` Nick Piggin
2007-03-05 15:20         ` Joel Schopp
2007-03-05 16:01           ` Nick Piggin
2007-03-05 16:45             ` Joel Schopp
2007-05-03  8:49           ` Andy Whitcroft
2007-03-02  1:39   ` Balbir Singh
2007-03-02  2:34   ` KAMEZAWA Hiroyuki
2007-03-02  3:05   ` Christoph Lameter
2007-03-02  3:57     ` Nick Piggin
2007-03-02  4:06       ` Christoph Lameter
2007-03-02  4:21         ` Nick Piggin
2007-03-02  4:31           ` Christoph Lameter
2007-03-02  5:06             ` Nick Piggin
2007-03-02  5:40               ` Christoph Lameter
2007-03-02  5:49                 ` Nick Piggin
2007-03-02  5:53                   ` Christoph Lameter
2007-03-02  6:08                     ` Nick Piggin
2007-03-02  6:19                       ` Christoph Lameter
2007-03-02  6:29                         ` Nick Piggin
2007-03-02  6:51                           ` Christoph Lameter
2007-03-02  7:03                             ` Andrew Morton
2007-03-02  7:19                             ` Nick Piggin
2007-03-02  7:44                               ` Christoph Lameter
2007-03-02  8:12                                 ` Nick Piggin
2007-03-02  8:21                                   ` Christoph Lameter
2007-03-02  8:38                                     ` Nick Piggin
2007-03-02 17:09                                       ` Christoph Lameter
2007-03-04  1:26                                   ` Rik van Riel
2007-03-04  1:51                                     ` Andrew Morton
2007-03-04  1:58                                       ` Rik van Riel
2007-03-02  5:50               ` Christoph Lameter
2007-03-02  4:29         ` Andrew Morton
2007-03-02  4:33           ` Christoph Lameter
2007-03-02  4:58             ` Andrew Morton
2007-03-02  4:20       ` Paul Mundt
2007-03-02 13:50   ` Arjan van de Ven
2007-03-02 15:29   ` Rik van Riel
2007-03-02 16:58     ` Andrew Morton
2007-03-02 17:09       ` Mel Gorman
2007-03-02 17:23       ` Christoph Lameter
2007-03-02 17:35         ` Andrew Morton
2007-03-02 17:43           ` Rik van Riel
2007-03-02 18:06             ` Andrew Morton
2007-03-02 18:15               ` Christoph Lameter
2007-03-02 18:23                 ` Andrew Morton
2007-03-02 18:23                 ` Rik van Riel
2007-03-02 19:31                   ` Christoph Lameter
2007-03-02 19:40                     ` Rik van Riel
2007-03-02 21:12                   ` Bill Irwin
2007-03-02 21:19                     ` Rik van Riel
2007-03-02 21:52                       ` Andrew Morton
2007-03-02 22:03                         ` Rik van Riel
2007-03-02 22:22                           ` Andrew Morton
2007-03-02 22:34                             ` Rik van Riel
2007-03-02 22:51                               ` Martin Bligh
2007-03-02 22:54                                 ` Rik van Riel
2007-03-02 23:28                                   ` Martin J. Bligh
2007-03-03  0:24                                     ` Andrew Morton
2007-03-02 22:52                               ` Chuck Ebbert
2007-03-02 22:59                               ` Andrew Morton
2007-03-02 23:20                                 ` Rik van Riel
2007-03-03  1:40                                 ` William Lee Irwin III
2007-03-03  1:58                                   ` Andrew Morton
2007-03-03  3:55                                     ` William Lee Irwin III
2007-03-03  0:33                             ` William Lee Irwin III
2007-03-03  0:54                               ` Andrew Morton
2007-03-03  3:15                               ` Christoph Lameter
2007-03-03  4:19                                 ` William Lee Irwin III
2007-03-03 17:16                                 ` Martin J. Bligh
2007-03-03 17:50                                   ` Christoph Lameter
2007-03-02 20:59               ` Bill Irwin
2007-03-02  1:52 ` Bill Irwin
2007-03-02 10:38   ` Mel Gorman
2007-03-02 16:31     ` Joel Schopp
2007-03-02 21:37       ` Bill Irwin
