linux-kernel.vger.kernel.org archive mirror
* RE: [RFC] Avoiding fragmentation through different allocator
@ 2005-01-17 16:48 Tolentino, Matthew E
  2005-01-19 13:17 ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Tolentino, Matthew E @ 2005-01-17 16:48 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Memory Management List, Linux Kernel Mailing List

>I considered adding a new zone but I felt it would be a massive job for
>what I considered to be a simple problem. I think my approach is nice
>and isolated within the allocator itself and will be less likely to
>affect other code.

Just for clarity, I prefer this approach over adding zones, 
hence my pursuit of something akin to it.  

>One possibility is that we could say that the UserRclm and KernRclm pools
>are always eligible for hotplug and have hotplug banks only satisfy those
>allocations, pushing KernNonRclm allocations to fixed banks. How is it
>currently known if a bank of memory is hotplug? Is there a node for each
>hotplug bank? If yes, we could flag those nodes to only satisfy UserRclm
>and KernRclm allocations and force fallback to other nodes.

The hardware/firmware has to tell the kernel in some way.  In 
my case it is ACPI that delineates between regions that may be 
removed.  No, there isn't a node for each bank of hot-plug 
memory.  The reason I was pursuing this was to be able to 
avoid coarse granularity distinctions like that.  Depending
on the platform, ACPI may provide only memory ranges (via
memory devices detailed in the namespace) for single node
systems or group memory ranges according to nodes via the 
ACPI abstraction called containers.  It's my understanding
that containers then have some mapping to nodes.  

>The danger is
>that allocations would fail because non-hotplug banks were already full
>and pageout would not happen because the watermarks were satisfied.

Which implies a potential need for balancing between user/kernel
lists, no?    

>(Bear in mind I can't test hotplug-related issues due to lack 
>of suitable
>hardware)

Sure.  I can, although most of this has been done via emulation 
initially and then tested on real hardware soon afterwards.

>If you have already posted a version of the patch (you have 
>feedback so I
>guess it's there somewhere), can you send me a link to the thread where
>you introduced your approach? It's possible that we just need 
>to merge the
>ideas.

No, I hadn't posted it yet due to chasing a bug.  However, perhaps 
now I'll instead focus on adding the necessary hotplug support 
into your patch, hence merging the hotplug requirements/ideas?

>It's because I consider all 2^MAX_ORDER pages in a zone to be equal,
>whereas I'm guessing you don't. Until they are split, there is nothing
>special about them. It is only when it is split that I want it reserved
>for a purpose.
>
>However, if we knew there were blocks that were hot-pluggable, we could
>just have a hotplug-global and non-hotplug-global pool. If 
>it's a UserRclm
>or KernRclm allocation, split from hotplug-global, otherwise use
>non-hotplug-global. It'd increase the memory requirements of 
>the patch a
>bit though.

Exactly.  Perhaps this could just be isolated via the 
CONFIG_MEMORY_HOTPLUG build option, thus not increasing the memory
requirements in the common case...

matt



* RE: [RFC] Avoiding fragmentation through different allocator
  2005-01-17 16:48 [RFC] Avoiding fragmentation through different allocator Tolentino, Matthew E
@ 2005-01-19 13:17 ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-19 13:17 UTC (permalink / raw)
  To: Tolentino, Matthew E
  Cc: Linux Memory Management List, Linux Kernel Mailing List

On Mon, 17 Jan 2005, Tolentino, Matthew E wrote:

> >I considered adding a new zone but I felt it would be a massive job for
> >what I considered to be a simple problem. I think my approach is nice
> >and isolated within the allocator itself and will be less likely to
> >affect other code.
>
> Just for clarity, I prefer this approach over adding zones,
> hence my pursuit of something akin to it.
>

Ok, cool.

> >One possibility is that we could say that the UserRclm and KernRclm pools
> >are always eligible for hotplug and have hotplug banks only satisfy those
> >allocations, pushing KernNonRclm allocations to fixed banks. How is it
> >currently known if a bank of memory is hotplug? Is there a node for each
> >hotplug bank? If yes, we could flag those nodes to only satisfy UserRclm
> >and KernRclm allocations and force fallback to other nodes.
>
> The hardware/firmware has to tell the kernel in some way.  In
> my case it is ACPI that delineates between regions that may be
> removed.  No, there isn't a node for each bank of hot-plug
> memory.  The reason I was pursuing this was to be able to
> avoid coarse granularity distinctions like that.

As there is not a node for each hotplug bank, this has to happen at the
zone level. Architectures have the option of defining their own
memmap_init() although only ia64 takes advantage of it. If we wanted to be
able to identify hotplug pages in an independent manner, we would
implement memmap_init() for hotplug and fill zone->free_area_usemap
accordingly. Currently the bitmap in there has two bits for each
2^MAX_ORDER block that look like:

00 = Kernel non-reclaimable
10 = Kernel reclaimable
01 = User reclaimable

So, we could say 11 is for hotplug memory. Alternatively, if it is possible
to have a system that consists entirely of hotplug memory, we could add a
third bit. As that is only one extra bit per 2^MAX_ORDER pages in the
system, it would not be a big chunk of memory.
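
Roughly, reading an entry would look something like this (a sketch only,
not code from the patch; the helper name follows get_pageblock_type()
mentioned elsewhere in the thread, and the zone_start_pfn offset and
test_bit() usage are assumptions on my part):

  /* Read the two usemap bits recording what a 2^MAX_ORDER block was
   * last split for: 00 kernel non-reclaimable, 10 kernel reclaimable,
   * 01 user reclaimable, with 11 proposed above for hotplug */
  static inline int get_block_type(struct zone *zone, unsigned long pfn)
  {
          unsigned long idx = ((pfn - zone->zone_start_pfn) >> MAX_ORDER) * 2;
          int type = 0;

          if (test_bit(idx, zone->free_area_usemap))
                  type |= 2;      /* first bit of the pair */
          if (test_bit(idx + 1, zone->free_area_usemap))
                  type |= 1;      /* second bit of the pair */
          return type;
  }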

The question is what to do with that information then. We can't just say
that User pages go to hotplug regions as that will introduce two problems:
one of balancing, and one of what happens when an unreclaimable page ends
up in a bank we want to remove without page migration in place.

> >The danger is
> >that allocations would fail because non-hotplug banks were already full
> >and pageout would not happen because the watermarks were satisfied.
>
> Which implies a potential need for balancing between user/kernel
> lists, no?
>

If we're not careful, yes, and I have a gut feeling that we should not
need to balance anything for hotplug.

> >If you have already posted a version of the patch (you have
> >feedback so I
> >guess it's there somewhere), can you send me a link to the thread where
> >you introduced your approach? It's possible that we just need
> >to merge the
> >ideas.
>
> No, I hadn't posted it yet due to chasing a bug.  However, perhaps
> now I'll instead focus on adding the necessary hotplug support
> into your patch, hence merging the hotplug requirements/ideas?
>

I've no problem with that. It's just a case of how we use the information
exactly.

> >It's because I consider all 2^MAX_ORDER pages in a zone to be equal,
> >whereas I'm guessing you don't. Until they are split, there is nothing
> >special about them. It is only when it is split that I want it reserved
> >for a purpose.
> >
> >However, if we knew there were blocks that were hot-pluggable, we could
> >just have a hotplug-global and non-hotplug-global pool. If
> >it's a UserRclm
> >or KernRclm allocation, split from hotplug-global, otherwise use
> >non-hotplug-global. It'd increase the memory requirements of
> >the patch a
> >bit though.
>
> Exactly.  Perhaps this could just be isolated via the
> CONFIG_MEMORY_HOTPLUG build option, thus not increasing the memory
> requirements in the common case...
>

I think so. The changes would be to remove pages from the global pool via
macros rather than directly, and to have hotplug and non-hotplug
versions of those macros.
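
As a rough sketch of what I mean (free_area_global is in the patch; the
_hotplug variant and the ALLOC_* names here are only illustrative):

  #ifdef CONFIG_MEMORY_HOTPLUG
  /* UserRclm and KernRclm requests split blocks from the hotplug pool */
  #define global_pool(zone, type)                                       \
          (((type) == ALLOC_USERRCLM || (type) == ALLOC_KERNRCLM) ?     \
                  &(zone)->free_area_global_hotplug :                   \
                  &(zone)->free_area_global)
  #else
  /* No hotplug configured: a single global pool, no extra memory used */
  #define global_pool(zone, type)    (&(zone)->free_area_global)
  #endif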

We just need to think more about what happens when a kernel allocation comes
along, fixed memory is depleted, but there is plenty of hotplug memory
left.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-17 23:08       ` Yasunori Goto
@ 2005-01-19 13:45         ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-19 13:45 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Tolentino, Matthew E, Linux Memory Management List,
	Linux Kernel Mailing List

On Mon, 17 Jan 2005, Yasunori Goto wrote:

> > Is there an architecture-independent way of finding this out?
>
>   No. At least, I have no idea. :-(
>

In another mail to Matthew, I suggested that the zone->free_area_usemap
could be used to track hotplug blocks of pages by either using a
bit-pattern of 11 for hotplug pages or adding a third bit.

get_pageblock_type() could then be taught to identify a hotplug region
within page_alloc.c at least. If the information is needed outside the
allocator, it will need more work though.

> > What's the current attitude towards adding a new zone? I felt there would be
> > resistance as a new zone would affect a lot of code paths and be yet
> > another zone that needed balancing. For example, is there a HIGHMEM
> > version of ZONE_REMOVABLE, or could normal and highmem be in this zone?
>
> Yes. In my current memory hotplug patch, Removable is like Highmem.
>  ( <http://sourceforge.net/mailarchive/forum.php?forum_id=223>
>      It is group B of "Hot Add patches for NUMA" )
>
> I tried to make a new removable zone which could be combined with normal
> and dma before this, but it needed too much work, as you said, so I gave
> it up. I heard Matt-san has some ideas for it, so I'm looking forward to
> seeing it.
>

I'm taking a look through these patches just so I know what the other
approaches were.

> > > I agree that dividing the per-cpu caches is not a good way.
> > > But if a kernel-nonreclaimable allocation uses the UserRclm pool,
> > > its removable memory bank will suddenly become harder to remove.
> > > Is that correct? If so, it is not good for memory hotplug.
> > > Hmmmm.
> > >
> >
> > It is correct. However, this will only happen in low-memory conditions.
> > For a kernel-nonreclaimable allocation to use the userrclm pool, three
> > conditions have to be met;
> >
> > 1. Kernel-nonreclaimable pool has no pages
> > 2. There are no global 2^MAX_ORDER pages
> > 3. Kern-reclaimable pool has no pages
>
> I suppose that if this patch has been running for a year,
> an unlucky case might occur. Enterprise systems will probably not
> allow that, so I will try disabling fallback for KernNoRclm.
>

I can almost guarantee that will fail in low-memory conditions. Before I
implemented proper fallback logic, I used to get oopses in low-memory
conditions. I found it was because KernNoRclm had nowhere to fall back to,
but there was loads of free memory so kswapd was not being woken.

So, just disabling fallback is not the right answer.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-16 16:21     ` Mel Gorman
@ 2005-01-17 23:08       ` Yasunori Goto
  2005-01-19 13:45         ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Yasunori Goto @ 2005-01-17 23:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tolentino, Matthew E, Linux Memory Management List,
	Linux Kernel Mailing List

> > There are 2 types of memory hotplug.
> >
> > a) SMP machine case
> >    Some part of memory will be added and removed.
> >
> > b) NUMA machine case
> >    The whole of a node will be able to be removed and added.
> >    However, if a block of memory like a DIMM is broken and disabled,
> >    it is close to a).
> >
> > How to know where the hotpluggable bank is depends on the
> > platform/architecture.
> >  ex) Asking ACPI.
> >      Just node0 becomes unremovable, and the other nodes are removable.
> >      etc...
> >
> 
> Is there an architecture-independent way of finding this out?

  No. At least, I have no idea. :-(


> > In your current patch, the initial attribute of all pages is NoRclm.
> > But if your patches had an interface to decide where Rclm will be for
> > each arch/platform, it might be good.
> >
> 
> It doesn't have an API as such. In page_alloc.c, there is a function
> get_pageblock_type() that returns what type of allocation the block of
> memory is being used for. There is no guarantee there are only those types
> of allocations there though.

OK. I will write a patch with a function to set it for some arch/platforms.

> What's the current attitude towards adding a new zone? I felt there would be
> resistance as a new zone would affect a lot of code paths and be yet
> another zone that needed balancing. For example, is there a HIGHMEM
> version of ZONE_REMOVABLE, or could normal and highmem be in this zone?

Yes. In my current memory hotplug patch, Removable is like Highmem.
 ( <http://sourceforge.net/mailarchive/forum.php?forum_id=223>
     It is group B of "Hot Add patches for NUMA" )

I tried to make a new removable zone which could be combined with normal
and dma before this, but it needed too much work, as you said, so I gave
it up. I heard Matt-san has some ideas for it, so I'm looking forward to
seeing it.

> > I agree that dividing the per-cpu caches is not a good way.
> > But if a kernel-nonreclaimable allocation uses the UserRclm pool,
> > its removable memory bank will suddenly become harder to remove.
> > Is that correct? If so, it is not good for memory hotplug.
> > Hmmmm.
> >
> 
> It is correct. However, this will only happen in low-memory conditions.
> For a kernel-nonreclaimable allocation to use the userrclm pool, three
> conditions have to be met;
> 
> 1. Kernel-nonreclaimable pool has no pages
> 2. There are no global 2^MAX_ORDER pages
> 3. Kern-reclaimable pool has no pages

I suppose that if this patch has been running for a year,
an unlucky case might occur. Enterprise systems will probably not
allow that, so I will try disabling fallback for KernNoRclm.

Thanks.

-- 
Yasunori Goto <ygoto at us.fujitsu.com>




* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-16  4:03   ` Yasunori Goto
@ 2005-01-16 16:21     ` Mel Gorman
  2005-01-17 23:08       ` Yasunori Goto
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2005-01-16 16:21 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Tolentino, Matthew E, Linux Memory Management List,
	Linux Kernel Mailing List

On Sat, 15 Jan 2005, Yasunori Goto wrote:

> There are 2 types of memory hotplug.
>
> a) SMP machine case
>    Some part of memory will be added and removed.
>
> b) NUMA machine case
>    The whole of a node will be able to be removed and added.
>    However, if a block of memory like a DIMM is broken and disabled,
>    it is close to a).
>
> How to know where the hotpluggable bank is depends on the
> platform/architecture.
>  ex) Asking ACPI.
>      Just node0 becomes unremovable, and the other nodes are removable.
>      etc...
>

Is there an architecture-independent way of finding this out?

> In your current patch, the initial attribute of all pages is NoRclm.
> But if your patches had an interface to decide where Rclm will be for
> each arch/platform, it might be good.
>

It doesn't have an API as such. In page_alloc.c, there is a function
get_pageblock_type() that returns what type of allocation the block of
memory is being used for. There is no guarantee there are only those types
of allocations there though.

>
> > The danger is
> > that allocations would fail because non-hotplug banks were already full
> > and pageout would not happen because the watermarks were satisfied.
>
> In this case, if the user can change the attribute of a Rclm area to
> NoRclm, it is better than nothing.
> In the hotplug patches, there will be a new zone, ZONE_REMOVABLE.

What's the current attitude towards adding a new zone? I felt there would be
resistance as a new zone would affect a lot of code paths and be yet
another zone that needed balancing. For example, is there a HIGHMEM
version of ZONE_REMOVABLE, or could normal and highmem be in this zone?

> But in this patch, changing this attribute is a little bit difficult.
> (First remove the pages from the free_area of the removable zone,
>  then add them to the free_area of the un-removable zone.)
> This change is probably easier in your patch.
>

I think the difficulty would be similar because it's still Move Pages From
A To B.

> I agree that dividing the per-cpu caches is not a good way.
> But if a kernel-nonreclaimable allocation uses the UserRclm pool,
> its removable memory bank will suddenly become harder to remove.
> Is that correct? If so, it is not good for memory hotplug.
> Hmmmm.
>

It is correct. However, this will only happen in low-memory conditions.
For a kernel-nonreclaimable allocation to use the userrclm pool, three
conditions have to be met;

1. Kernel-nonreclaimable pool has no pages
2. There are no global 2^MAX_ORDER pages
3. Kern-reclaimable pool has no pages

This is because of the fallback order. If you were interested in testing a
particular workload, you could apply the patch, run a workload and then
look at /proc/buddyinfo. There are three counters at the end of the
output like this;

KernNoRclm Fallback count: 0
KernRclm   Fallback count: 0
UserRclm   Fallback count: 425

A fallback can get counted twice. For example, if KernNoRclm falls back to
KernRclm and then UserRclm, it's considered to be two fallbacks.
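
For reference, the fallback walk for a kernel non-reclaimable request is
roughly the following (a sketch only; the array and ALLOC_* names are
illustrative, only the order matches the three conditions above):

  /* Own pool first, then a fresh global 2^MAX_ORDER block, then the
   * kernel-reclaimable pool, and only then the user-reclaimable pool */
  static const int kernnorclm_fallback[] = {
          ALLOC_KERNNORCLM,
          ALLOC_GLOBAL,
          ALLOC_KERNRCLM,
          ALLOC_USERRCLM,
  };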

I also have (yet another) tool that is able to track exactly where each
type of allocation is. If you wanted to know precisely where each page is
and see how many non-reclaimable pages are ending up in the wrong place,
the tool could be modified to do that.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-12 23:12 ` Mel Gorman
  2005-01-13  8:02   ` Hirokazu Takahashi
@ 2005-01-16  4:03   ` Yasunori Goto
  2005-01-16 16:21     ` Mel Gorman
  1 sibling, 1 reply; 19+ messages in thread
From: Yasunori Goto @ 2005-01-16  4:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tolentino, Matthew E, Linux Memory Management List,
	Linux Kernel Mailing List

Hello.

I'm also very interested in your patches, because I'm working for
memory hotplug too.

> One possibility is that we could say that the UserRclm and KernRclm pools
> are always eligible for hotplug and have hotplug banks only satisfy those
> allocations, pushing KernNonRclm allocations to fixed banks. How is it
> currently known if a bank of memory is hotplug? Is there a node for each
> hotplug bank? If yes, we could flag those nodes to only satisfy UserRclm
> and KernRclm allocations and force fallback to other nodes.

There are 2 types of memory hotplug.

a) SMP machine case
   Some part of memory will be added and removed.

b) NUMA machine case
   The whole of a node will be able to be removed and added.
   However, if a block of memory like a DIMM is broken and disabled,
   it is close to a).

How to know where the hotpluggable bank is depends on the
platform/architecture.
 ex) Asking ACPI.
     Just node0 becomes unremovable, and the other nodes are removable.
     etc...

In your current patch, the initial attribute of all pages is NoRclm.
But if your patches had an interface to decide where Rclm will be for
each arch/platform, it might be good.


> The danger is
> that allocations would fail because non-hotplug banks were already full
> and pageout would not happen because the watermarks were satisfied.

In this case, if the user can change the attribute of a Rclm area to
NoRclm, it is better than nothing.
In the hotplug patches, there will be a new zone, ZONE_REMOVABLE.
But in this patch, changing this attribute is a little bit difficult.
(First remove the pages from the free_area of the removable zone,
 then add them to the free_area of the un-removable zone.)
This change is probably easier in your patch.


> (Bear in mind I can't test hotplug-related issues due to lack of suitable
> hardware)

I also don't have a real hotplug machine now. ;-)
I just use software emulation.

> > It looks like you left the per_cpu_pages as-is.  Did you
> > consider separating those as well to reflect kernel vs. user
> > pools?
> >
> 
> I kept the per-cpu caches for UserRclm-style allocations only because
> otherwise a Kernel-nonreclaimable allocation could easily be taken from a
> UserRclm pool.

I agree that dividing the per-cpu caches is not a good way.
But if a kernel-nonreclaimable allocation uses the UserRclm pool,
its removable memory bank will suddenly become harder to remove.
Is that correct? If so, it is not good for memory hotplug.
Hmmmm.

Anyway, thank you for your patch. It is very interesting.

Bye.

-- 
Yasunori Goto <ygoto at us.fujitsu.com>




* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-15  1:31     ` William Lee Irwin III
@ 2005-01-15 19:19       ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-15 19:19 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Marcelo Tosatti, Linux Memory Management List, Linux Kernel Mailing List

On Fri, 14 Jan 2005, William Lee Irwin III wrote:

> On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
> >> I'd expect to do better with kernel/user discrimination only, having
> >> address-ordering biases in opposite directions for each case.
>
> On Fri, Jan 14, 2005 at 07:42:18PM -0200, Marcelo Tosatti wrote:
> > What do you mean by "address-ordering biases in opposite directions
> > for each case"?
> > Do you mean having each case allocate from the top and bottom of the
> > free list, respectively, and in opposite address directions? What do you
> > gain from that?
> > And what does that mean during a long period of VM stress?
>
> It's one of the standard anti-fragmentation tactics. The large free
> areas come from the middle, address ordering disposes of holes in the
> used areas, and the areas at opposite ends reflect expected lifetimes.
>
> It's more useful for cases where there is not an upper bound on the
> size of an allocation (or power-of-two blocksizes). On second thought,
> Mel's approach exploits both the bound and the power-of-two restriction
> advantageously.
>

I think so too and I reckon I have the figures to prove it. Patches with
test tools and figures are on the way. Working at the moment on running
the last of the tests and getting the patches in order.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-14 21:42   ` Marcelo Tosatti
@ 2005-01-15  1:31     ` William Lee Irwin III
  2005-01-15 19:19       ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: William Lee Irwin III @ 2005-01-15  1:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
>> I'd expect to do better with kernel/user discrimination only, having
>> address-ordering biases in opposite directions for each case.

On Fri, Jan 14, 2005 at 07:42:18PM -0200, Marcelo Tosatti wrote:
> What do you mean by "address-ordering biases in opposite directions
> for each case"?
> Do you mean having each case allocate from the top and bottom of the
> free list, respectively, and in opposite address directions? What do you
> gain from that?
> And what does that mean during a long period of VM stress?

It's one of the standard anti-fragmentation tactics. The large free
areas come from the middle, address ordering disposes of holes in the
used areas, and the areas at opposite ends reflect expected lifetimes.

It's more useful for cases where there is not an upper bound on the
size of an allocation (or power-of-two blocksizes). On second thought,
Mel's approach exploits both the bound and the power-of-two restriction
advantageously.
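
As a toy illustration of the tactic (assuming an address-ordered free
list; this is not from any posted patch):

  /* Kernel requests take the lowest-address block and user requests the
   * highest, so the two lifetime classes grow towards each other from
   * opposite ends and large free areas survive in the middle. */
  static struct page *biased_pick(struct list_head *sorted_free, int user)
  {
          if (list_empty(sorted_free))
                  return NULL;
          return list_entry(user ? sorted_free->prev : sorted_free->next,
                            struct page, lru);
  }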


-- wli


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-13  7:31 ` William Lee Irwin III
  2005-01-13 10:11   ` Mel Gorman
@ 2005-01-14 21:42   ` Marcelo Tosatti
  2005-01-15  1:31     ` William Lee Irwin III
  1 sibling, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2005-01-14 21:42 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > So... What the patch does. Allocations are divided up into three different
> > types of allocations;
> > UserReclaimable - These are userspace pages that are easily reclaimable. Right
> > 	now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> > 	well as disk-buffer pages into this category. These pages are trivially
> > 	reclaimed by writing the page out to swap or syncing with backing
> > 	storage
> > KernelReclaimable - These are pages allocated by the kernel that are easily
> > 	reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> > 	These type of pages potentially could be reclaimed by dumping the
> > 	caches and reaping the slabs (drastic, but you get the idea). We could
> > 	also add pages into this category that are known to be only required
> > 	for a short time like buffers used with DMA
> > KernelNonReclaimable - These are pages that are allocated by the kernel that
> > 	are not trivially reclaimed. For example, the memory allocated for a
> > 	loaded module would be in this category. By default, allocations are
> > 	considered to be of this type
> 
> I'd expect to do better with kernel/user discrimination only, having
> address-ordering biases in opposite directions for each case.

What do you mean by "address-ordering biases in opposite directions for each case"?

Do you mean having each case allocate from the top and bottom of the free list, respectively,
and in opposite address directions? What do you gain from that?

And what does that mean during a long period of VM stress?



* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-13  8:02   ` Hirokazu Takahashi
@ 2005-01-13 10:27     ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-13 10:27 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: matthew.e.tolentino, linux-mm, linux-kernel

On Thu, 13 Jan 2005, Hirokazu Takahashi wrote:

> Hi Mel,
>
> The global list looks interesting.
>
> > > >Instead of having one global MAX_ORDER-sized array of free
> > > >lists, there are
> > > >three, one for each type of allocation. Finally, there is a
> > > >list of pages of
> > > >size 2^MAX_ORDER which is a global pool of the largest pages
> > > >the kernel deals with.
>
> > > is it so that the pages can
> > > evolve according to system demands (assuming MAX_ORDER sized
> > > chunks are eventually available again)?
> > >
> >
> > Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
> > reserved until the next split.
>
> FYI, MAX_ORDER is huge in some architectures.
> I guess another watermark should be introduced instead of MAX_ORDER.
>

It could be, but remember that the watermark will decide what the largest
non-fragmented block-size will be and I am not sure that is something
architectures really want. i.e. why would an architecture not want to push
to have the largest possible block available?

If they did really want the option, I could add MAX_FRAG_ORDER (ok, bad
name but it's morning) that architectures can optionally define. Then in
the main code, just:

#ifndef MAX_FRAG_ORDER
  #define MAX_FRAG_ORDER MAX_ORDER
#endif

The global lists would then be expected to hold the lists between
MAX_FRAG_ORDER and MAX_ORDER. Would that make sense and would
architectures really want it? If yes, I can code it up.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-13  7:03 ` Matt Mackall
  2005-01-13  7:20   ` Trond Myklebust
@ 2005-01-13 10:22   ` Mel Gorman
  1 sibling, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-13 10:22 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Linux Memory Management List, Linux Kernel Mailing List

On Wed, 12 Jan 2005, Matt Mackall wrote:

> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > I stress-tested this patch very heavily and it never oopsed, so I am
> > confident of its stability. What is left is to look at what the results of
> > this patch were, and I think they look promising in a number of respects. I
> > have graphs that do not translate to text very well, so I'll just point you
> > to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.
>
> This graph is rather hard to comprehend.
>

There are not a lot of ways to illustrate how pages are allocated
throughout an address space. If you have a better suggestion, I can update
the relevant tool.

> > The results were not spectacular but still very interesting. Under heavy
> > stressing (updatedb + 4 simultaneous -j4 kernel compiles with avg load 15)
> > fragmentation is consistently lower than the standard allocator. It could
> > also be a lot better if there was some means of purging caches, userpages
> > and buffers but thats in the future. For the moment, the only real control
> > I had was the buffer pages.
>
> You might stress higher order page allocation with a) 8k stacks turned
> on b) UDP NFS with large read/write.
>

CONFIG_4KSTACKS is not set on my system so I'm already using 8k stacks.
Based on Trond's mail, I checked my network and the MTU is 1500 bytes so I
don't think I can force high-order allocations there.

Also, while I agree this test would be important, I think it'll be way
more important when there is something in place that actually tries to
free up high-order blocks, so there is something to compare against.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-13  7:31 ` William Lee Irwin III
@ 2005-01-13 10:11   ` Mel Gorman
  2005-01-14 21:42   ` Marcelo Tosatti
  1 sibling, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-13 10:11 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Memory Management List, Linux Kernel Mailing List

On Wed, 12 Jan 2005, William Lee Irwin III wrote:

> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > So... What the patch does. Allocations are divided up into three different
> > types of allocations;
> > UserReclaimable - These are userspace pages that are easily reclaimable. Right
> > 	now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> > 	well as disk-buffer pages into this category. These pages are trivially
> > 	reclaimed by writing the page out to swap or syncing with backing
> > 	storage
> > KernelReclaimable - These are pages allocated by the kernel that are easily
> > 	reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> > 	These type of pages potentially could be reclaimed by dumping the
> > 	caches and reaping the slabs (drastic, but you get the idea). We could
> > 	also add pages into this category that are known to be only required
> > 	for a short time like buffers used with DMA
> > KernelNonReclaimable - These are pages that are allocated by the kernel that
> > 	are not trivially reclaimed. For example, the memory allocated for a
> > 	loaded module would be in this category. By default, allocations are
> > 	considered to be of this type
>
> I'd expect to do better with kernel/user discrimination only, having
> address-ordering biases in opposite directions for each case.
>

I thought about this and was not 100% sure. I'll explain how I see it
before I go redo large portions of the patch.

1. Discriminating kernel/user will leave the kernel area fragmented much
worse than my approach. On the systems I tested, I found that a significant
portion of kernel memory was taken up by slabs like buffer_head,
ext3_inode_cache, ntfs_inode_cache, ntfs_big_inode_cache and
radix_tree_node. I did not want to mix these allocations, which
potentially are easy to get rid of, with allocations that are incredibly
difficult.  (Totally aside, I also found that slab caches suffer from
terrible internal fragmentation, sometimes wasting up to 60% of their
memory, but that's a separate problem)

2. Address-ordering biases were also something I looked at early on, but
then figured it didn't matter. If I don't reserve 2^MAX_ORDER blocks,
the system will just get terribly fragmented under memory pressure when
they meet in the middle. If I do reserve, I just end up with a similar
layout to what I have now.

-- 
Mel Gorman


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-12 23:12 ` Mel Gorman
@ 2005-01-13  8:02   ` Hirokazu Takahashi
  2005-01-13 10:27     ` Mel Gorman
  2005-01-16  4:03   ` Yasunori Goto
  1 sibling, 1 reply; 19+ messages in thread
From: Hirokazu Takahashi @ 2005-01-13  8:02 UTC (permalink / raw)
  To: mel; +Cc: matthew.e.tolentino, linux-mm, linux-kernel

Hi Mel,

The global list looks interesting.

> > >Instead of having one global MAX_ORDER-sized array of free
> > >lists, there are
> > >three, one for each type of allocation. Finally, there is a
> > >list of pages of
> > >size 2^MAX_ORDER which is a global pool of the largest pages
> > >the kernel deals with.

> > is it so that the pages can
> > evolve according to system demands (assuming MAX_ORDER sized
> > chunks are eventually available again)?
> >
> 
> Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
> reserved until the next split.

FYI, MAX_ORDER is huge in some architectures.
I guess another watermark should be introduced instead of MAX_ORDER.

Thanks,
Hirokazu Takahashi.


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-12 21:09 Mel Gorman
  2005-01-13  7:03 ` Matt Mackall
@ 2005-01-13  7:31 ` William Lee Irwin III
  2005-01-13 10:11   ` Mel Gorman
  2005-01-14 21:42   ` Marcelo Tosatti
  1 sibling, 2 replies; 19+ messages in thread
From: William Lee Irwin III @ 2005-01-13  7:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> So... What the patch does. Allocations are divided up into three different
> types of allocations;
> UserReclaimable - These are userspace pages that are easily reclaimable. Right
> 	now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> 	well as disk-buffer pages into this category. These pages are trivially
> 	reclaimed by writing the page out to swap or syncing with backing
> 	storage
> KernelReclaimable - These are pages allocated by the kernel that are easily
> 	reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> 	These type of pages potentially could be reclaimed by dumping the
> 	caches and reaping the slabs (drastic, but you get the idea). We could
> 	also add pages into this category that are known to be only required
> 	for a short time like buffers used with DMA
> KernelNonReclaimable - These are pages that are allocated by the kernel that
> 	are not trivially reclaimed. For example, the memory allocated for a
> 	loaded module would be in this category. By default, allocations are
> 	considered to be of this type

I'd expect to do better with kernel/user discrimination only, having
address-ordering biases in opposite directions for each case.

-- wli


* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-13  7:03 ` Matt Mackall
@ 2005-01-13  7:20   ` Trond Myklebust
  2005-01-13 10:22   ` Mel Gorman
  1 sibling, 0 replies; 19+ messages in thread
From: Trond Myklebust @ 2005-01-13  7:20 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List

On 12.01.2005 at 23:03 (-0800), Matt Mackall wrote:

> You might stress higher order page allocation with a) 8k stacks turned
> on b) UDP NFS with large read/write.

   b) Unless your network uses jumbo frames, UDP NFS should not be doing
higher order page allocation.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>



* Re: [RFC] Avoiding fragmentation through different allocator
  2005-01-12 21:09 Mel Gorman
@ 2005-01-13  7:03 ` Matt Mackall
  2005-01-13  7:20   ` Trond Myklebust
  2005-01-13 10:22   ` Mel Gorman
  2005-01-13  7:31 ` William Lee Irwin III
  1 sibling, 2 replies; 19+ messages in thread
From: Matt Mackall @ 2005-01-13  7:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> I stress-tested this patch very heavily and it never oopsed, so I am
> confident of its stability. What is left is to look at what the results of
> this patch were, and I think they look promising in a number of respects. I
> have graphs that do not translate to text very well, so I'll just point you
> to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.

This graph is rather hard to comprehend.

> The results were not spectacular but still very interesting. Under heavy
> stressing (updatedb + 4 simultaneous -j4 kernel compiles with avg load 15)
> fragmentation is consistently lower than the standard allocator. It could
> also be a lot better if there was some means of purging caches, userpages
> and buffers but thats in the future. For the moment, the only real control
> I had was the buffer pages.

You might stress higher order page allocation with a) 8k stacks turned
on b) UDP NFS with large read/write.
 
> Opinions/Feedback?

Looks interesting.

-- 
Mathematics is the supreme nostalgia of our time.


* RE: [RFC] Avoiding fragmentation through different allocator
  2005-01-12 22:45 Tolentino, Matthew E
@ 2005-01-12 23:12 ` Mel Gorman
  2005-01-13  8:02   ` Hirokazu Takahashi
  2005-01-16  4:03   ` Yasunori Goto
  0 siblings, 2 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-12 23:12 UTC (permalink / raw)
  To: Tolentino, Matthew E
  Cc: Linux Memory Management List, Linux Kernel Mailing List

On Wed, 12 Jan 2005, Tolentino, Matthew E wrote:

> Hi Mel!
>

Hi.

First off, I think the differences in our approaches are based on
motivation. I'm tackling fragmentation whereas you were tackling
hotplug. That distinction will be the root of a lot of "why are you doing
that?" type questions. This is a failing on my part as I'm only beginning
to get back to grips with memory-management-related fun.

> >Instead of having one global MAX_ORDER-sized array of free
> >lists, there are
> >three, one for each type of allocation. Finally, there is a
> >list of pages of
> >size 2^MAX_ORDER which is a global pool of the largest pages
> >the kernel deals with.
>
> I've got a patch that I've been testing recently for memory
> hotplug that does nearly the exact same thing - break up the
> management of page allocations based on type - after having
> had a number of conversations with Dave Hansen on this topic.
> I've also prototyped this to use as an alternative to adding
> duplicate zones for delineating between memory that may be
> removed and memory that is not likely to ever be removable.

I considered adding a new zone but I felt it would be a massive job for
what I considered to be a simple problem. I think my approach is nice
and isolated within the allocator itself and will be less likely to
affect other code.

One possibility is that we could say that the UserRclm and KernRclm pools
are always eligible for hotplug and have hotplug banks only satisfy those
allocations, pushing KernNonRclm allocations to fixed banks. How is it
currently known if a bank of memory is hotplug? Is there a node for each
hotplug bank? If yes, we could flag those nodes to only satisfy UserRclm
and KernRclm allocations and force fallback to other nodes. The danger is
that allocations would fail because non-hotplug banks were already full
and pageout would not happen because the watermarks were satisfied.

(Bear in mind I can't test hotplug-related issues due to lack of suitable
hardware)

> I've
> only tested in the context of memory hotplug, but it does
> greatly simplify memory removal within individual zones.   Your
> distinction between areas is pretty cool considering I've only
> distinguished at the coarser granularity of user vs. kernel
> to date.  It would be interesting to throw KernelNonReclaimable
> into the mix as well although I haven't gotten there yet...  ;-)
>

If you have already posted a version of the patch (you have feedback so I
guess it's there somewhere), can you send me a link to the thread where
you introduced your approach? It's possible that we just need to merge the
ideas.

> >Once a 2^MAX_ORDER block of pages is split for a type of
> >allocation, it is
> >added to the free-lists for that type, in effect reserving it.
> >Hence, over
> >time, pages of the related types can be clustered together. This means
> >that if we wanted 2^MAX_ORDER number of pages, we could linearly scan a
> >block of pages allocated for UserReclaimable and page each of them out.
>
> Interesting.  I took a slightly different approach due to some
> known delineations between areas that are defined to be non-
> removable vs. areas that may be removed at some point.  Thus I'm
> only managing two distinct free_area lists currently.  I'm curious
> as to the motivation for having a global MAX_ORDER size list that
> is allocation agnostic initially...

It's because I consider all 2^MAX_ORDER pages in a zone to be equal,
whereas I'm guessing you don't. Until they are split, there is nothing special
about them. It is only when it is split that I want it reserved for a
purpose.

However, if we knew there were blocks that were hot-pluggable, we could
just have a hotplug-global and non-hotplug-global pool. If it's a UserRclm
or KernRclm allocation, split from hotplug-global, otherwise use
non-hotplug-global. It'd increase the memory requirements of the patch a
bit though.

> is it so that the pages can
> evolve according to system demands (assuming MAX_ORDER sized
> chunks are eventually available again)?
>

Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
reserved until the next split.

> It looks like you left the per_cpu_pages as-is.  Did you
> consider separating those as well to reflect kernel vs. user
> pools?
>

I kept the per-cpu caches for UserRclm-style allocations only because
otherwise a Kernel-nonreclaimable allocation could easily be taken from a
UserRclm pool. Over a period of time, the UserRclm pool would be harder to
defragment. Even if we paged out everything and dumped all buffers, there
would still be kernel non-reclaimable allocations that have to be moved.
The concession I would make there is that allocations for caches could use
the per-cpu caches as they are easy to get rid of.
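
In other words, something like this (the helper is invented here;
__GFP_USERRCLM is the flag the patch adds):

  /* Only UserRclm-style allocations may be served from the per-cpu
   * hot/cold lists; everything else goes straight to the buddy lists so
   * kernel pages do not creep into blocks reserved for UserRclm */
  static inline int pcp_allowed(unsigned int gfp_mask)
  {
          return (gfp_mask & __GFP_USERRCLM) != 0;
  }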

> >-	struct free_area	free_area[MAX_ORDER];
> >+	struct free_area	free_area_lists[ALLOC_TYPES][MAX_ORDER];
> >+	struct free_area	free_area_global;
> >+
> >+	/*
> >+	 * This map tracks what each 2^MAX_ORDER sized block
> >has been used for.
> >+	 * When a page is freed, it's index within this bitmap
> >is calculated
> >+	 * using (address >> MAX_ORDER) * 2 . This means that pages will
> >+	 * always be freed into the correct list in free_area_lists
> >+	 */
> >+	unsigned long		*free_area_usemap;
>
> So, the current user/kernelreclaim/kernelnonreclaim determination
> is based on this bitmap.  Couldn't this be managed in individual
> struct pages instead, kind of like the buddy bitmap patches?
>

Yes, but it would be a waste of memory and the struct page flags are
already under a lot of pressure (In fact, I am 99.9999% certain that those
bits are at a premium and Andrew Morton at least will be fierce unhappy if
I try and use another one). As I only care about a 2^MAX_ORDER block of
pages, I only need two bits per 2^MAX_ORDER pages to track them.
Per page, I would need at least one bit per page, which would be a fierce
waste of memory.
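
To put rough numbers on it (an assumed example, 1GiB of memory with 4KiB
pages and MAX_ORDER of 10, not figures from my tests):

  #include <stdio.h>

  int main(void)
  {
          unsigned long pages  = (1UL << 30) >> 12;  /* 262144 pages */
          unsigned long blocks = pages >> 10;        /* 256 MAX_ORDER blocks */

          printf("usemap at 2 bits per block: %lu bytes\n", blocks * 2 / 8);
          printf("1 bit for every page:       %lu bytes\n", pages / 8);
          return 0;
  }

That prints 64 bytes versus 32768 bytes, which is the difference being
argued here.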

> I'm trying to figure out one last bug when I remove memory (via
> nonlinear sections) that has been dedicated to user allocations.
> After which perhaps I'll post it as well, although it is *very*
> similar.  However it does demonstrate the utility of this approach
> for memory hotplug - specifically memory removal - without the
> complexity of adding more zones.
>

When you post that, make sure linux-mm is cc'd and I'll certainly see it.
On the linux-kernel mailing list, I might miss it. Thanks

-- 
Mel Gorman


* RE: [RFC] Avoiding fragmentation through different allocator
@ 2005-01-12 22:45 Tolentino, Matthew E
  2005-01-12 23:12 ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Tolentino, Matthew E @ 2005-01-12 22:45 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List; +Cc: Linux Kernel Mailing List

Hi Mel!

>Instead of having one global MAX_ORDER-sized array of free 
>lists, there are
>three, one for each type of allocation. Finally, there is a 
>list of pages of
>size 2^MAX_ORDER which is a global pool of the largest pages 
>the kernel deals with.

I've got a patch that I've been testing recently for memory
hotplug that does nearly the exact same thing - break up the 
management of page allocations based on type - after having
had a number of conversations with Dave Hansen on this topic.  
I've also prototyped this to use as an alternative to adding
duplicate zones for delineating between memory that may be
removed and memory that is not likely to ever be removable.  I've
only tested in the context of memory hotplug, but it does
greatly simplify memory removal within individual zones.   Your
distinction between areas is pretty cool considering I've only 
distinguished at the coarser granularity of user vs. kernel 
to date.  It would be interesting to throw KernelNonReclaimable 
into the mix as well although I haven't gotten there yet...  ;-)

>Once a 2^MAX_ORDER block of pages is split for a type of
>allocation, it is
>added to the free-lists for that type, in effect reserving it. 
>Hence, over
>time, pages of the related types can be clustered together. This means
>that if we wanted 2^MAX_ORDER number of pages, we could linearly scan a
>block of pages allocated for UserReclaimable and page each of them out.

Interesting.  I took a slightly different approach due to some
known delineations between areas that are defined to be non-
removable vs. areas that may be removed at some point.  Thus I'm
only managing two distinct free_area lists currently.  I'm curious
as to the motivation for having a global MAX_ORDER size list that
is allocation agnostic initially...is it so that the pages can
evolve according to system demands (assuming MAX_ORDER sized 
chunks are eventually available again)?

It looks like you left the per_cpu_pages as-is.  Did you
consider separating those as well to reflect kernel vs. user
pools?

>-	struct free_area	free_area[MAX_ORDER];
>+	struct free_area	free_area_lists[ALLOC_TYPES][MAX_ORDER];
>+	struct free_area	free_area_global;
>+
>+	/*
>+	 * This map tracks what each 2^MAX_ORDER sized block 
>has been used for.
>+	 * When a page is freed, it's index within this bitmap 
>is calculated
>+	 * using (address >> MAX_ORDER) * 2 . This means that pages will
>+	 * always be freed into the correct list in free_area_lists
>+	 */
>+	unsigned long		*free_area_usemap;

So, the current user/kernelreclaim/kernelnonreclaim determination
is based on this bitmap.  Couldn't this be managed in individual
struct pages instead, kind of like the buddy bitmap patches?  

I'm trying to figure out one last bug when I remove memory (via
nonlinear sections) that has been dedicated to user allocations.  
After which perhaps I'll post it as well, although it is *very*
similar.  However it does demonstrate the utility of this approach
for memory hotplug - specifically memory removal - without the 
complexity of adding more zones.  

matt


* [RFC] Avoiding fragmentation through different allocator
@ 2005-01-12 21:09 Mel Gorman
  2005-01-13  7:03 ` Matt Mackall
  2005-01-13  7:31 ` William Lee Irwin III
  0 siblings, 2 replies; 19+ messages in thread
From: Mel Gorman @ 2005-01-12 21:09 UTC (permalink / raw)
  To: Linux Memory Management List; +Cc: Linux Kernel Mailing List

Hi folks,

I was reading through the archives on threads involving page fragmentation
where high-order allocations are difficult to meet. The solutions I saw
proposed and implemented were focusing on moving the pages around using the
same mechanisms as used for memory hotplug and NUMA node migration. From what I
can gather, this is both expensive and difficult. This patch is a proposal to
change the page allocator to avoid some of the nastier fragmentation problems
in the first place. The objective is that higher order pages will be more
likely to be available and if not, there will be an easy way of finding large
blocks to free up. It does not negate the defragmentation-through-migration
(Marcelo's work I believe) approach to the problem but it should make that
approach's job a lot easier.

The patch is against 2.6.9 but, as this is not ready for inclusion yet anyway,
I figured I would get opinions first before forward-porting or thinking
about any optimisations.

So... What the patch does. Allocations are divided up into three different
types of allocations;

UserReclaimable - These are userspace pages that are easily reclaimable. Right
	now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
	well as disk-buffer pages into this category. These pages are trivially
	reclaimed by writing the page out to swap or syncing with backing
	storage

KernelReclaimable - These are pages allocated by the kernel that are easily
	reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
	These type of pages potentially could be reclaimed by dumping the
	caches and reaping the slabs (drastic, but you get the idea). We could
	also add pages into this category that are known to be only required
	for a short time like buffers used with DMA

KernelNonReclaimable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type

Instead of having one global MAX_ORDER-sized array of free lists, there are
three, one for each type of allocation. Finally, there is a list of pages of
size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals
with.

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the related types can be clustered together. This means
that if we wanted 2^MAX_ORDER number of pages, we could linearly scan a
block of pages allocated for UserReclaimable and page each of them out.

The patch also implements fallback logic for when there are no 2^MAX_ORDER
pages available and there are no free pages of the desired type. The fallback
lists were chosen in a way that keeps the most easily reclaimable pages
together.
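
For illustration, the type of an allocation can be read straight off the
new gfp flags added below (__GFP_USERRCLM and __GFP_KERNRCLM); the ALLOC_*
names and the helper itself are just a sketch, not lifted from the patch:

  /* Map a gfp mask onto one of the three sets of free lists.  Anything
   * not explicitly marked reclaimable is treated as kernel
   * non-reclaimable, matching the default described above. */
  static inline int alloc_type(unsigned int gfp_mask)
  {
          if (gfp_mask & __GFP_USERRCLM)
                  return ALLOC_USERRCLM;    /* user and disk-buffer pages */
          if (gfp_mask & __GFP_KERNRCLM)
                  return ALLOC_KERNRCLM;    /* dcache, inode caches, etc. */
          return ALLOC_KERNNORCLM;
  }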

I stress-tested this patch very heavily and it never oopsed, so I am
confident of its stability. What is left is to look at what the results of
this patch were, and I think they look promising in a number of respects. I
have graphs that do not translate to text very well, so I'll just point you
to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.

The two figures at the top show the normal zone with the normal allocator
and with the modified allocator. A value of 0 shows the page is free. 1 is
a user reclaimable page. 2 is a kernel reclaimable page and 3 is a
non-reclaimable page. The standard allocator has these all mixed in
together. The modified allocator has each of them grouped together.

The Normal zone is the main zone of interest, but I included the graph of
what HighMem looks like. As it is almost all userspace allocations anyway,
it does not present a defragmentation problem (just linearly scan the pages
and either page out or sync with backing storage and dump).

So.... the actual fragmentation results.

The results were not spectacular but still very interesting. Under heavy
stressing (updatedb + 4 simultaneous -j4 kernel compiles with avg load 15)
fragmentation is consistently lower than the standard allocator. It could
also be a lot better if there was some means of purging caches, userpages
and buffers but thats in the future. For the moment, the only real control
I had was the buffer pages.

The figures below are the output of /proc/buddyinfo before and after deleting
the kernel trees that were compiled (flushing their buffer pages).

Standard allocator - Before deleting trees
Node 0, zone      DMA      3      1      4      4      4      1      2      0      1      1      2
Node 0, zone   Normal    298      8    892    460    268     81     28      8      1      0     14
Node 0, zone  HighMem     30     15      1      0      0      0      0      0      1      0      0

Standard allocator - After deleting trees
Node 0, zone      DMA      3      1      4      4      4      1      2      0      1      1      2
Node 0, zone   Normal  18644  13067   7911   4224   1583    351     52     10      1      0     14
Node 0, zone  HighMem    370    343    218    157   3078    355     80     48     26     13      5


Modified allocator - Before deleting trees
Node 0, zone      DMA     1      0      4      4      4      1      2      0      1      1      2
Node 0, zone   Normal   658    230    113    578    176     73     23      9     10      4     13
Node 0, zone  HighMem    40     10      4      0      1      1      1      1      0      0      0

Modified allocator - After deleting trees
Node 0, zone      DMA     1      0      4      4      4      1      2      0      1      1      2
Node 0, zone   Normal 15407  11695   7280   3858   1563    523     87     13     11      4     13
Node 0, zone  HighMem  6816   3616   3407   3328    566    144     89     53     17      9      4


In both cases, deleting the trees really freed up HighMem, which is not a
surprise. The fragmentation there will really depend on when buffer pages were
allocated (I could make this close to perfect if I put only buffer pages into
UserRclm) so there is not a lot that can be done there; let's look at Normal.

Before the trees were deleted, the modified allocator was in better shape
than the standard allocator at the higher orders, although arguably they
are in similar shape. However, after the trees were deleted, the modified
allocator was in way better shape. Deleting buffers barely made a difference
to high-order page availability with the standard allocator but really
improved it with the modified one.

S: Node 0, zone   Normal  18644  13067   7911   4224   1583    351     52     10      1      0     14
M: Node 0, zone   Normal  15407  11695   7280   3858   1563    523     87     13     11      4     13
Delta:			  -3237  -1372   -631   -366    -20   +172    +35     +3    +10     +4     -1

Ok, order 10 lost out but I consider that just unlucky :) . From order 5
up, there were definite improvements. I could show what the figures look
like in each of the UserRclm, KernRclm and KernNoRclm pools but what they
show is that the UserRclm pool really benefits and it is obvious that the
other two pools were used frequently for fallbacks under the heavy memory
pressure.

So.... bottom line, there are consistent improvements in fragmentation using
the allocator before anything drastic like linear page scanning or migration
of pages takes place. I have not measured it yet, but I do not believe the
overhead of the scheme is severe either.

Opinions/Feedback?

>>>Patch follows<<<
diff -rup linux-2.6.9-clean/fs/buffer.c linux-2.6.9-mbuddy/fs/buffer.c
--- linux-2.6.9-clean/fs/buffer.c	2004-10-18 22:54:32.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/buffer.c	2005-01-03 19:33:23.000000000 +0000
@@ -1203,7 +1203,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;

-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+					GFP_NOFS | __GFP_USERRCLM);
 	if (!page)
 		return NULL;

@@ -3065,7 +3066,8 @@ static void recalc_bh_state(void)

 struct buffer_head *alloc_buffer_head(int gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+						   gfp_flags|__GFP_KERNRCLM);
 	if (ret) {
 		preempt_disable();
 		__get_cpu_var(bh_accounting).nr++;
diff -rup linux-2.6.9-clean/fs/dcache.c linux-2.6.9-mbuddy/fs/dcache.c
--- linux-2.6.9-clean/fs/dcache.c	2004-10-18 22:53:21.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/dcache.c	2005-01-03 19:34:12.000000000 +0000
@@ -691,7 +691,8 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;

-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+	dentry = kmem_cache_alloc(dentry_cache,
+			  	  GFP_KERNEL|__GFP_KERNRCLM);
 	if (!dentry)
 		return NULL;

diff -rup linux-2.6.9-clean/fs/ext2/super.c linux-2.6.9-mbuddy/fs/ext2/super.c
--- linux-2.6.9-clean/fs/ext2/super.c	2004-10-18 22:54:38.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ext2/super.c	2005-01-03 19:34:35.000000000 +0000
@@ -134,7 +134,7 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup linux-2.6.9-clean/fs/ext3/super.c linux-2.6.9-mbuddy/fs/ext3/super.c
--- linux-2.6.9-clean/fs/ext3/super.c	2004-10-18 22:55:07.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ext3/super.c	2005-01-03 19:34:47.000000000 +0000
@@ -442,7 +442,7 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;

-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup linux-2.6.9-clean/fs/ntfs/inode.c linux-2.6.9-mbuddy/fs/ntfs/inode.c
--- linux-2.6.9-clean/fs/ntfs/inode.c	2004-10-18 22:55:07.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ntfs/inode.c	2005-01-03 19:35:27.000000000 +0000
@@ -314,7 +314,7 @@ struct inode *ntfs_alloc_big_inode(struc

 	ntfs_debug("Entering.");
 	ni = (ntfs_inode *)kmem_cache_alloc(ntfs_big_inode_cache,
-			SLAB_NOFS);
+			SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -339,7 +339,8 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;

 	ntfs_debug("Entering.");
-	ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache,
+					    SLAB_NOFS|__GFP_KERNRCLM);
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
diff -rup linux-2.6.9-clean/include/linux/gfp.h linux-2.6.9-mbuddy/include/linux/gfp.h
--- linux-2.6.9-clean/include/linux/gfp.h	2004-10-18 22:53:44.000000000 +0100
+++ linux-2.6.9-mbuddy/include/linux/gfp.h	2005-01-09 18:34:07.000000000 +0000
@@ -37,21 +37,24 @@ struct vm_area_struct;
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_KERNRCLM  0x8000  /* Kernel page that is easily reclaimable */
+#define __GFP_USERRCLM  0x10000 /* Page is for userspace and easily reclaimable */

-#define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 18	/* Room for 18 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
-			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
+			__GFP_KERNRCLM|__GFP_USERRCLM)

 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USERRCLM)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_USERRCLM)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
@@ -130,4 +133,7 @@ extern void FASTCALL(free_cold_page(stru

 void page_alloc_init(void);

+/* VM Regress: define to indicate tracing callbacks is enabled */
+#define TRACE_PAGE_ALLOCS
+
 #endif /* __LINUX_GFP_H */
diff -rup linux-2.6.9-clean/include/linux/mmzone.h linux-2.6.9-mbuddy/include/linux/mmzone.h
--- linux-2.6.9-clean/include/linux/mmzone.h	2004-10-18 22:54:31.000000000 +0100
+++ linux-2.6.9-mbuddy/include/linux/mmzone.h	2005-01-09 15:19:44.000000000 +0000
@@ -19,6 +19,10 @@
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
+#define ALLOC_TYPES 3
+#define ALLOC_KERNNORCLM 0
+#define ALLOC_KERNRCLM 1
+#define ALLOC_USERRCLM 2

 struct free_area {
 	struct list_head	free_list;
@@ -162,9 +166,23 @@ struct zone {
 	int prev_priority;

 	/*
-	 * free areas of different sizes
+	 * free areas of different sizes. There is one global free area that
+	 * contains lists of pages of 2^MAX_ORDER in size. Once split, the
+	 * buddy will be added to one of the free_area lists in free_area_lists.
+	 * Each type is one of User Reclaimable, Kernel Reclaimable and
+	 * Kernel Non-reclaimable. The idea is that pages of related use will
+	 * be clustered together to reduce fragmentation.
 	 */
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area_lists[ALLOC_TYPES][MAX_ORDER];
+	struct free_area	free_area_global;
+
+	/*
+	 * This map tracks what each 2^MAX_ORDER sized block has been used for.
+	 * When a page is freed, its index within this bitmap is calculated
+	 * using (address >> MAX_ORDER) * 2 . This means that pages will
+	 * always be freed into the correct list in free_area_lists
+	 */
+	unsigned long		*free_area_usemap;

 	/*
 	 * wait_table		-- the array holding the hash table
diff -rup linux-2.6.9-clean/mm/page_alloc.c linux-2.6.9-mbuddy/mm/page_alloc.c
--- linux-2.6.9-clean/mm/page_alloc.c	2004-10-18 22:53:11.000000000 +0100
+++ linux-2.6.9-mbuddy/mm/page_alloc.c	2005-01-12 14:45:20.000000000 +0000
@@ -40,8 +40,21 @@ unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 long nr_swap_pages;
 int numnodes = 1;
+int fallback_count=0;
+int drastic_fallback_count=0;
+int global_steal=0;
+int global_refill=0;
+int kernnorclm_count=0;
+int kernrclm_count=0;
+int userrclm_count=0;
 int sysctl_lower_zone_protection = 0;

+int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES] = {
+ 	{ ALLOC_KERNNORCLM, ALLOC_KERNRCLM,   ALLOC_USERRCLM },
+ 	{ ALLOC_KERNRCLM,   ALLOC_KERNNORCLM, ALLOC_USERRCLM },
+ 	{ ALLOC_USERRCLM,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM }
+};
+
 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);

@@ -53,6 +66,7 @@ struct zone *zone_table[1 << (ZONES_SHIF
 EXPORT_SYMBOL(zone_table);

 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
+static char *type_names[ALLOC_TYPES] = { "KernNoRclm", "KernRclm", "UserRclm"};
 int min_free_kbytes = 1024;

 unsigned long __initdata nr_kernel_pages;
@@ -94,6 +108,48 @@ static void bad_page(const char *functio
 	page->mapping = NULL;
 }

+/*
+ * Return the allocation type that the 2^MAX_ORDER block of pages
+ * containing the given page is being used for
+ */
+static int get_pageblock_type(struct page *page) {
+	struct zone *zone = page_zone(page);
+	int pageidx = (page - zone->zone_mem_map) >> MAX_ORDER;
+	int bitidx = pageidx * 2;
+
+	/* Bit 1 will be set if the block is kernel reclaimable */
+	if (test_bit(bitidx,zone->free_area_usemap)) return ALLOC_KERNRCLM;
+
+	/* Bit 2 will be set if the block is user reclaimable */
+	if (test_bit(bitidx+1, zone->free_area_usemap)) return ALLOC_USERRCLM;
+
+	return ALLOC_KERNNORCLM;
+}
+
+static void set_pageblock_type(struct page *page, int type) {
+	int bit1, bit2;
+	struct zone *zone = page_zone(page);
+	int pageidx = (page - zone->zone_mem_map) >> MAX_ORDER;
+	int bitidx = pageidx * 2;
+	bit1 = bit2 = 0;
+
+	if (type == ALLOC_KERNRCLM) {
+		set_bit(bitidx, zone->free_area_usemap);
+		clear_bit(bitidx+1, zone->free_area_usemap);
+		return;
+	}
+
+	if (type == ALLOC_USERRCLM) {
+		clear_bit(bitidx, zone->free_area_usemap);
+		set_bit(bitidx+1, zone->free_area_usemap);
+		return;
+	}
+
+	clear_bit(bitidx, zone->free_area_usemap);
+	clear_bit(bitidx+1, zone->free_area_usemap);
+
+}
+
 #ifndef CONFIG_HUGETLB_PAGE
 #define prep_compound_page(page, order) do { } while (0)
 #define destroy_compound_page(page, order) do { } while (0)
@@ -193,7 +249,9 @@ static inline void __free_pages_bulk (st
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

+		/* MBUDDY TODO: Fix this bug check
 		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		*/
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -211,7 +269,16 @@ static inline void __free_pages_bulk (st
 		area++;
 		index >>= 1;
 		page_idx &= mask;
+
+
 	}
+
+	/* Switch to the global list if this is the MAX_ORDER */
+	if (order >= MAX_ORDER-1) {
+		area = &(zone->free_area_global);
+		global_refill++;
+	}
+
 	list_add(&(base + page_idx)->lru, &area->free_list);
 }

@@ -253,14 +320,23 @@ free_pages_bulk(struct zone *zone, int c
 	struct free_area *area;
 	struct page *base, *page = NULL;
 	int ret = 0;
+	int alloctype;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 	while (!list_empty(list) && count--) {
 		page = list_entry(list->prev, struct page, lru);
+
+		/*
+		 * Find what area this page belonged to. It is possible that
+		 * the pages are always of the same type and this expensive
+		 * check could be only performed once. Needs investigation
+		 */
+		alloctype = get_pageblock_type(page);
+		area = zone->free_area_lists[alloctype] + order;
+
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
 		__free_pages_bulk(page, base, zone, area, order);
@@ -363,15 +439,37 @@ static void prep_new_page(struct page *p
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int flags)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
 	unsigned int index;
+	int global_split=0;
+
+	/* Select area to use based on gfp_flags */
+	int alloctype;
+	int retry_count=0;
+	if (flags & __GFP_USERRCLM) {
+		alloctype = ALLOC_USERRCLM;
+		userrclm_count++;
+	}
+	else if (flags & __GFP_KERNRCLM) {
+		alloctype = ALLOC_KERNRCLM;
+		kernrclm_count++;
+	} else {
+		alloctype = ALLOC_KERNNORCLM;
+		kernnorclm_count++;
+	}
+
+	/* Ok, pick the fallback order based on the type */
+	int *fallback_list = fallback_allocs[alloctype];

+retry:
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		alloctype = fallback_list[retry_count];
+		area = zone->free_area_lists[alloctype] + current_order;
+
 		if (list_empty(&area->free_list))
 			continue;

@@ -384,6 +482,34 @@ static struct page *__rmqueue(struct zon
 		return expand(zone, page, index, order, current_order, area);
 	}

+	/* Take from the global pool if this is the first attempt */
+	if (!global_split && !list_empty(&(zone->free_area_global.free_list))){
+		/*
+		 * Remove a MAX_ORDER block from the global pool and add
+		 * it to the list of desired alloc_type
+		 */
+		page = list_entry(zone->free_area_global.free_list.next,
+				  struct page, lru);
+		list_del(&page->lru);
+		list_add(&page->lru,
+			&(zone->free_area_lists[alloctype][MAX_ORDER-1].free_list));
+		global_steal++;
+		global_split=1;
+
+		/* Mark this block of pages as for use with this alloc type */
+		set_pageblock_type(page, alloctype);
+
+		goto retry;
+	}
+
+	/*
+	 * Here, the alloc type lists have been depleted as well as the global
+	 * pool, so fallback
+	 */
+	retry_count++;
+	fallback_count++;
+	if (retry_count != ALLOC_TYPES) goto retry;
+
 	return NULL;
 }

@@ -393,7 +519,8 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list,
+			int gfp_flags)
 {
 	unsigned long flags;
 	int i;
@@ -402,7 +529,7 @@ static int rmqueue_bulk(struct zone *zon

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, gfp_flags);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -438,7 +565,7 @@ int is_head_of_free_region(struct page *
 {
         struct zone *zone = page_zone(page);
         unsigned long flags;
-	int order;
+	int order,type;
 	struct list_head *curr;

 	/*
@@ -446,12 +573,16 @@ int is_head_of_free_region(struct page *
 	 * suspend anyway, but...
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list)
+	for (type = 0 ; type < ALLOC_TYPES ; type++) {
+		for (order = MAX_ORDER - 1; order >= 0; --order) {
+			list_for_each(curr,
+				&zone->free_area_lists[type][order].free_list)
 			if (page == list_entry(curr, struct page, lru)) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return 1 << order;
 			}
+		}
+	}
 	spin_unlock_irqrestore(&zone->lock, flags);
         return 0;
 }
@@ -545,14 +676,15 @@ buffered_rmqueue(struct zone *zone, int
 	struct page *page = NULL;
 	int cold = !!(gfp_flags & __GFP_COLD);

-	if (order == 0) {
+	if (order == 0 && (gfp_flags & __GFP_USERRCLM)) {
 		struct per_cpu_pages *pcp;

 		pcp = &zone->pageset[get_cpu()].pcp[cold];
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list,
+						gfp_flags);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -564,7 +696,7 @@ buffered_rmqueue(struct zone *zone, int

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, gfp_flags);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

@@ -1040,6 +1172,7 @@ void show_free_areas(void)
 	unsigned long inactive;
 	unsigned long free;
 	struct zone *zone;
+	int type;

 	for_each_zone(zone) {
 		show_node(zone);
@@ -1130,8 +1263,11 @@ void show_free_areas(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
 			nr = 0;
-			list_for_each(elem, &zone->free_area[order].free_list)
-				++nr;
+			for (type=0; type < ALLOC_TYPES; type++) {
+				list_for_each(elem,
+					&zone->free_area_lists[type][order].free_list)
+					++nr;
+			}
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1445,19 +1581,35 @@ unsigned long pages_to_bitmap_size(unsig
 void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone, unsigned long size)
 {
 	int order;
-	for (order = 0; ; order++) {
-		unsigned long bitmap_size;
+	int type;
+	struct free_area *area;
+	unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		if (order == MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
-			break;
-		}
+	/* Initialise the three size-ordered lists of free_areas */
+	for (type=0; type < ALLOC_TYPES; type++) {
+
+		for (order = 0; ; order++) {
+			area = zone->free_area_lists[type];
+
+			INIT_LIST_HEAD(&area[order].free_list);
+			if (order == MAX_ORDER-1) {
+				area[order].map = NULL;
+				break;
+			}

-		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map =
-		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+			bitmap_size = pages_to_bitmap_size(order, size);
+			area[order].map =
+		  		(unsigned long*)alloc_bootmem_node(pgdat,
+							   	 bitmap_size);
+		}
 	}
+
+	/* Initialise the global pool of 2^MAX_ORDER sized blocks */
+	INIT_LIST_HEAD(&zone->free_area_global.free_list);
+	bitmap_size = pages_to_bitmap_size(MAX_ORDER-1, size);
+	zone->free_area_global.map =
+		(unsigned long*)alloc_bootmem_node(pgdat,
+					   	   bitmap_size);
 }

 #ifndef __HAVE_ARCH_MEMMAP_INIT
@@ -1575,6 +1727,11 @@ static void __init free_area_init_core(s
 		zone_start_pfn += size;

 		zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+		zone->free_area_usemap =
+			(unsigned long *)alloc_bootmem_node(pgdat,
+						   (size >> MAX_ORDER) * 2);
+		memset((unsigned long *)zone->free_area_usemap, ALLOC_KERNNORCLM,
+			(size >> MAX_ORDER) * 2);
 	}
 }

@@ -1653,25 +1810,83 @@ static int frag_show(struct seq_file *m,
 	struct zone *zone;
 	struct zone *node_zones = pgdat->node_zones;
 	unsigned long flags;
-	int order;
+	int order, type;
+	struct list_head *elem;

+	/* Show global fragmentation statistics */
 	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
 		if (!zone->present_pages)
 			continue;

 		spin_lock_irqsave(&zone->lock, flags);
-		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-		for (order = 0; order < MAX_ORDER; ++order) {
-			unsigned long nr_bufs = 0;
+		seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+		unsigned long nr_bufs = 0;
+		for (order = 0; order < MAX_ORDER-1; ++order) {
+			nr_bufs = 0;
+
+			for (type=0; type < ALLOC_TYPES; type++) {
+				list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+					++nr_bufs;
+			}
+			seq_printf(m, "%6lu ", nr_bufs);
+		}
+
+		/* Scan global list */
+		nr_bufs = 0;
+		list_for_each(elem, &(zone->free_area_global.free_list))
+			++nr_bufs;
+		seq_printf(m, "%6lu ", nr_bufs);
+
+		spin_unlock_irqrestore(&zone->lock, flags);
+		seq_putc(m, '\n');
+	}
+
+	/* Show statistics for each allocation type */
+	seq_printf(m, "\nPer-allocation-type statistics");
+	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+		if (!zone->present_pages)
+			continue;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		unsigned long nr_bufs = 0;
+		for (type=0; type < ALLOC_TYPES; type++) {
+			seq_printf(m, "\nNode %d, zone %8s, type %10s",
+					pgdat->node_id, zone->name,
+					type_names[type]);
 			struct list_head *elem;
+			for (order = 0; order < MAX_ORDER; ++order) {
+				nr_bufs = 0;

-			list_for_each(elem, &(zone->free_area[order].free_list))
-				++nr_bufs;
-			seq_printf(m, "%6lu ", nr_bufs);
+				list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+					++nr_bufs;
+				seq_printf(m, "%6lu ", nr_bufs);
+			}
 		}
+
+		/* Scan global list */
+		seq_printf(m, "\n");
+		seq_printf(m, "Node %d, zone %8s, type %10s",
+					pgdat->node_id, zone->name,
+					"MAX_ORDER");
+		nr_bufs = 0;
+		list_for_each(elem, &(zone->free_area_global.free_list))
+			++nr_bufs;
+		seq_printf(m, "%6lu ", nr_bufs);
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
+
+	/* Show bean counters */
+	seq_printf(m, "\nGlobal beancounters\n");
+	seq_printf(m, "Global steals:     %d\n", global_steal);
+	seq_printf(m, "Global refills:    %d\n", global_refill);
+	seq_printf(m, "Fallback count:    %d\n", fallback_count);
+	seq_printf(m, "KernNoRclm allocs: %d\n", kernnorclm_count);
+	seq_printf(m, "KernRclm allocs:   %d\n", kernrclm_count);
+	seq_printf(m, "UserRclm allocs:   %d\n", userrclm_count);
+
+
 	return 0;
 }


