linux-kernel.vger.kernel.org archive mirror
* Re: Active Memory Defragmentation: Our implementation & problems
@ 2004-02-04 20:12 Mark_H_Johnson
  0 siblings, 0 replies; 32+ messages in thread
From: Mark_H_Johnson @ 2004-02-04 20:12 UTC (permalink / raw)
  To: root
  Cc: Dave Hansen, linux-kernel, linux-mm, Timothy Miller,
	owner-linux-mm, Alok Mooley





Richard B. Johnson wrote:

>Eventually you
>get to the fact that even contiguous physical RAM doesn't
>have to be contiguous and, in fact, with modern controllers
>it's quite unlikely that it is. It's a sack of bits that
>are uniquely addressable.

Yes and no. We are using a shared memory interface on a cluster that allows
us to map up to 512 Mbytes of memory from another machine. There are 16k
address translation table (ATT) entries in the card, so we're allocating
32K chunks of memory per ATT entry. We are using the bigphysarea patch for
the driver (in 2.4 kernels) only because the driver can't reliably get the
chunks of RAM it is asking for. We can either continue to operate the way
we have been, or get a mechanism to defragment physical RAM so the driver
can still work a week after the machine was rebooted.
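
Those 32K chunks are order-3 requests to the buddy allocator: 512 Mbytes
spread over 16k ATT entries works out to 32 Kbytes, i.e. eight 4k pages,
per entry. A minimal sketch of such a request, with hypothetical helper
names rather than the actual driver code:

#include <linux/mm.h>

#define ATT_CHUNK_ORDER 3	/* 8 x 4 KB pages = 32 KB per ATT entry */

/* Ask the buddy allocator for one physically contiguous 32 KB chunk. */
static unsigned long att_alloc_chunk(void)
{
	/*
	 * This fails once free memory is fragmented into pieces smaller
	 * than 32 KB, even when plenty of individual pages are free,
	 * which is the situation described above.
	 */
	return __get_free_pages(GFP_KERNEL, ATT_CHUNK_ORDER);
}

static void att_free_chunk(unsigned long addr)
{
	free_pages(addr, ATT_CHUNK_ORDER);
}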

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:33       ` Richard B. Johnson
  2004-02-05  5:07         ` Alok Mooley
@ 2004-02-05 19:03         ` Pavel Machek
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Machek @ 2004-02-05 19:03 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Alok Mooley, linux-kernel, linux-mm, Dave Hansen

Hi!

> > > If this is an Intel x86 machine, it is impossible
> > > for pages
> > > to get fragmented in the first place. The hardware
> > > allows any
> > > page, from anywhere in memory, to be concatenated
> > > into linear
> > > virtual address space. Even the kernel address space
> > > is virtual.
> > > The only time you need physically-adjacent pages is
> > > if you
> > > are doing DMA that is more than a page-length at a
> > > time. The
> > > kernel keeps a bunch of those pages around for just
> > > that
> > > purpose.
> > >
> > > So, if you are making a "memory defragmenter", it is
> > > a CPU time-sink.
> > > That's all.
> >
> > What if the external fragmentation increases so much
> > that it is not possible to find a large sized block?
> > Then, is it not better to defragment rather than swap
> > or fail?
> >
> > -Alok
> 
> All "blocks" are the same size, i.e., PAGE_SIZE. When RAM
> is tight the content of a page is written to the swap-file
> according to a least-recently-used protocol. This frees
> a page. Pages are allocated to a process only one page at
> a time. This prevents some hog from grabbing all the memory
> in the machine. Memory allocation and physical page allocation
> are two different things, I can malloc() a gigabyte of RAM on
> a machine. It only gets allocated when an attempt is made
> to access a page.

Alok is right. The kernel needs to do kmalloc(8K) from time to time. And
notice that the kernel uses 4MB pages for its own mappings.
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 23:24         ` Dave Hansen
@ 2004-02-05 16:32           ` Dave McCracken
  0 siblings, 0 replies; 32+ messages in thread
From: Dave McCracken @ 2004-02-05 16:32 UTC (permalink / raw)
  To: Dave Hansen, Timothy Miller; +Cc: linux-kernel, linux-mm


--On Wednesday, February 04, 2004 15:24:57 -0800 Dave Hansen
<haveblue@us.ibm.com> wrote:

>> Let's say this defragmenter allowed the kernel to detect when 1024 4k 
>> pages were contiguous and aligned properly and could silently replace 
>> the processor mapping tables so that all of these "pages" would be 
>> mapped by one TLB entry.  (At such time that some pages need to get 
>> freed, the VM would silently switch back to the 4k model.)
>> 
>> This would reduce TLB entries for a lot of programs above a certain 
>> size, and therefore improve performance.
> 
> This is something that would be interesting to pursue, but we already
> have hugetlbfs which does this when you need it explicitly.  When you
> know that TLB coverage is a problem, that's your fix for now. 

Hugetlbfs runs from a dedicated pool of large mlocked pages.  This limits
how many hugetlbfs users there can be.

One of the things that's been discussed is letting hugetlbfs use the buddy
allocator in some fashion so it can grow dynamically.  The main barrier to
it is fragmentation, so a working defrag might make it feasible.

Dave McCracken


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:56   ` Dave Hansen
@ 2004-02-05  5:19     ` Alok Mooley
  0 siblings, 0 replies; 32+ messages in thread
From: Alok Mooley @ 2004-02-05  5:19 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm


--- Dave Hansen <haveblue@us.ibm.com> wrote:
> On Wed, 2004-02-04 at 10:54, Alok Mooley wrote:
> > --- Dave Hansen <haveblue@us.ibm.com> wrote:
> Depending on the quantity of work that you're trying
> to do at once, this
> might be unavoidable.  
> 
> I know it's a difficult thing to think about, but I
> still don't
> understand the precise cases that you're concerned
> about.  Page faults
> to me seem like the least of your problems.  A
> bigger issue would be if
> the page is written to by userspace after you copy,
> but before you
> install the new pte.  Did I miss the code in your
> patch that invalidated
> the old tlb entries?

This is a non-issue for us right now, since we update
the ptes under a lock, & so no one can access them
before they are completely updated. Yes, we invalidate
the old TLB entries, as well as the cache entries as
required on some other architectures.
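
In sketch form, that update sequence (lock, flush, copy, install the new
pte) might look like the following; the accessors shown are the early-2.6
forms and the helper is illustrative, not the actual patch code:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

/* Illustrative: swap one anonymous page's mapping from 'from' to 'to'. */
static void move_anon_page(struct mm_struct *mm, struct vm_area_struct *vma,
			   unsigned long addr, pte_t *ptep,
			   struct page *from, struct page *to)
{
	pte_t old;

	spin_lock(&mm->page_table_lock);	/* faults serialize against us */

	flush_cache_page(vma, addr);		/* virtually indexed caches */
	old = ptep_get_and_clear(ptep);		/* pte is now not present */
	flush_tlb_page(vma, addr);		/* kill any stale translation */

	copy_highpage(to, from);		/* nobody can dirty 'from' now */

	/* A real version would preserve the old pte's protection bits. */
	set_pte(ptep, mk_pte(to, vma->vm_page_prot));
	if (pte_dirty(old))
		set_page_dirty(to);		/* carry the dirty bit across */

	spin_unlock(&mm->page_table_lock);
}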

-Alok



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:33       ` Richard B. Johnson
@ 2004-02-05  5:07         ` Alok Mooley
  2004-02-05 19:03         ` Pavel Machek
  1 sibling, 0 replies; 32+ messages in thread
From: Alok Mooley @ 2004-02-05  5:07 UTC (permalink / raw)
  To: root; +Cc: linux-kernel, linux-mm


--- "Richard B. Johnson" <root@chaos.analogic.com>
wrote:
> All "blocks" are the same size, i.e., PAGE_SIZE.
> When RAM
> is tight the content of a page is written to the
> swap-file
> according to a least-recently-used protocol. This
> frees
> a page. Pages are allocated to a process only one
> page at
> a time. This prevents some hog from grabbing all the
> memory
> in the machine. Memory allocation and physical page
> allocation
> are two different things, I can malloc() a gigabyte
> of RAM on
> a machine. It only gets allocated when an attempt is
> made
> to access a page.

Only userspace processes are allocated pages via
page-faults, i.e., one page at a time. Processes
running in kernel mode can request higher-order
blocks (8K, 16K ... 4M, which are 2 pages, 4 pages
... 1024 pages respectively) from the buddy
allocator directly. If external fragmentation is
rampant, requests for these higher-order blocks may
fail. The defragmentation utility intends to provide
a faster option for higher-order block formation
before swapping (which is the last alternative). By
the way, malloc finally takes memory from the buddy
allocator itself (via page-faults), & the
defragmenter is out to reduce the external
fragmentation caused by the buddy allocator.
Swapping of course cannot be completely avoided if
the machine is genuinely short of memory.
Defragmentation may now sound better than needless
swapping or memory allocation failures, not just
another CPU hog!

Regards,
Alok



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 21:59       ` Timothy Miller
@ 2004-02-04 23:24         ` Dave Hansen
  2004-02-05 16:32           ` Dave McCracken
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2004-02-04 23:24 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Dave McCracken, root, linux-kernel, linux-mm

On Wed, 2004-02-04 at 13:59, Timothy Miller wrote:
> Dave McCracken wrote:
> > Um, wrong answer.  When you ask for more than one page from the buddy
> > allocator  (order greater than 0) it always returns physically contiguous
> > pages.
> > 
> > Also, one of the near-term goals in VM is to be able to allocate and free
> > large pages from the main memory pools, which requires that something like
> > order 9 or 10 allocations (based on the architecture) succeed.
> 
> What's the x86 large page size?  4M?  16M?  For the sake of argument,
> let's call it 4M.  Doesn't matter.

2MB with PAE on, 4MB with it off.

> Let's say this defragmenter allowed the kernel to detect when 1024 4k 
> pages were contiguous and aligned properly and could silently replace 
> the processor mapping tables so that all of these "pages" would be 
> mapped by one TLB entry.  (At such time that some pages need to get 
> freed, the VM would silently switch back to the 4k model.)
> 
> This would reduce TLB entries for a lot of programs above a certain 
> size, and therefore improve performance.

This is something that would be interesting to pursue, but we already
have hugetlbfs which does this when you need it explicitly.  When you
know that TLB coverage is a problem, that's your fix for now. 
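
For comparison, the explicit hugetlbfs route looks roughly like this from
userspace, assuming a hugetlbfs instance is mounted at /mnt/huge and huge
pages have been reserved (the path and size are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (4UL * 1024 * 1024)	/* 4 MB without PAE, 2 MB with */

int main(void)
{
	int fd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");	/* fails when the reserved pool is empty */
		return 1;
	}
	p[0] = 1;	/* the whole region sits behind a single TLB entry */
	munmap(p, HPAGE_SIZE);
	close(fd);
	return 0;
}

On the 2.6 kernels of that era the pool is sized with the hugepages= boot
option or /proc/sys/vm/nr_hugepages, which is exactly the fixed, mlocked
reservation being discussed.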

> The question is:  How much overhead really is caused by TLB misses?  The 
> TLB in the Athlon is like 512 entries.  That means it can know about 2 
> megabytes worth of 4k pages at any one time.

I'm not sure how the Athlon works, but the PIII has very few TLB entries
for large pages.  The P4 can put large or small entries anywhere.  

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:35     ` Dave McCracken
@ 2004-02-04 21:59       ` Timothy Miller
  2004-02-04 23:24         ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Timothy Miller @ 2004-02-04 21:59 UTC (permalink / raw)
  To: Dave McCracken; +Cc: root, linux-kernel, linux-mm



Dave McCracken wrote:

> 
> Um, wrong answer.  When you ask for more than one page from the buddy
> allocator  (order greater than 0) it always returns physically contiguous
> pages.
> 
> Also, one of the near-term goals in VM is to be able to allocate and free
> large pages from the main memory pools, which requires that something like
> order 9 or 10 allocations (based on the architecture) succeed.
> 

What's the x86 large page size?  4M?  16M?  For the sake of argument,
let's call it 4M.  Doesn't matter.

Let's say this defragmenter allowed the kernel to detect when 1024 4k 
pages were contiguous and aligned properly and could silently replace 
the processor mapping tables so that all of these "pages" would be 
mapped by one TLB entry.  (At such time that some pages need to get 
freed, the VM would silently switch back to the 4k model.)

This would reduce TLB entries for a lot of programs above a certain 
size, and therefore improve performance.

The question is:  How much overhead really is caused by TLB misses?  The 
TLB in the Athlon is like 512 entries.  That means it can know about 2 
megabytes worth of 4k pages at any one time.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:37     ` Timothy Miller
  2004-02-04 19:43       ` Dave Hansen
@ 2004-02-04 19:59       ` Richard B. Johnson
  1 sibling, 0 replies; 32+ messages in thread
From: Richard B. Johnson @ 2004-02-04 19:59 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Alok Mooley, Dave Hansen, linux-kernel, linux-mm

On Wed, 4 Feb 2004, Timothy Miller wrote:

>
>
> Richard B. Johnson wrote:
>
> > If this is an Intel x86 machine, it is impossible for pages
> > to get fragmented in the first place. The hardware allows any
> > page, from anywhere in memory, to be concatenated into linear
> > virtual address space. Even the kernel address space is virtual.
> > The only time you need physically-adjacent pages is if you
> > are doing DMA that is more than a page-length at a time. The
> > kernel keeps a bunch of those pages around for just that
> > purpose.
> >
> > So, if you are making a "memory defragmenter", it is a CPU time-sink.
> > That's all.
>
> Would memory fragmentation have any appreciable impact on L2 cache line
> collisions?
>
> Would defragmenting it help?
>
> In the case of the Opteron, there is a 1M cache that is (I forget) N-way
> set associative, and it's physically indexed.  If a bunch of pages were
> located such that there were a disproportionately large number of lines
> which hit the same tag, you could be thrashing the cache.
>
> There are two ways to deal with this:  (1) intelligently locate pages
> in physical memory; (2) hope that natural entropy keeps things random
> enough that it doesn't matter.
>
>

Certainly anybody can figure out some special case where a
cache-line might get flushed or whatever. Eventually you
get to the fact that even contiguous physical RAM doesn't
have to be contiguous and, in fact, with modern controllers
it's quite unlikely that it is. It's a sack of bits that
are uniquely addressable.

The rest of the hardware makes this look like a bunch of RAM
"pages", seldom the size of CPU pages, and the controller
makes these pages look like contiguous RAM sitting in linear
address-space. There is lots of wasted RAM in the Intel/IBM/PC
architecture because the stuff that would be where the screen
regen buffer is, where the graphics aperture exists, where
the screen BIOS ROM is, where the BIOS ROM is (unless shadowed),
all that space is occupied by RAM that can't even be addressed.
So there are address-holes that are never accessed. This means
that there are few nice clean cache-line fills for physical RAM
below 1 megabyte.

Fortunately most accesses are above 1 megabyte. Locality of
reference helps reduce the amount of paging required, in that memory
accesses remain within a very short distance of each other in real
programs. Of course, again, you can make a memory-corrupting
program that thrashes the swap-file, but if you design to reduce
that, you destroy interactivity.

Since the days of VAXen people have been trying to improve memory
managers. So far, every improvement in special cases hurts the
general case. I remember the controversy at DEC when it was decided
to keep some zero-filled pages around (demand-zero paging). These
would be allocated as "readable". It's only when somebody would
attempt to write to them that a real page needed to be allocated.
My question was: "Why a bunch of pages?" You only need one. As
a customer, I was told that I didn't understand. So, DEC is out-
of-business and nobody does demand-zero paging because it's dumb,
damn stupid.

The same is true of "memory defragmentation".

Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 18:54 ` Alok Mooley
  2004-02-04 19:07   ` Richard B. Johnson
@ 2004-02-04 19:56   ` Dave Hansen
  2004-02-05  5:19     ` Alok Mooley
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2004-02-04 19:56 UTC (permalink / raw)
  To: Alok Mooley; +Cc: linux-kernel, linux-mm

On Wed, 2004-02-04 at 10:54, Alok Mooley wrote:
> --- Dave Hansen <haveblue@us.ibm.com> wrote:
> 
> > The "work until we get interrupted and restart if
> > something changes
> > state" approach is very, very common.  Can you give
> > some more examples
> > of just how a page fault would ruin the defrag
> > process?
> > 
> 
> What I mean to say is that if we have identified some
> pages for movement, & we get preempted, the pages
> identified as movable may not remain movable any more
> when we are rescheduled. We are left with the task of
> identifying new movable pages.

Depending on the quantity of work that you're trying to do at once, this
might be unavoidable.  

I know it's a difficult thing to think about, but I still don't
understand the precise cases that you're concerned about.  Page faults
to me seem like the least of your problems.  A bigger issue would be if
the page is written to by userspace after you copy, but before you
install the new pte.  Did I miss the code in your patch that invalidated
the old tlb entries?

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:37     ` Timothy Miller
@ 2004-02-04 19:43       ` Dave Hansen
  2004-02-04 19:59       ` Richard B. Johnson
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2004-02-04 19:43 UTC (permalink / raw)
  To: Timothy Miller; +Cc: root, Alok Mooley, linux-kernel, linux-mm

On Wed, 2004-02-04 at 11:37, Timothy Miller wrote:
> Would memory fragmentation have any appreciable impact on L2 cache line 
> collisions?
> Would defragmenting it help?

Nope.  The L2 lines are 32 or 64 bytes long, and the only unit we can
defrag in is pages which are 4k.  Since everything is aligned, a
cacheline cannot cross a page.  

> In the case of the Opteron, there is a 1M cache that is (I forget) N-way 
> set associative, and it's physically indexed.  If a bunch of pages were 
> located such that there were a disproportionately large number of lines 
> which hit the same tag, you could be thrashing the cache.
>
> There are two ways to deal with this:  (1) intelligently locate pages
> in physical memory; (2) hope that natural entropy keeps things random 
> enough that it doesn't matter.

You're talking about page coloring now.  That's a whole different debate.
I think it's been discussed here before. :)  It's good.  It's bad.  It's
good.  It's bad.  

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:07   ` Richard B. Johnson
  2004-02-04 19:18     ` Alok Mooley
  2004-02-04 19:35     ` Dave McCracken
@ 2004-02-04 19:37     ` Timothy Miller
  2004-02-04 19:43       ` Dave Hansen
  2004-02-04 19:59       ` Richard B. Johnson
  2 siblings, 2 replies; 32+ messages in thread
From: Timothy Miller @ 2004-02-04 19:37 UTC (permalink / raw)
  To: root; +Cc: Alok Mooley, Dave Hansen, linux-kernel, linux-mm



Richard B. Johnson wrote:

> If this is an Intel x86 machine, it is impossible for pages
> to get fragmented in the first place. The hardware allows any
> page, from anywhere in memory, to be concatenated into linear
> virtual address space. Even the kernel address space is virtual.
> The only time you need physically-adjacent pages is if you
> are doing DMA that is more than a page-length at a time. The
> kernel keeps a bunch of those pages around for just that
> purpose.
> 
> So, if you are making a "memory defragmenter", it is a CPU time-sink.
> That's all.

Would memory fragmentation have any appreciable impact on L2 cache line 
collisions?

Would defragmenting it help?

In the case of the Opteron, there is a 1M cache that is (I forget) N-way 
set associative, and it's physically indexed.  If a bunch of pages were 
located such that there were a disproportionately large number of lines 
which hit the same tag, you could be thrashing the cache.

There are two ways to deal with this:  (1) intelligently locate pages
in physical memory; (2) hope that natural entropy keeps things random 
enough that it doesn't matter.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:07   ` Richard B. Johnson
  2004-02-04 19:18     ` Alok Mooley
@ 2004-02-04 19:35     ` Dave McCracken
  2004-02-04 21:59       ` Timothy Miller
  2004-02-04 19:37     ` Timothy Miller
  2 siblings, 1 reply; 32+ messages in thread
From: Dave McCracken @ 2004-02-04 19:35 UTC (permalink / raw)
  To: root; +Cc: linux-kernel, linux-mm


--On Wednesday, February 04, 2004 14:07:52 -0500 "Richard B. Johnson"
<root@chaos.analogic.com> wrote:

> If this is an Intel x86 machine, it is impossible for pages
> to get fragmented in the first place. The hardware allows any
> page, from anywhere in memory, to be concatenated into linear
> virtual address space. Even the kernel address space is virtual.
> The only time you need physically-adjacent pages is if you
> are doing DMA that is more than a page-length at a time. The
> kernel keeps a bunch of those pages around for just that
> purpose.
> 
> So, if you are making a "memory defragmenter", it is a CPU time-sink.

Um, wrong answer.  When you ask for more than one page from the buddy
allocator  (order greater than 0) it always returns physically contiguous
pages.

Also, one of the near-term goals in VM is to be able to allocate and free
large pages from the main memory pools, which requires that something like
order 9 or 10 allocations (based on the architecture) succeed.
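
For scale, an order-10 request asks the buddy allocator for 1024
contiguous, naturally aligned 4k pages in a single call, one x86 large
page's worth. A sketch of the request shape (illustrative, not a proposed
patch):

#include <linux/mm.h>

static struct page *alloc_one_large_page(unsigned int order)
{
	/*
	 * order 10 = 1024 pages = 4 MB (order 9 = 2 MB with PAE).  Every
	 * page in the block must be free at the same time, so external
	 * fragmentation makes this fail long before memory is short.
	 */
	return alloc_pages(GFP_KERNEL, order);	/* NULL on failure */
}

static void free_one_large_page(struct page *page, unsigned int order)
{
	__free_pages(page, order);
}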

Dave McCracken


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:18     ` Alok Mooley
@ 2004-02-04 19:33       ` Richard B. Johnson
  2004-02-05  5:07         ` Alok Mooley
  2004-02-05 19:03         ` Pavel Machek
  0 siblings, 2 replies; 32+ messages in thread
From: Richard B. Johnson @ 2004-02-04 19:33 UTC (permalink / raw)
  To: Alok Mooley; +Cc: linux-kernel, linux-mm, Dave Hansen

On Wed, 4 Feb 2004, Alok Mooley wrote:

> --- "Richard B. Johnson" <root@chaos.analogic.com> >
> If this is an Intel x86 machine, it is impossible
> > for pages
> > to get fragmented in the first place. The hardware
> > allows any
> > page, from anywhere in memory, to be concatenated
> > into linear
> > virtual address space. Even the kernel address space
> > is virtual.
> > The only time you need physically-adjacent pages is
> > if you
> > are doing DMA that is more than a page-length at a
> > time. The
> > kernel keeps a bunch of those pages around for just
> > that
> > purpose.
> >
> > So, if you are making a "memory defragmenter", it is
> > a CPU time-sink.
> > That's all.
>
> What if the external fragmentation increases so much
> that it is not possible to find a large sized block?
> Then, is it not better to defragment rather than swap
> or fail?
>
> -Alok

All "blocks" are the same size, i.e., PAGE_SIZE. When RAM
is tight the content of a page is written to the swap-file
according to a least-recently-used protocol. This frees
a page. Pages are allocated to a process only one page at
a time. This prevents some hog from grabbing all the memory
in the machine. Memory allocation and physical page allocation
are two different things; I can malloc() a gigabyte of RAM on
a machine. It only gets allocated when an attempt is made
to access a page.
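
That last point is easy to watch from userspace: the malloc() itself is
nearly free, and physical pages appear only as each page is first
touched. An illustrative program (whether a 1 GB malloc() succeeds at all
depends on the machine and its overcommit settings):

#include <stdio.h>
#include <stdlib.h>

#define ONE_GB (1024UL * 1024 * 1024)
#define PAGE_SZ 4096UL

int main(void)
{
	char *p = malloc(ONE_GB);	/* reserves address space, not RAM */
	unsigned long i;

	if (!p) {
		fprintf(stderr, "malloc failed\n");
		return 1;
	}
	/*
	 * Each first write faults in one page; watch VmRSS grow in
	 * /proc/self/status while this loop runs.
	 */
	for (i = 0; i < ONE_GB; i += PAGE_SZ)
		p[i] = 1;
	free(p);
	return 0;
}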

Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 19:07   ` Richard B. Johnson
@ 2004-02-04 19:18     ` Alok Mooley
  2004-02-04 19:33       ` Richard B. Johnson
  2004-02-04 19:35     ` Dave McCracken
  2004-02-04 19:37     ` Timothy Miller
  2 siblings, 1 reply; 32+ messages in thread
From: Alok Mooley @ 2004-02-04 19:18 UTC (permalink / raw)
  To: root; +Cc: linux-kernel, linux-mm, Dave Hansen

--- "Richard B. Johnson" <root@chaos.analogic.com> >
If this is an Intel x86 machine, it is impossible
> for pages
> to get fragmented in the first place. The hardware
> allows any
> page, from anywhere in memory, to be concatenated
> into linear
> virtual address space. Even the kernel address space
> is virtual.
> The only time you need physically-adjacent pages is
> if you
> are doing DMA that is more than a page-length at a
> time. The
> kernel keeps a bunch of those pages around for just
> that
> purpose.
> 
> So, if you are making a "memory defragmenter", it is
> a CPU time-sink.
> That's all.

What if the external fragmentation increases so much
that it is not possible to find a large sized block?
Then, is it not better to defragment rather than swap
or fail?

-Alok


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04 18:54 ` Alok Mooley
@ 2004-02-04 19:07   ` Richard B. Johnson
  2004-02-04 19:18     ` Alok Mooley
                       ` (2 more replies)
  2004-02-04 19:56   ` Dave Hansen
  1 sibling, 3 replies; 32+ messages in thread
From: Richard B. Johnson @ 2004-02-04 19:07 UTC (permalink / raw)
  To: Alok Mooley; +Cc: Dave Hansen, linux-kernel, linux-mm

On Wed, 4 Feb 2004, Alok Mooley wrote:

>
> --- Dave Hansen <haveblue@us.ibm.com> wrote:
>
> > The "work until we get interrupted and restart if
> > something changes
> > state" approach is very, very common.  Can you give
> > some more examples
> > of just how a page fault would ruin the defrag
> > process?
> >
>
> What I mean to say is that if we have identified some
> pages for movement, & we get preempted, the pages
> identified as movable may not remain movable any more
> when we are rescheduled. We are left with the task of
> identifying new movable pages.
>
> -Alok
>

If this is an Intel x86 machine, it is impossible for pages
to get fragmented in the first place. The hardware allows any
page, from anywhere in memory, to be concatenated into linear
virtual address space. Even the kernel address space is virtual.
The only time you need physically-adjacent pages is if you
are doing DMA that is more than a page-length at a time. The
kernel keeps a bunch of those pages around for just that
purpose.

So, if you are making a "memory defragmenter", it is a CPU time-sink.
That's all.
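
The multi-page DMA case is the usual consumer of physically adjacent
pages; a driver-side sketch using the generic 2.6 DMA API (the 64 KB size
is arbitrary, and 2.4 drivers used pci_alloc_consistent() for the same
job):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * Illustrative: a device without scatter-gather needs its DMA buffer
 * to be physically contiguous, because it addresses the bus directly
 * and never sees the CPU's page tables.
 */
static void *alloc_dma_buffer(struct device *dev, dma_addr_t *bus_addr)
{
	return dma_alloc_coherent(dev, 64 * 1024, bus_addr, GFP_KERNEL);
}

static void free_dma_buffer(struct device *dev, void *cpu_addr,
			    dma_addr_t bus_addr)
{
	dma_free_coherent(dev, 64 * 1024, cpu_addr, bus_addr);
}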

Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
       [not found] <1075920386.27981.106.camel@nighthawk>
@ 2004-02-04 18:54 ` Alok Mooley
  2004-02-04 19:07   ` Richard B. Johnson
  2004-02-04 19:56   ` Dave Hansen
  0 siblings, 2 replies; 32+ messages in thread
From: Alok Mooley @ 2004-02-04 18:54 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm


--- Dave Hansen <haveblue@us.ibm.com> wrote:

> The "work until we get interrupted and restart if
> something changes
> state" approach is very, very common.  Can you give
> some more examples
> of just how a page fault would ruin the defrag
> process?
> 

What I mean to say is that if we have identified some
pages for movement, & we get preempted, the pages
identified as movable may not remain movable any more
when we are rescheduled. We are left with the task of
identifying new movable pages.
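
One conventional way around that, rather than preventing preemption, is
to treat the scanned list purely as a hint and re-verify each page right
before moving it; is_movable() and move_page() below stand in for the
module's own helpers:

#include <linux/mm.h>
#include <linux/sched.h>

/* Stand-ins for the module's own helpers (hypothetical). */
extern int is_movable(struct page *page);
extern int move_page(struct page *page);

static int defrag_pass(struct page **candidates, int nr)
{
	int i, moved = 0;

	for (i = 0; i < nr; i++) {
		struct page *page = candidates[i];

		/* The earlier scan is only a hint; recheck before acting. */
		if (!is_movable(page))
			continue;
		if (move_page(page) == 0)
			moved++;

		cond_resched();	/* being preempted here is now harmless */
	}
	return moved;
}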

-Alok


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:57       ` IWAMOTO Toshihiro
  2004-02-04  7:10         ` Dave Hansen
@ 2004-02-04 10:33         ` Hirokazu Takahashi
  1 sibling, 0 replies; 32+ messages in thread
From: Hirokazu Takahashi @ 2004-02-04 10:33 UTC (permalink / raw)
  To: iwamoto; +Cc: haveblue, rangdi, linux-kernel, linux-mm, mbligh

Hello,

> > Moving file-backed pages is mostly handled already.  You can do a
> > regular page-cache lookup with find_get_page(), make your copy,
> > invalidate the old one, then readd the new one.  The invalidation can be
> > done in the same style as shrink_list().
> 
> Actually, it is a bit more complicated.
> I have implemented similar functionality for memory hotremoval.
> 
> See my post about memory hotremoval
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107354781130941&w=2
> for details.
> remap_onepage() and remapd() in the patch are the main functions.

My patch may be one of the samples.
To allocate continuous pages on demand, I used remap_onepage() to
defragment pages.

http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/0045.html

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  7:17               ` Martin J. Bligh
@ 2004-02-04  8:30                 ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2004-02-04  8:30 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: haveblue, rangdi, linux-kernel, linux-mm

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
>  As long as we make sure the process doesn't run during the move, I don't
>  see why it'd be a problem. But I am less than convinced that rmap will
>  lead us back from the PTE page to the mm, at least w/o modification.

ptep_to_mm() at your service.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  7:10         ` Dave Hansen
@ 2004-02-04  7:50           ` IWAMOTO Toshihiro
  0 siblings, 0 replies; 32+ messages in thread
From: IWAMOTO Toshihiro @ 2004-02-04  7:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: IWAMOTO Toshihiro, Alok Mooley, Linux Kernel Mailing List,
	linux-mm, Martin J. Bligh

At 03 Feb 2004 23:10:52 -0800,
Dave Hansen wrote:
> remap_onepage() is quite a function.  300 lines.  It sure does cover a
> lot of ground. :)
> 
> Defragmentation is a bit easier than removal because it isn't as
> mandatory.  Instead of having to worry about waiting on things like
> writeback, the defrag code can just bail.  

Waiting for !PageWriteback and writing back the page at the beginning of
remap_onepage() are a sort of "easy part".
We need to wait for exclusive access to a page before copying anyway,
and interesting things such as vmtruncate can happen while waiting for
it.

I don't think the code can be much shorter without assuming a single
processor !CONFIG_PREEMPT system.

--
IWAMOTO Toshihiro

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:40             ` Dave Hansen
@ 2004-02-04  7:17               ` Martin J. Bligh
  2004-02-04  8:30                 ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Martin J. Bligh @ 2004-02-04  7:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm

>> There are a couple of special cases that might be feasible without making
>> an ungodly mess. PTE pages spring to mind (particularly as they can be
>> in highmem too). They should be reasonably easy to move (assuming we can
>> use rmap to track them back to the process they belong to to lock them ...
>> hmmm ....)
> 
> We don't do any pte page reclaim at any time other than process exit and
> there are plenty of pte pages we can just plain free anyway.  Anything
> that's completely mapping page cache, for instance.
> 
> In the replacement case, taking mm->page_table_lock, doing the copy, and
> replacing the pointer from the pmd should be all that it takes.  But, I
> wonder if we could miss any sets of the pte dirty bit this way...

As long as we make sure the process doesn't run during the move, I don't
see why it'd be a problem. But I am less than convinced that rmap will
lead us back from the PTE page to the mm, at least w/o modification.

M.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:57       ` IWAMOTO Toshihiro
@ 2004-02-04  7:10         ` Dave Hansen
  2004-02-04  7:50           ` IWAMOTO Toshihiro
  2004-02-04 10:33         ` Hirokazu Takahashi
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2004-02-04  7:10 UTC (permalink / raw)
  To: IWAMOTO Toshihiro
  Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm, Martin J. Bligh

On Tue, 2004-02-03 at 22:57, IWAMOTO Toshihiro wrote:
> At 03 Feb 2004 21:54:34 -0800,
> Dave Hansen wrote:
> > Moving file-backed pages is mostly handled already.  You can do a
> > regular page-cache lookup with find_get_page(), make your copy,
> > invalidate the old one, then readd the new one.  The invalidation can be
> > done in the same style as shrink_list().
> 
> Actually, it is a bit more complicated.
> I have implemented similar functionality for memory hotremoval.
> 
> See my post about memory hotremoval
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107354781130941&w=2
> for details.
> remap_onepage() and remapd() in the patch are the main functions.

remap_onepage() is quite a function.  300 lines.  It sure does cover a
lot of ground. :)

Defragmentation is a bit easier than removal because it isn't as
mandatory.  Instead of having to worry about waiting on things like
writeback, the defrag code can just bail.  

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  5:54     ` Dave Hansen
  2004-02-04  6:05       ` Martin J. Bligh
@ 2004-02-04  6:57       ` IWAMOTO Toshihiro
  2004-02-04  7:10         ` Dave Hansen
  2004-02-04 10:33         ` Hirokazu Takahashi
  1 sibling, 2 replies; 32+ messages in thread
From: IWAMOTO Toshihiro @ 2004-02-04  6:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm, Martin J. Bligh

At 03 Feb 2004 21:54:34 -0800,
Dave Hansen wrote:
> Moving file-backed pages is mostly handled already.  You can do a
> regular page-cache lookup with find_get_page(), make your copy,
> invalidate the old one, then readd the new one.  The invalidation can be
> done in the same style as shrink_list().

Actually, it is a bit more complicated.
I have implemented similar functionality for memory hotremoval.

See my post about memory hotremoval
http://marc.theaimsgroup.com/?l=linux-kernel&m=107354781130941&w=2
for details.
remap_onepage() and remapd() in the patch are the main functions.

--
IWAMOTO Toshihiro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:29           ` Martin J. Bligh
@ 2004-02-04  6:40             ` Dave Hansen
  2004-02-04  7:17               ` Martin J. Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2004-02-04  6:40 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm

On Tue, 2004-02-03 at 22:29, Martin J. Bligh wrote:
> There are a couple of special cases that might be feasible without making
> an ungodly mess. PTE pages spring to mind (particularly as they can be
> in highmem too). They should be reasonably easy to move (assuming we can
> use rmap to track them back to the process they belong to to lock them ...
> hmmm ....)

We don't do any pte page reclaim at any time other than process exit and
there are plenty of pte pages we can just plain free anyway.  Anything
that's completely mapping page cache, for instance.

In the replacement case, taking mm->page_table_lock, doing the copy, and
replacing the pointer from the pmd should be all that it takes.  But, I
wonder if we could miss any sets of the pte dirty bit this way...
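
As a sketch, that replacement might look like the following (accessor
names vary a little between 2.6 releases; the dirty-bit race mentioned
above is noted in the comment, not solved):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/*
 * Illustrative: replace the pte page 'old' with a copy 'new' by
 * swinging the pmd entry under page_table_lock.  Hardware dirty or
 * accessed bits set in 'old' between the copy and the pmd switch
 * would still be lost, which is the open question above.
 */
static void replace_pte_page(struct mm_struct *mm, pmd_t *pmd,
			     struct page *old, struct page *new)
{
	spin_lock(&mm->page_table_lock);

	copy_highpage(new, old);	/* duplicate every pte in the page */
	pmd_populate(mm, pmd, new);	/* point the pmd at the copy */
	flush_tlb_mm(mm);		/* drop translations cached from 'old' */

	spin_unlock(&mm->page_table_lock);
}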

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:22         ` Dave Hansen
@ 2004-02-04  6:29           ` Martin J. Bligh
  2004-02-04  6:40             ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Martin J. Bligh @ 2004-02-04  6:29 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm

--Dave Hansen <haveblue@us.ibm.com> wrote (on Tuesday, February 03, 2004 22:22:36 -0800):

> On Tue, 2004-02-03 at 22:05, Martin J. Bligh wrote:
>> >> In order to move such pages, we will have to patch macros like
>> >> "virt_to_phys" & other related macros, so that the address 
>> >> translation for pages moved by us will take place vmalloc style, i.e.,
>> >> via page tables, instead of direct +-3GB. Is it worth introducing such
>> >> an overhead for address translation (vmalloc does it!)? If no, then is
>> >> there another way out, or is it better to stick to our current
>> >> definition of a movable page? 
>> > 
>> > Low memory kernel pages are a much bigger deal to defrag.  I've started
>> > to think about these for hotplug memory and it just makes my head hurt. 
>> > If you want to do this, you are right, you'll have to alter virt_to_phys
>> > and company.  The best way I've seen this is with CONFIG_NONLINEAR:
>> > http://lwn.net/2002/0411/a/discontig.php3
>> > Those lookup tables are pretty fast, and have benefits to many areas
>> > beyond defragmentation like NUMA and the memory hotplug projects.  
>> 
>> I don't think that helps you really - the mappings are usually done on
>> chunks significantly larger than one page, and we don't want to break
>> away from using large pages for the kernel mappings.
> 
> Yeah, you're right about that one.  The defrag thing needs to deal with
> much smaller sections than nonlinear does.  That pretty much leaves
> being careful where you allocate.

There are a couple of special cases that might be feasible without making
an ungodly mess. PTE pages spring to mind (particularly as they can be
in highmem too). They should be reasonably easy to move (assuming we can
use rmap to track them back to the process they belong to to lock them ...
hmmm ....)

M.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  6:05       ` Martin J. Bligh
@ 2004-02-04  6:22         ` Dave Hansen
  2004-02-04  6:29           ` Martin J. Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2004-02-04  6:22 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Alok Mooley, Linux Kernel Mailing List, linux-mm

On Tue, 2004-02-03 at 22:05, Martin J. Bligh wrote:
> >> In order to move such pages, we will have to patch macros like
> >> "virt_to_phys" & other related macros, so that the address 
> >> translation for pages moved by us will take place vmalloc style, i.e.,
> >> via page tables, instead of direct +-3GB. Is it worth introducing such
> >> an overhead for address translation (vmalloc does it!)? If no, then is
> >> there another way out, or is it better to stick to our current
> >> definition of a movable page? 
> > 
> > Low memory kernel pages are a much bigger deal to defrag.  I've started
> > to think about these for hotplug memory and it just makes my head hurt. 
> > If you want to do this, you are right, you'll have to alter virt_to_phys
> > and company.  The best way I've seen this is with CONFIG_NONLINEAR:
> > http://lwn.net/2002/0411/a/discontig.php3
> > Those lookup tables are pretty fast, and have benefits to many areas
> > beyond defragmentation like NUMA and the memory hotplug projects.  
> 
> I don't think that helps you really - the mappings are usually done on
> chunks signficantly larger than one page, and we don't want to break
> away from using large pages for the kernel mappings.

Yeah, you're right about that one.  The defrag thing needs to deal with
much smaller sections than nonlinear does.  That pretty much leaves
being careful where you allocate.

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  5:54     ` Dave Hansen
@ 2004-02-04  6:05       ` Martin J. Bligh
  2004-02-04  6:22         ` Dave Hansen
  2004-02-04  6:57       ` IWAMOTO Toshihiro
  1 sibling, 1 reply; 32+ messages in thread
From: Martin J. Bligh @ 2004-02-04  6:05 UTC (permalink / raw)
  To: Dave Hansen, Alok Mooley; +Cc: Linux Kernel Mailing List, linux-mm

>> In order to move such pages, we will have to patch macros like
>> "virt_to_phys" & other related macros, so that the address 
>> translation for pages moved by us will take place vmalloc style, i.e.,
>> via page tables, instead of direct +-3GB. Is it worth introducing such
>> an overhead for address translation (vmalloc does it!)? If no, then is
>> there another way out, or is it better to stick to our current
>> definition of a movable page? 
> 
> Low memory kernel pages are a much bigger deal to defrag.  I've started
> to think about these for hotplug memory and it just makes my head hurt. 
> If you want to do this, you are right, you'll have to alter virt_to_phys
> and company.  The best way I've seen this is with CONFIG_NONLINEAR:
> http://lwn.net/2002/0411/a/discontig.php3
> Those lookup tables are pretty fast, and have benefits to many areas
> beyond defragmentation like NUMA and the memory hotplug projects.  

I don't think that helps you really - the mappings are usually done on
chunks significantly larger than one page, and we don't want to break
away from using large pages for the kernel mappings.
 
> Rather than try to defrag kernel memory now, it's probably better to
> work on schemes that keep from fragmenting memory in the first place. 

Absolutely. Kernel pages are really hard (not every lowmem page is a
kernel page, of course).

>> Identifying pages moved by us may involve introducing a new page-flag. 
>> A new page-flag for per-cpu pages would be great, since we have to 
>> traverse the per-cpu hot & cold lists in order to identify if a page 
>> is on the pcp lists. 

Careful not to introduce new cacheline touches, etc whilst doing this.
The whole point of hot & cold pages is to be efficient.

If you don't need N kilobyte alignment on your N kilobyte page groups,
there are probably much more effective schemes than the buddy allocator, but
that assumption may be too embedded to change.

> If the per-cpu allocator caches are your only problem, I don't see why
> we can't just flush them out when you're doing your operation.  Plus,
> they aren't *that* big, so you could pretty easily go scanning them. 
> Martin, can we just flush out and turn off the per-cpu hot/cold lists
> for the defrag period?

Yup, should be fairly easy to do. Just free them back with the standard
mechanisms.
 
>> As of now, we have adopted a failure based approach, i.e, we
>> defragment only when a higher order allocation failure has taken place
>> (just before kswapd starts swapping).  We now want to defragment based
>> on thresholds kept for each allocation order.  Instead of a daemon
>> kicking in on a threshold  violation (as proposed by Mr. Daniel
>> Phillips), we intend to capture idle cpu cycles by inserting a new
>> process just above the idle process.  
> 
> I think I'd agree with Dan on that one.  When kswapd is going, it's
> pretty much too late.  The daemon approach would be more flexible, allow
> you to start earlier, and more easily have various levels of
> aggressiveness.

I think the policy we've taken so far is that you can't *urgently* request
large contig areas. If you need that, you should be keeping your own cache.

M.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  5:09   ` Alok Mooley
  2004-02-04  5:24     ` Mike Fedyk
@ 2004-02-04  5:54     ` Dave Hansen
  2004-02-04  6:05       ` Martin J. Bligh
  2004-02-04  6:57       ` IWAMOTO Toshihiro
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2004-02-04  5:54 UTC (permalink / raw)
  To: Alok Mooley; +Cc: Linux Kernel Mailing List, linux-mm, Martin J. Bligh

On Tue, 2004-02-03 at 21:09, Alok Mooley wrote:
>  Yes we have cut & pasted some code from there. But,
> try_to_unmap_one also does lots of other stuff like
>            /*
>             * Store the swap location in the pte.
>             * See handle_pte_fault() ...
>             */
> which we don't want. Hence we use a separate function.

If that function is fairly close, you might want to look at modifying it
in a way that will make it work for you, but maintain the current
semantics for current users.  There seems to be a lot more in common
than different there.  Think long and hard about things that you don't
think you need.  Will it *hurt*, or is it just a bit superfluous?

> Could you also please comment & advise us on our
> problems which are as below: -
> 
> We want to broaden our definition of a movable page, & consider kernel
> pages & file-backed pages also for movement (currently we consider
> only userspace anonymous pages as movable). Do file-backed pages also
> obey the 3GB rule?

The 3GB rule?  File-backed pages are referenced via the page cache, which
can store arbitrary pages; they don't have to be in low memory.

Moving file-backed pages is mostly handled already.  You can do a
regular page-cache lookup with find_get_page(), make your copy,
invalidate the old one, then readd the new one.  The invalidation can be
done in the same style as shrink_list().
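
Put together, that sequence is roughly the following; the reference
counting and the new page's lock state differ between 2.6 releases, so
treat this as a sketch of the idea rather than exact API usage:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

/* Illustrative: replace the page-cache page at 'index' with a copy. */
static int move_pagecache_page(struct address_space *mapping,
			       unsigned long index)
{
	struct page *old, *new;
	int err;

	old = find_get_page(mapping, index);	/* regular page-cache lookup */
	if (!old)
		return -ENOENT;

	new = alloc_page(GFP_KERNEL);
	if (!new) {
		page_cache_release(old);
		return -ENOMEM;
	}

	lock_page(old);			/* keep the contents stable */
	copy_highpage(new, old);

	remove_from_page_cache(old);	/* invalidate the old copy */
	err = add_to_page_cache(new, mapping, index, GFP_KERNEL);

	unlock_page(old);
	page_cache_release(old);
	return err;
}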

> In order to move such pages, we will have to patch macros like
> "virt_to_phys" & other related macros, so that the address 
> translation for pages moved by us will take place vmalloc style, i.e.,
> via page tables, instead of direct +-3GB. Is it worth introducing such
> an overhead for address translation (vmalloc does it!)? If no, then is
> there another way out, or is it better to stick to our current
> definition of a movable page? 

Low memory kernel pages are a much bigger deal to defrag.  I've started
to think about these for hotplug memory and it just makes my head hurt. 
If you want to do this, you are right, you'll have to alter virt_to_phys
and company.  The best way I've seen this is with CONFIG_NONLINEAR:
http://lwn.net/2002/0411/a/discontig.php3
Those lookup tables are pretty fast, and have benefits to many areas
beyond defragmentation like NUMA and the memory hotplug projects.  

Rather than try to defrag kernel memory now, it's probably better to
work on schemes that keep from fragmenting memory in the first place. 
What kind of situations are causing you the most fragmentation?

We've thought about having many more allocator pools around to help ease
fragmentation.  I started to code one up that would allocate some higher
order pages for each struct address_space and hand little pages out from
that pool instead of going back to the buddy allocator.  That way, when
you go to free a file's page cache, you get some pretty large contiguous
areas instead of a bunch of scattered 0-order pages.
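
A toy version of that idea, only to make the shape concrete; the
structure and helpers are hypothetical, and the reference-count fix-up
needed when handing out pieces of a high-order block (the same fix-up the
buddy allocator does when it splits one) is deliberately left out:

#include <linux/mm.h>

#define POOL_ORDER 4	/* carve 0-order pages out of 64 KB blocks */

struct as_page_pool {
	struct page *base;	/* current order-POOL_ORDER block */
	unsigned int used;	/* pages already handed out from it */
};

static struct page *pool_alloc_page(struct as_page_pool *pool)
{
	if (!pool->base || pool->used == (1U << POOL_ORDER)) {
		pool->base = alloc_pages(GFP_KERNEL, POOL_ORDER);
		pool->used = 0;
	}
	if (!pool->base)
		return alloc_page(GFP_KERNEL);	/* fall back to the buddy */

	/*
	 * Pages for this address_space come from the same block, so
	 * dropping the file's cache later frees one contiguous 64 KB
	 * region instead of sixteen scattered pages.
	 */
	return pool->base + pool->used++;
}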

> Identifying pages moved by us may involve introducing a new page-flag. 
> A new page-flag for per-cpu pages would be great, since we have to 
> traverse the per-cpu hot & cold lists in order to identify if a page 
> is on the pcp lists. 

We already have pretty sparse utilization of the current page->flags. 
See some of the work that Bill Irwin (wli@holomorphy.com) has done in
the past.  But, I'm not sure you really even need to go there.

If the per-cpu allocator caches are your only problem, I don't see why
we can't just flush them out when you're doing your operation.  Plus,
they aren't *that* big, so you could pretty easily go scanning them. 
Martin, can we just flush out and turn off the per-cpu hot/cold lists
for the defrag period?

This is part of a larger problem that we'll see with memory removal as
well.  Given any random page in the system, we'll need to be able to see
what's being done with it.  There are a lot of things that do an
alloc_page() and the VM never sees the page again.  We'll eventually need
ways to account for these and possibly have mechanisms to get some of
them back.

This might involve having some new allocator flags for things that can
allow themselves to be moved, with the default being for things that
either refuse to be moved, or don't know if they can be.
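
In sketch form, such a flag would just tag allocations at the call site
so a defragmenter (or memory hot-remove code) knows which pages it is
allowed to touch; the name and value below are made up, not an existing
gfp bit:

#include <linux/mm.h>

/* Hypothetical gfp bit: "this page may be relocated later". */
#define __GFP_RELOCATABLE	0x100000u

static struct page *alloc_relocatable_page(void)
{
	/*
	 * Callers that can tolerate their page moving say so explicitly;
	 * everything else defaults to "do not touch".
	 */
	return alloc_pages(GFP_KERNEL | __GFP_RELOCATABLE, 0);
}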

> As of now, we have adopted a failure based approach, i.e, we
> defragment only when a higher order allocation failure has taken place
> (just before kswapd starts swapping).  We now want to defragment based
> on thresholds kept for each allocation order.  Instead of a daemon
> kicking in on a threshold  violation (as proposed by Mr. Daniel
> Phillips), we intend to capture idle cpu cycles by inserting a new
> process just above the idle process.  

I think I'd agree with Dan on that one.  When kswapd is going, it's
pretty much too late.  The daemon approach would be more flexible, allow
you to start earlier, and more easily have various levels of
aggressiveness.

> Now, when we are scheduled, we are sure that the cpu is idle, & this
> is when we check for threshold violation & defragment.  One problem
> with this would be when to reschedule ourselves (allow our
> preemption)?  We do not want the memory state to change beneath us,
> so right now we are not allowing our preemption.

It's a real luxury if the state doesn't change underneath you.  It's
usually worthwhile to try and do it without locking too many things
down.  Take the page cache, for instance.  It does a lot of its work
without locks, and has all of the error handling necessary when things
collide during a duplicate addition or go away from underneath you. 
It's a good example of some really capable code that doesn't require a
static state to work properly.

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-04  5:09   ` Alok Mooley
@ 2004-02-04  5:24     ` Mike Fedyk
  2004-02-04  5:54     ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Mike Fedyk @ 2004-02-04  5:24 UTC (permalink / raw)
  To: Alok Mooley; +Cc: Dave Hansen, linux-kernel, linux-mm

Alok Mooley wrote:

> The regular buddy freeing function also increases the
> number of free pages. Since we are not actually
> freeing pages (we are just moving them), we do not
> want the original freeing function. But then we could
> decrease the number of free pages by the same number &
> use the buddy freeing function. Will do. Thanks.

Then you need to split the parts you want out into sub-functions and
call them from the previous callers as well as from your new use...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-03 21:26 ` Dave Hansen
  2004-02-03 22:26   ` Martin J. Bligh
@ 2004-02-04  5:09   ` Alok Mooley
  2004-02-04  5:24     ` Mike Fedyk
  2004-02-04  5:54     ` Dave Hansen
  1 sibling, 2 replies; 32+ messages in thread
From: Alok Mooley @ 2004-02-04  5:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm


--- Dave Hansen <haveblue@us.ibm.com> wrote:

> 
> On Mon, 2004-02-02 at 20:46, Alok Mooley wrote:
> > 	/*
> > 	 * See that page is not file backed & flags are as desired.
> > 	 * PG_lru , PG_direct (currently) must be set
> > 	 */
> > 	if(!page->mapping && (flags==0x10020 || flags==0x10024 ||
> > 	    flags==0x10060 || flags==0x10064) && page_count(page))
> > 		return 1;
> > 	/*Page is not movable*/
> > 	return 0;
> 
> Why are these flags hard-coded?  

Will use the macros. Thanks.

> 
> > static void update_to_ptes(struct page *to)
> > ...
> >         pgprot.pgprot = (pte->pte_low) & (PAGE_SIZE-1);
> 
> You probably want to wrap this up in a macro for
> each arch.  I couldn't
> find an analogous one that exists.
> 
> > static void free_allocated_page(struct page *page)
> 
> Is this really necessary?  Why don't the regular
> buddy allocator
> functions work?
> 
 The regular buddy freeing function also increases the
number of free pages. Since we are not actually
freeing pages (we are just moving them), we do not
want the original freeing function. But then we could 
decrease the number of free pages by the same number &
use the buddy freeing function. Will do. Thanks.

> > /*
> > * Flush the cache and tlb entries corresponding to the pte for the
> > * 'from' page
> > * Flush the tlb entries corresponding to the given page and not the
> > * whole tlb.
> > */
> > static void flush_from_caches(pte_addr_t paddr)
> > {
> 
> How does this function compare to
> try_to_unmap_one()?  Can you use that
> instead?  It looked like you cut and pasted some
> code from there.

 Yes we have cut & pasted some code from there. But,
try_to_unmap_one also does lots of other stuff like
           /*
            * Store the swap location in the pte.
            * See handle_pte_fault() ...
            */
which we don't want. Hence we use a separate function.

> 
> I'll stop there for now.  There seems to be a lot of
> code in the file
> that's one-off from current kernel code.  I think a
> close examination of
> currently available kernel functions could drastically
> reduce the size of
> your module.  
Will try to reduce the code. Thanks.


Could you also please comment & advise us on our
problems, which are as below:

We want to broaden our definition of a movable page, & consider kernel
pages & file-backed pages also for movement (currently we consider only
userspace anonymous pages as movable). Do file-backed pages also obey
the 3GB rule? In order to move such pages, we will have to patch macros
like "virt_to_phys" & other related macros, so that the address
translation for pages moved by us will take place vmalloc style, i.e.,
via page tables, instead of direct +-3GB. Is it worth introducing such
an overhead for address translation (vmalloc does it!)? If no, then is
there another way out, or is it better to stick to our current
definition of a movable page?

Identifying pages moved by us may involve introducing a new page-flag.
A new page-flag for per-cpu pages would be great, since we have to
traverse the per-cpu hot & cold lists in order to identify if a page is
on the pcp lists.

As of now, we have adopted a failure-based approach, i.e., we
defragment only when a higher-order allocation failure has taken place
(just before kswapd starts swapping). We now want to defragment based
on thresholds kept for each allocation order. Instead of a daemon
kicking in on a threshold violation (as proposed by Mr. Daniel
Phillips), we intend to capture idle CPU cycles by inserting a new
process just above the idle process. Now, when we are scheduled, we
are sure that the CPU is idle, & this is when we check for threshold
violations & defragment. One problem with this would be when to
reschedule ourselves (allow our preemption)? We do not want the memory
state to change beneath us, so right now we are not allowing our
preemption. This will of course hog the CPU, & we may not be able to
reschedule just by checking the need_resched flag. Any advice or
suggestions regarding this problem? Also, will the idle-CPU approach
be better, or will the daemon-based approach be better?

We will be thankful for any suggestions, advice & comments.

Thanking you,
-Alok & project group





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-03 21:26 ` Dave Hansen
@ 2004-02-03 22:26   ` Martin J. Bligh
  2004-02-04  5:09   ` Alok Mooley
  1 sibling, 0 replies; 32+ messages in thread
From: Martin J. Bligh @ 2004-02-03 22:26 UTC (permalink / raw)
  To: Dave Hansen, Alok Mooley; +Cc: Linux Kernel Mailing List, linux-mm

> I'll stop there for now.  There seems to be a lot of code in the file
> that's one-off from current kernel code.  I think a close examination of
> currently available kernel functions could drastically reduce the size of
> your module.  

Preferably to 0 ... this should be part of the core kernel, not a module.

M.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Active Memory Defragmentation: Our implementation & problems
  2004-02-03  4:46 Alok Mooley
@ 2004-02-03 21:26 ` Dave Hansen
  2004-02-03 22:26   ` Martin J. Bligh
  2004-02-04  5:09   ` Alok Mooley
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2004-02-03 21:26 UTC (permalink / raw)
  To: Alok Mooley; +Cc: Linux Kernel Mailing List, linux-mm

Your patch looks whitespace-munged.  You might want to check your
mailer.  Also, you should probably take a look at
Documentation/CodingStyle.  There are quite a few deviations here from
regular kernel coding standards.  You should also make an effort to
integrate these functions into the appropriate places.  For instance,
the pte functions need to go into things like pgtable.h and the buddy
allocator functions need to end up in mm/page_alloc.c.  Then you can
post your work as a patch, which is how we usually do it.


On Mon, 2004-02-02 at 20:46, Alok Mooley wrote:
> 	/*
> 	 * See that page is not file backed & flags are as desired.
> 	 * PG_lru , PG_direct (currently) must be set
> 	 */
> 	if(!page->mapping && (flags==0x10020 || flags==0x10024 ||
> 	   flags==0x10060 || flags==0x10064) && page_count(page))
> 		return 1;
> 	/*Page is not movable*/
> 	return 0;

Why are these flags hard-coded?  
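
For reference, the same test written with the symbolic page-flag helpers
would avoid the magic numbers entirely; a sketch, assuming 2.6.0's
PageLRU()/PageDirect() macros from <linux/page-flags.h> (PG_active and
PG_referenced simply stop mattering once they are no longer folded into a
literal):

	/* Sketch: equivalent movability test without hard-coded flag values. */
	static int movable_page(struct page *page)
	{
		/* anonymous (no mapping), on the LRU, direct pte pointer, and in use */
		return !page->mapping && PageLRU(page) && PageDirect(page) &&
			page_count(page);
	}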

> static void update_to_ptes(struct page *to)
> ...
>         pgprot.pgprot = (pte->pte_low) & (PAGE_SIZE-1);

You probably want to wrap this up in a macro for each arch.  I couldn't
find an analogous one that exists.  
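
One possible shape for such a wrapper (hypothetical name, i386-only
sketch; it relies on the protection bits living below PAGE_SHIFT, which is
exactly what the open-coded expression above assumes):

	/* Hypothetical per-arch helper; i386 sketch only. */
	#define amd_pte_pgprot(pte)	__pgprot(pte_val(pte) & (PAGE_SIZE - 1))

Each architecture would then supply its own definition next to its other
pte accessors.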

> static void free_allocated_page(struct page *page)
> {
> 	unsigned long page_idx,index,mask,order = 0;
> 	struct zone *zone = page_zone(page);
> 	struct page *base = zone->zone_mem_map;
> 	struct free_area *area = zone->free_area;
> 	mask = ~0;
> 
> 	page_idx = page - base;
> 
> 	/*
> 	 * The bitmap offset is given by index
> 	 */
> 	index = page_idx >> (1 + order);
> 	cnt_order = 0;
> 
> 	while (mask + (1 << (MAX_ORDER-1))) {
> 
> 		struct page *buddy1, *buddy2;
> 		if (!__test_and_change_bit(index, area->map)) {
> 			/*
> 			 * the buddy page is still allocated.
> 			 */
> 			break;
> 		}
> 
> 		/*
> 		 * Move the buddy up one level.
> 		 * This code is taking advantage of the identity:
> 		 * 	-mask = 1+~mask
> 		 */
> 		cnt_order++;
> 		buddy1 = base + (page_idx ^ -mask);
> 		buddy2 = base + page_idx;
> 		list_del(&buddy1->list);
> 		/* Mask for the immediate upper order*/
> 		mask <<= 1;
> 		area++;
> 		/*Bit offset at level n is offset at level (n-1) >> 1 */
> 		index >>= 1;
> 		page_idx &= mask;
> 	}
> 	list_add(&(base + page_idx)->list, &area->free_list);
> }

Is this really necessary?  Why don't the regular buddy allocator
functions work?
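
For comparison, the stock way to hand a single page back to the allocator
is simply the following (a sketch; it assumes the page's reference count
is set up the way the allocator expects, and note that in 2.6.0 an order-0
free goes through the per-cpu hot/cold lists rather than straight into the
buddy bitmaps, which may be what the open-coded version is trying to
sidestep):

	__free_pages(page, 0);	/* coalescing is handled by the allocator itself */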

> /*
> * Flush the cache and tlb entries corresponding to the pte for the
> * 'from' page
> * Flush the tlb entries corresponding to the given page, not the whole tlb.
> */
> static void flush_from_caches(pte_addr_t paddr)
> {

How does this function compare to try_to_unmap_one()?  Can you use that
instead?  It looked like you cut and pasted some code from there.
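
For comparison, the unmap path in 2.6.0's mm/rmap.c does roughly the
following sequence around the pte (a paraphrased fragment, not a verbatim
quote):

	flush_cache_page(vma, address);		/* make the cache coherent before touching the pte */
	pteval = ptep_get_and_clear(ptep);	/* atomically fetch and clear the pte */
	flush_tlb_page(vma, address);		/* shoot down the stale tlb entry */

The module's flush_from_caches() keeps the flushes but never clears the
pte, which appears to be the difference being asked about.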

I'll stop there for now.  There seems to be a lot of code in the file
that's one-off from current kernel code.  I think a close examination of
currently available kernel functions could drastically reduce the size of
your module.  

--dave


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Active Memory Defragmentation: Our implementation & problems
@ 2004-02-03  4:46 Alok Mooley
  2004-02-03 21:26 ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Alok Mooley @ 2004-02-03  4:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2764 bytes --]

We are a group of 4 students doing our final-year engineering project.
To this end, we are implementing Active Memory Defragmentation, based on
the paper written by Mr. Daniel Phillips titled "My research agenda for
2.7". We are doing this on Linux kernel 2.6.0, & we have written a module
to that effect. The module is attached to this mail.

Currently we are considering only userspace anonymous pages for movement.
So far we have not considered kernel pages for movement, due to the 3GB
rule. Since userspace allocations happen via page faults, we consider only
single (order-0) pages for movement. Thus, only pages which are not
file-backed & are on the LRU lists are currently considered movable.

We now want to broaden our definition of a movable page, & consider kernel
pages & file-backed pages also for movement. Do file-backed pages also
obey the 3GB rule? In order to move such pages, we would have to patch
macros like "virt_to_phys" & other related macros, so that address
translation for pages moved by us takes place vmalloc-style, i.e., via
page tables, instead of via the direct +-3GB offset. Is it worth
introducing such an overhead for address translation (vmalloc does it!)?
If not, is there another way out, or is it better to stick to our current
definition of a movable page? Identifying pages moved by us may involve
introducing a new page-flag. A new page-flag for per-cpu pages would also
be welcome, since we currently have to traverse the per-cpu hot & cold
lists to identify whether a page is on the pcp lists.

As of now, we have adopted a failure-based approach, i.e., we defragment
only when a higher-order allocation failure has taken place (just before
kswapd starts swapping). We now want to defragment based on thresholds
kept for each allocation order. Instead of a daemon kicking in on a
threshold violation (as proposed by Mr. Daniel Phillips), we intend to
capture idle cpu cycles by inserting a new process just above the idle
process. When we are scheduled, we then know the cpu is idle, & that is
when we check for threshold violations & defragment. One problem with
this is deciding when to reschedule ourselves (i.e., when to allow our own
preemption). We do not want the memory state to change beneath us, so
right now we are not allowing our preemption. This will of course hog the
cpu, & we may not be able to reschedule just by checking the need_resched
flag. Any advice or suggestions regarding this problem? Also, which would
be better: the idle-cpu approach or the daemon-based approach?

Any suggestions, advice & comments will be highly appreciated.

Thanking you,
-Alok



[-- Attachment #2: AMD.c --]
[-- Type: text/plain, Size: 25949 bytes --]

/*
* Active Memory Defragmentation
*/
#include<linux/asap.h>

#include"AMD.h"

MODULE_LICENSE("GPL");

/*
* This records the order up to which the freed block coalesced
* (for debugging)
*/
static int cnt_order;
/*
* This is the module init function
*/
static int amd_init(void)
{
	int order = ORDER;
        printk("\nModule: AMD loaded\n");
#ifdef CALL_FROM_HERE
	/*
	 * Personally call the module
	 */
	defrag(order,zone_table[ZONE_NORMAL]);
#else
	/*
	 * The buddy allocator calls the module just before fallback
	 */
	func_defrag = defrag;
#endif
        return 0;
}

/*
* Only userspace pages which are not file-backed are considered movable.
* The page flags currently required to be set are PG_lru and PG_direct.
* PG_active and PG_referenced are don't-cares.
* The page mapping must be NULL, i.e., the page must not be file-backed.
*/
static int movable_page(struct page *page)
{
        unsigned long flags;
        /*
	 * mask the upper 12 flag bits (top 8 bits contain the zone no.)
	 */
	flags = page->flags & FLAG_MASK;
	/*
	 * See that page is not file backed & flags are as desired.
	 * PG_lru , PG_direct (currently) must be set
	 */
	if(!page->mapping && (flags==0x10020 || flags==0x10024 ||
	   flags==0x10060 || flags==0x10064) && page_count(page))
		return 1;
	/*Page is not movable*/
	return 0;

}

/*
* This updates the PTE to point to the new page.
* The existing PTE entry is overwritten with a value that maps the new page.
* "pte" (to->pte.direct) is the PTE to be modified.
*/
static void update_to_ptes(struct page *to)
{
	pte_t *pte = to->pte.direct;
	pgprot_t pgprot;
	/*
	 * These give the protection bits
	 * This code is not architecture independent
	 */
        pgprot.pgprot = (pte->pte_low) & (PAGE_SIZE-1);
	/*
	 * The physical address of the pte is changed to that of the new page
	 */
	set_pte_atomic(pte,mk_pte(to,pgprot));
}

/*
* Frees the page pointed by 'page'
*/
static void free_allocated_page(struct page *page)
{
	unsigned long page_idx,index,mask,order = 0;
	struct zone *zone = page_zone(page);
	struct page *base = zone->zone_mem_map;
	struct free_area *area = zone->free_area;
	mask = ~0;

	page_idx = page - base;

	/*
	 * The bitmap offset is given by index
	 */
	index = page_idx >> (1 + order);
	cnt_order = 0;

	while (mask + (1 << (MAX_ORDER-1))) {

		struct page *buddy1, *buddy2;
		if (!__test_and_change_bit(index, area->map)) {
			/*
			 * the buddy page is still allocated.
			 */
			break;
		}

		/*
		 * Move the buddy up one level.
		 * This code is taking advantage of the identity:
		 * 	-mask = 1+~mask
		 */
		cnt_order++;
		buddy1 = base + (page_idx ^ -mask);
		buddy2 = base + page_idx;
		list_del(&buddy1->list);
		/* Mask for the immediate upper order*/
		mask <<= 1;
		area++;
		/*Bit offset at level n is offset at level (n-1) >> 1 */
		index >>= 1;
		page_idx &= mask;
	}
	list_add(&(base + page_idx)->list, &area->free_list);
}

/*
* Flush the cache and tlb entries corresponding to the pte for the
* 'from' page
* Flush the tlb entries corresponding to the given page, not the whole tlb.
*/
static void flush_from_caches(pte_addr_t paddr)
{
	pte_t *ptep = rmap_ptep_map(paddr);
	unsigned long address = ptep_to_address(ptep);
	struct mm_struct * mm = ptep_to_mm(ptep);
	struct vm_area_struct * vma;

	vma = find_vma(mm, address);
	if (!vma) {
		printk("\n Page vma is NULL");
		return ;
	}

	/* The page is mlock()d, cannot swap it out. */
	if (vma->vm_flags & VM_LOCKED) {
		printk("\n Page is vmlocked");
		return ;
	}

	/* Nuke the page table entry. */
	flush_cache_page(vma, address);
	flush_tlb_page(vma, address);
}

/*
* This will update the "to" & "from" struct pages
* The _to_ will become the exact replica of the _from_.
*/
static void update_struct_page(struct page *to,struct page *from,struct zone *zone)
{
	pte_addr_t pte_to_flush;

	unsigned long flag;
	/*
	 * delete "to" from the order 0 free-list
	 * If the to->private is 0 then remove it from the free_list
	 * For higher order pages they were removed during the time of
	 * selecting them for 'to_pages'.
	 */
	if(!to->private) {
		list_del(&to->list);
		__change_bit((to - zone->zone_mem_map)>>1,zone->free_area[0].map);
	}
	 /*
	  * Assign all other members as it is
	  */
	*to = *from;

	spin_lock_irqsave(&zone->lru_lock,flag);

	/*
	 * Add "to" to the LRU list
	 * Remove "from" from the LRU lists
	 */
	list_add_tail(&to->lru,&from->lru);
	list_del(&from->lru);
	spin_unlock_irqrestore(&zone->lru_lock,flag);

	/*
	 * Now update the individual entries of "from"
	 * Reset the flag dword for "from"
	 * Set the no. of references to 0
	 * In this case set the direct ptr to 0
	 * (Currently `from pages' have direct bit set)
	 */
	from->flags &= ~FLAG_MASK;
	set_page_count(from,0);
	pte_to_flush = from->pte.direct;
	from->pte.direct = NULL;

	/*
	 * Now return this page (which was previously allocated)
	 * to the free-list.
	 */
	free_allocated_page(from);
	update_to_ptes(to);
	flush_from_caches(pte_to_flush);
}
/*
* Traverse the free list of the given order
* This is done once before calling defrag and again after defragmenting
* It returns the number of free blocks of _order_ in the given _zone_
*/
static int traverse_free_list(int order,struct zone *zone)
{
	struct free_area *area = zone->free_area;
	struct list_head *ele;
	int cnt = 0;
	list_for_each(ele,&area[order].free_list) {
		cnt++;
	}
	return cnt;
}

/*
* Perform the copying, updating and freeing here
*/
static void move_page(struct page *from,struct page *to,struct zone *zone)
{
	copy_page(page_address(to),page_address(from));
	update_struct_page(to,from,zone);
}

/*
* In case sufficient no. of free pages (to) were not found, return all
* the pages to buddy.
* The pages have page->private corresponding to the order from whose free_list
* they were removed. Order 0 pages were not removed from the free_list.
* */
static void return_to_buddy(struct page **to_pages,struct zone *zone,int allocated)
{
	int count = 0;
	struct free_area *area = zone->free_area;
	struct page *base = zone->zone_mem_map;

	while(count<allocated && to_pages[count]) {
		struct page *page = to_pages[count];

		/*
		 * Return only higher order pages to buddy as order 0 pages
		 * were not ACTUALLY allocated.
		 */
		if(page->private) {
			list_add(&page->list,&area[page->private].free_list);
			__change_bit((page-base)>>(1+page->private),area[page->private].map);
			count += 1<<(page->private);
		}
		else
			count++;
	}
}

/*
* Defrag & make a block of required order
*/
static int defrag(int failed_order,struct zone *zone)
{
	unsigned long flags;
	int count,to_count,no_slots,ret_val;
        struct page *from;
        int cnt_before=0,cnt_after=0,ret=0,allocated=0;
	unsigned long *alloc_bitmap;
	struct page **to_pages;

	printk("\n\n**** Entering AMD for order %d****\n",failed_order);
	no_slots = ((1<<failed_order)*sizeof(struct page *));
	to_pages = (struct page **)kmalloc(no_slots,GFP_KERNEL);
	if(!to_pages){
		printk("\n No MEM for to_pages");
		return 0;
	}
	if(!(alloc_bitmap = (unsigned long *)kmalloc(1<<(MAX_ORDER-4),GFP_KERNEL))){
		printk("\n NO MEM FOR  alloc bitmap ");
		kfree((void *)to_pages);
		return 0;
	}
	for(count=0;count<(1<<failed_order);count++)
		to_pages[count]=NULL;

        spin_lock_irqsave(&zone->lock,flags);
	cnt_before = traverse_free_list(failed_order,zone);
	allocated = 0;

	from = get_from_block(failed_order-1,zone,alloc_bitmap);

	if(from) {
		allocated = count_alloc_pages(failed_order,alloc_bitmap);
		printk("\n Allocated = %d failed_order %d",allocated,failed_order);
		ret = find_to_pages(failed_order,zone,from,allocated,to_pages);
		if(!ret){
			return_to_buddy(to_pages,zone,allocated);
			printk("\n Didn't find to pages");
			ret_val = 0;
			goto bail_out;
		}
	} else {
		ret_val = 0;
		goto bail_out;
	}

	count = (1<<failed_order)-1;
	to_count = allocated-1;
	ret_val = 1;
	while(count>=0) {
		if(bit_test(count,alloc_bitmap)) {
			move_page(from+count,to_pages[to_count--],zone);
		}
		count--;
	}

	cnt_after = traverse_free_list(failed_order ,zone);

bail_out:
	spin_unlock_irqrestore(&zone->lock,flags);
	kfree((void *)to_pages);
	kfree((void *)alloc_bitmap);
	if(from && ret_val) {
		printk ("\n The free in order %d before = %d and after = %d",failed_order,cnt_before,cnt_after);
		printk("\n The free page gone to order %d",cnt_order);
	}
	return ret_val;
}

static void amd_exit(void)
{
        printk("\n Module: AMD removed\n");
#ifndef CALL_FROM_HERE
	func_defrag = 0;
#endif
}
module_init(amd_init);
module_exit(amd_exit);

/*
* *************************************************************************
* ******************************** TO PAGES ********************************
* *************************************************************************
*/

/*
* order = order of failure
* In the worst case all the pages in the _from_ block need to be moved
* Total number of pages in the _from_ block is 1<<(order) where
* order is the order of failure
* This function is called once the from pages are found
*/
static int count_alloc_pages(int order,unsigned long *alloc_bitmap)
{
	int count =0;
	int i;
	printk("\n Alloc bitmap\n ");
	for(i = 0;i < (1<<order);i++) {
		if(bit_test(i,alloc_bitmap)){
			printk("1");
			count++;
		}
		else
			printk("0");
	}
	return count;
}

/*
* The bitmap of _order_ is set to 1; now find out which of the buddies is free.
* We have 2 options
* 	a. Scan all the pages of a given buddy. If any one page is allocated
* 	   then it is the allocated one. We return its buddy
* 	b. Scan the free list of that order if the buddy is in the free list
* 	   it is the free buddy else its buddy is the free one.
* order is the order where the 1 was found in the bitmap at bit_offset
* if order <= LINEAR_SCAN_THRESHOLD do 'a' else 'b'
*/
static struct page *get_free_buddy(int order,int bit_offset,struct zone *zone)
{
	struct page *base = zone->zone_mem_map;
	struct free_area *area = zone->free_area;
	if(order <= LINEAR_SCAN_THRESHOLD) {
		struct page *buddy1,*buddy2,*page;
		int mask= (~0)<<order,count;
		page = buddy1 = base + (bit_offset<<(order+1));
		buddy2 = base + ((bit_offset<<(order+1))^(-mask));
		count = 1<<order;
		while(count--) {
			if(page_count(page))
				return buddy2;
			else {
				if(page_free(page))
					page++;
				else
					return buddy2;
			}
		}
		return buddy1;
	} else {
		/*
                 * Find the free block using free list
		 */
	        struct list_head *ele;
		struct page *buddy1,*buddy2;
		int mask =  (~0)<<order;
		/*
	         * buddy1 and buddy2 are start pages for the two buddies
	         */
	        buddy1 = base + (bit_offset<<(order+1));
	        buddy2 = base + ((bit_offset<<(order+1))^(-mask));
		list_for_each(ele,&area[order].free_list) {
			struct page *page = list_entry(ele,struct page,list);
			if(buddy1 == page)
				return buddy1;
			else if(buddy2 == page)
				return buddy2;
		}
	}
	return	NULL;
}

/*
* Split the block starting from 'free_buddy' of order _order_
* so that the 'need' is satisfied.
* "non_movable" & "movable" have to be the PRECISE indices where the 
corr.
* entries ended.
*/
static void split_block(int order,int *non_movable,int *movable,struct 
page 
*free_buddy,struct page**to_pages,struct zone *zone)
{
	int diff,diff_init,count;
	struct page *base = zone->zone_mem_map;
	struct page *temp;
	/*
	 * Delete the block from the free-list & toggle the bit.
	 * This is done to bring the higher order block to order 0
	 * Now see how many pages are needed and put the remaining blocks
	 * in the appropriate free lists.
	 */
	list_del(&free_buddy->list);
	__change_bit((free_buddy-base)>>(1+order),zone->free_area[order].map);

	/*
	 * a. 1<<order gives the total no of pages in the block of _order_
	 * b. movable- (non_movable+1)-- the need
	 * diff =  a - b  :: are the excess pages
	 * if (diff >0)
	 * 	free the excess pages
	 * Now the remaining pages are added to the to_pages array
	 * The start page has the page->private member with the
	 * order in whose free_list the block was
	 */
	diff_init = diff = (1<<order)-((*movable)-(*non_movable)-1);
	while(diff>0) {
		free_allocated_page(free_buddy+(1<<order)-diff);
		diff--;
	}
	/*
	 * Find the no. of spaces in to_pages to be filled
	 */
	if(diff_init<=0)
		diff_init = (1<<order);
	else
		diff_init = (1<<order)-diff_init;

	temp = free_buddy;
	/*
	 * Store the order in free_buddy->private
	 * This variable is not used for elements in the free lists
	 * Order 0 pages are removed from their free lists only when
	 * a page in the _from_ block is copied in it.
	 * For pages brought from higher order i.e split
	 * put them back in the appropriate free lists if defrag not possible
	 */
	for(count=(*non_movable)+1;count<((*non_movable)+1+diff_init);count++) {
		temp->private = order;
		to_pages[count] = temp++;
	}
	*non_movable += diff_init;
}
/*
* Find 1's in higher order 'order' except forbidden (start+no_forbidden_bits)
* Try to get 'need' no. of free pages and store page pointers in to_pages
*/
static int get_to_pages_higher(int order,struct zone *zone,struct page *start_page,int *non_movable,int *movable,struct page **to_pages,int no_forbidden_bits)
{
	struct page *base = zone->zone_mem_map;
	struct page *free_buddy;
	/*
	 * don't check these bits, since they represent _from_ block
	 */
	int forbidden_start = (start_page - base) >>(order + 1);
	int num_bits = zone->present_pages>>(order+1);
	int count = 0;
	while(count<num_bits) {
		if(count==forbidden_start) {
			count += no_forbidden_bits;
			continue;
		}
		if(bit_test(count,zone->free_area[order].map)) {
			/*
			 * a 1 is found, split it to get free order 0
			 * pages so as to satisfy need
			 */
			free_buddy = get_free_buddy(order,count,zone);
			split_block(order,non_movable,movable,free_buddy,to_pages,zone);
			if( (*movable)==(*non_movable)+1)
				return 0;
		}
		count++;
	}
	/*The need is returned*/
	return ((*movable) - (*non_movable) - 1);
}

/*
* Get the free pages from order 0 from where _from_ pages are to be dumped
* movable = allocated
* non_movable = -1;
* page->private = order for all pages selected from _order_
* Non-movable moves fill array from left and movable from the right
*/

static void get_to_pages(struct zone *zone,struct page *start_page,int allocated,struct page **to_pages,int *non_movable,int *movable,int forbidden_bits)
{
	/*
	 * This is the separate routine for scanning order-0 1's
	 */
	int num_bits = (zone->present_pages)>>1;
	int count = 0;
	struct page *base = zone->zone_mem_map;
	int forbidden_zone_start = (start_page - base)>>1;
	int mask = (~0);

	*non_movable = -1;
	*movable = allocated;

	while(count<num_bits) {
		/*
		 * Skip the forbidden zone
		 */
		if(count==forbidden_zone_start)	{
			count = count + forbidden_bits;
			continue;
		} else if(bit_test(count,zone->free_area[0].map)) {
			struct page *buddy1 = base + (count<<1);
			struct page *buddy2 = base + ((count<<1) ^ -mask);
			if(page_count(buddy1)) {
				buddy2->private = 0;
				if(movable_page(buddy1))
					to_pages[--(*movable)] = buddy2;
				else
					to_pages[++(*non_movable)] = buddy2;
			} else if(page_count(buddy2)){
				buddy1->private = 0;
				if(movable_page(buddy2))
					to_pages[--(*movable)] = buddy1;
				else
					to_pages[++(*non_movable)] = buddy1;
			}
			if( (*movable)==(*non_movable)+1)
			{
				printk("\n Alloc :%d Non_movable: %d  Movable: %d",allocated,(*non_movable)+1,allocated - (*movable));
				return;
			}
		}
		count++;
	}
}

/*
* Find out free pages where 'from' is to be moved
*  - allocated(no. of allocated pages) is counted and passed
*  - called with order = 0
*  - start_pages indicates start of forbidden block( 'from' block)
*  - to_pages stores free page pointers
*/

static int find_to_pages(int failed_order,struct zone *zone,struct page *start_page,int allocated,struct page **to_pages)
{
	int non_movable,movable;
	int need = 0,order;
	int no_forbidden_bits = 1<<(failed_order-1); //no of pages in 'from'
	/*
	 * get non-movable and movable 1's in order 0 bitmap
	 * here need = (movable - non_movable - 1)
	 * non_movable is offset where non-movable pointers end
	 */
	get_to_pages(zone,start_page,allocated,to_pages,&non_movable,&movable,no_forbidden_bits);
	/*
	 * keep finding free pages in higher orders until the need is satisfied
	 * or failure order is reached
	 */
	order = 1;
	need = (movable - non_movable - 1);
	printk("\n After order 0 need = %u\n",need);
	while(need && order<failed_order) {
		printk("\n Going to Order %d for to_pages need %d",order,need);
		no_forbidden_bits>>=1;
		need = get_to_pages_higher(order++,zone,start_page,&non_movable,&movable,to_pages,no_forbidden_bits);
	}
	printk("\n Scanned upto order %d need = %d failed_order=%d",order-1,need,failed_order);

	if(!need)
		return 1;	//successful
	printk("\n NEED NOT MET !!!");
	return 0;		//failed
}

/*
* *************************************************************************
* ******************************* FROM PAGES *******************************
* *************************************************************************
*/

/*
* While finding _from_ pages only the allocated pages are to be moved
* The allocated pages in the block to be moved are set in the bitmap at their
* corresponding offset from the start page of the block
* */
static void clear_bitmap(unsigned long *bitmap,int order)
{
	int i;
	int cnt = (1<<(MAX_ORDER-4))/sizeof(unsigned long);
	for(i=0;i<cnt;i++)
		bitmap[i] = 0;

}
/*
* The kernel's test_bit() is defined for long-based bitmaps; this is a plain
* bit test for our unsigned long alloc_bitmap.
*/
static int bit_test(int nr,unsigned long *addr)
{
	int mask;
	addr += nr >>5;
	mask = 1<<(nr & 0x1f);
	return ((*addr & mask) != 0);
}

/*
* This is the first thing done for defragmentation.
* This is the function which does the job of identifying the
* block which is to be moved to create a free block.
* (order+1) is the order whose free block is not available
*/

/*
* Get the start page of the allocated block which can be moved
* Here order+1 is the _order_ of the block to create(free)
*
* Find a 1 in the bitmap of order and find out whether the allocated
* buddy is movable or not
* alloc_bitmap: is the bitmap used to specify whether the page in the
* 		 movable block is allocated(1) or free (0)
* */
static struct page * get_from_block(int order,struct zone *zone,unsigned long *alloc_bitmap)
{
	struct page * start_page= NULL;
	unsigned long *map = zone->free_area[order].map;
	long bit_offset = 0,num_bits = zone->present_pages>>(order+1);
	for(bit_offset=0;bit_offset<num_bits;bit_offset++){
		clear_bitmap(alloc_bitmap,order+1);
		if((bit_test(bit_offset,map)==1) && (start_page = movable_block(bit_offset,order-1,zone,alloc_bitmap))) {
			return start_page;
		}
	}
	clear_bitmap(alloc_bitmap,order +1);
	printk("\n Entering scan mem_map pages");
	start_page = scan_mem_page(order+1,zone,alloc_bitmap);
	if(!start_page){
		printk("\n No movable block found of order %d",order+1);
		return NULL;
	}
	printk("\n End of _from_ routine :: start page %x",(unsigned int)start_page);
	return start_page;
}
/*
* Linearly scan the pages of a buddy given by start_page
* The buddy is of order given by order
* Finds out whether the 0 is movable in order = order
* Return value (flag):
* 0 = free
* 1 = non-movable
* 2 = don't scan next 0 of order,since this one is movable
* */
static int linear_scan_pages(struct page *page,struct page *start_page,int order,unsigned long *alloc_bitmap)
{
	int count = 1<<order,flag = 0;
	while(count--)	{
		if(page_count(page)){
			/*
			 * block is allocated
			 */
			flag = 2;
			if(movable_page(page))
				/*
				 * Set the bit corresponding to this page
				 * in alloc_bitmap
				 */
				set_bit(page-start_page,alloc_bitmap);
			else
				/*
				 * page is non-movable
				 */
				  return 1;

		}
		else {
			if(!page_free(page))
				return 1;
		}
		page++;
	}
	/*
	 * page is free/movable
	 */
	return flag;
}
/*
* bit_offset: the offset of the 1 found in (order+1) bitmap
* find out whether the 1 in order=order+1 results in a movable block
* returns the start page of the block corresponding to the 1
*/
static struct page *movable_block(long bit_offset,int order,struct zone *zone,unsigned long *alloc_bitmap)
{
	struct page *base = zone->zone_mem_map;
	struct free_area *area = zone->free_area;
	/*
	 * start_page is the page corresponding to the first page in the
	 * buddy of the order whose shortage called the above function,
	 * hence order+2 is the order which failed
	 */
	struct page *start_page = base + (bit_offset<<(order+2));
	while(order>=0)	{
		if(bit_test(bit_offset<<1,area[order].map)==1)
			bit_offset <<= 1;
		else {
			if (bit_test((bit_offset<<1)+1,area[order].map)==1)
				bit_offset = (bit_offset<<1)+1;
			else
				break;
		}
		order--;
	}
	printk("\n Last 1 in order %u ",order+1);
	/*
	 * here (order+1) gives us the order where the last 1 was found
	 * and bit_offset gives us the offset of the last 1 in order+1
	 */
	if(order == -1)	{
		/*
		 * We are here because the block of the required order
		 * was split to satisfy an order 0 allocation.
		 * We now find which of the buddies is the allocated one.
		 */
		struct page *buddy1,*buddy2;
		int mask = (~0);
		buddy1 = base + (bit_offset<<1);
		buddy2 = base + ((bit_offset<<1)^(-mask));

		if(page_count(buddy1)) {
			if(movable_page(buddy1)) {
				set_bit(buddy1-start_page,alloc_bitmap);
				return start_page;
			}
		} else if(page_count(buddy2)){
			if(movable_page(buddy2)) {
				set_bit(buddy2-start_page,alloc_bitmap);
				return start_page;
			}
		}
		/*
		 * The order 0 allocation is for a non-movable page
		 */
		return NULL;
	}
	/*
	 * (order+1)bitmap has 0's as its children in _order_
	 * scan the 0's of order to find the allocated & movable blocks
	 */
	if(order < LINEAR_SCAN_THRESHOLD) {
		/*
		 * bit at bit_offset in order+1 map is a 1
		 */
		struct page *buddy1,*buddy2;
		int mask;
		/*
		 * current order = order of two 0s
		 * go to order = order where 1 is found
		 */
		order++;
		mask= (~0)<<order;
		buddy1 = base + (bit_offset<<(order+1));
		switch(linear_scan_pages(buddy1,start_page,order,alloc_bitmap)) {
		case 0:
			/*
			 * scan second 0,since first one is free
			 */
			buddy2 = base + (bit_offset<<(order+1)^(-mask));
			if(linear_scan_pages(buddy2,start_page,order,alloc_bitmap)==2) {
				/*
				 * this block is movable,so return start page
				 */
				return start_page;
			} else {
				/*
				 * second 0 is non-movable,return failure
				 */
				return NULL;
			}
			break;
		case 2:	/*
			 * first 0 is movable,
			 */
			return start_page;

		case 1:
			/*
			 * first 0 is non-movable,so return failure
			 */
			return NULL;
		}
	} else {
		/*
		 * Find the free block by scanning free list of order = order+1
		 */
		struct list_head *ele;
		struct page *buddy1,*buddy2;
		int mask;
		/*
		 * current order = order of two 0s
		 * go to order = order where 1 is found
		 */
		order++;
	       	mask= (~0)<<order;
		/*
		 * buddy1 and buddy2 are start pages for the two 0's
		 */
		buddy1 = base + (bit_offset<<(order+1));
		buddy2 = base + ((bit_offset<<(order+1))^(-mask));
		list_for_each(ele,&area[order].free_list) {
			struct page *page = list_entry(ele,struct page,list);
			if(buddy1 == page) {
			/*
			 * buddy1 is free
			 */
				if(linear_scan_pages(buddy2,start_page,order,alloc_bitmap) == 1)
					return NULL;//buddy2 non-movable
				return start_page;

			}
			else if(buddy2 == page) {
				/*
				 * buddy2 is free
				 */
				if(linear_scan_pages(buddy1,start_page,order,alloc_bitmap) == 1)
					return NULL;//buddy1 non-movable
				return start_page;
			}
		}
	}
	return NULL;
}


/*
* Checks if a page is really free
* All pages having page count = 0 are free EXCEPT pcp pages which are
* semi-free, they have page count 0 and are not on free lists
* Check whether the page is on hot or cold pcp lists
* If it's found on the pcp lists then it's currently treated as non-movable
*/
static int page_free(struct page *page)
{
	struct zone *zone = page_zone(page);

	struct per_cpu_pages *pcp = &zone->pageset[0].pcp[0];
	unsigned long flags;
	struct list_head *ele;
	struct page *temp;

	if(!page_count(page) && !PageCompound(page)&& !page->mapping ) {
		local_irq_save(flags);
		/*
		 * Currently the only way of finding whether a page is in
		 * pcp lists is to scan the pcp lists
		 */
		list_for_each(ele,&pcp->list){
			temp = list_entry(ele,struct page,list);
			if(temp==page) {
				local_irq_restore(flags);
				return 0;
			}
		}
		pcp = &zone->pageset[0].pcp[1];
		list_for_each(ele,&pcp->list) {
			temp = list_entry(ele,struct page,list);
			if(temp==page) {
				local_irq_restore(flags);
				return 0;
			}

		}
		local_irq_restore(flags);
		/*
		 * page is not in pcp lists and thus is free
		 */
		return 1;
	}
	return 0;
}

/*
* Linearly scan the pages in memory for movable buddies
* of order failed_order.
* The scanning is done for each of the buddies of the failed order in
* memory. If all the pages of the buddy can be moved then it is selected.
* For all the allocated pages the corresponding bits in the _alloc_bitmap_
* are set to identify the pages whose contents are to be copied
*/
static struct page * scan_mem_page(int order,struct zone *zone,unsigned long *alloc_bitmap)
{
	int pages_per_block = (1<<order),page_no,cnt,flag,mov_cnt;
	struct page *start_page = zone->zone_mem_map,*page_in_block;
	page_no = 0;
	mov_cnt =0;

	while(page_no<zone->present_pages) {
		page_in_block = start_page;
		cnt = 0;
		flag = 0;
		mov_cnt=0;
		while(cnt<pages_per_block && ((flag=movable_page(page_in_block)) || page_free(page_in_block))) {
			if(flag){
				set_bit(page_in_block-start_page,alloc_bitmap);
				mov_cnt=1;
			}
			page_in_block++;
			cnt++;
		}
		if(cnt==pages_per_block && mov_cnt)
			return start_page;
		clear_bitmap(alloc_bitmap,order);
		start_page +=pages_per_block;
		page_no = page_no + pages_per_block;
	}
	/* After the scan of buddies in the memory no movable buddy of the
	 * failed order are found
	 */
	return NULL;
}




[-- Attachment #3: AMD.h --]
[-- Type: text/plain, Size: 2504 bytes --]

#ifndef AMD_DEFRAG_H
#define AMD_DEFRAG_H

#include<linux/module.h>
#include<linux/init.h>
#include<linux/mmzone.h>
#include<linux/mm.h>
#include<asm/page.h>
#include<linux/string.h>
#include<linux/list.h>
#include<linux/preempt.h>
#include<linux/page-flags.h>
#include<asm/rmap.h>
#include<linux/slab.h>
#include<linux/vmalloc.h>
#include<asm/tlbflush.h>
#include<asm/cacheflush.h>
#include<asm/bitops.h>

#define LINEAR_SCAN_THRESHOLD ((MAX_ORDER>>1)-1)

#define FLAG_MASK 0x000fffff
#define MAX_SIZE (1<<MAX_ORDER)
#define CALL_FROM_HERE

#define ORDER 1



/*
* header for main module
* */
static int amd_init(void);
static void amd_exit(void);
static int movable_page(struct page *page);
static void update_to_ptes(struct page *to);
static void free_allocated_page(struct page *page);
static void flush_from_caches(pte_addr_t paddr);
static void update_struct_page(struct page *to,struct page *from,struct zone *zone);
static int defrag(int order,struct zone *zone);
static int traverse_free_list(int order,struct zone *zone);
static void move_page(struct page *from,struct page *to,struct zone *zone);
static int page_free(struct page *page);

/*
* header for to_pages
* */
static int count_alloc_pages(int order,unsigned long *alloc_bitmap);

static struct page *get_free_buddy(int order,int bit_offset,struct zone *zone);

static void split_block(int order,int *non_movable,int *movable,struct page *free_buddy,struct page **to_pages,struct zone *zone);

static int get_to_pages_higher(int order,struct zone *zone,struct page *start_page,int *non_movable,int *movable,struct page **to_pages,int no_forbidden_bits);

static void get_to_pages(struct zone *zone,struct page *start_page,int allocated,struct page **to_pages,int *non_movable,int *movable,int forbidden_bits);

static int find_to_pages(int failed_order,struct zone *zone,struct page *start_page,int allocated,struct page **to_pages);

/*
* header for from_pages
* */
static struct page *get_from_block(int order,struct zone *zone,unsigned long *alloc_bitmap);

static int linear_scan_pages(struct page *page,struct page *start_page,int order,unsigned long *alloc_bitmap);

static struct page *movable_block(long bit_offset,int order,struct zone *zone,unsigned long *alloc_bitmap);

static struct page *scan_mem_page(int order,struct zone *zone,unsigned long *alloc_bitmap);

static void clear_bitmap(unsigned long *bitmap,int order);

static int bit_test(int nr,unsigned long *addr);

#endif



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2004-02-05 19:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-04 20:12 Active Memory Defragmentation: Our implementation & problems Mark_H_Johnson
     [not found] <1075920386.27981.106.camel@nighthawk>
2004-02-04 18:54 ` Alok Mooley
2004-02-04 19:07   ` Richard B. Johnson
2004-02-04 19:18     ` Alok Mooley
2004-02-04 19:33       ` Richard B. Johnson
2004-02-05  5:07         ` Alok Mooley
2004-02-05 19:03         ` Pavel Machek
2004-02-04 19:35     ` Dave McCracken
2004-02-04 21:59       ` Timothy Miller
2004-02-04 23:24         ` Dave Hansen
2004-02-05 16:32           ` Dave McCracken
2004-02-04 19:37     ` Timothy Miller
2004-02-04 19:43       ` Dave Hansen
2004-02-04 19:59       ` Richard B. Johnson
2004-02-04 19:56   ` Dave Hansen
2004-02-05  5:19     ` Alok Mooley
  -- strict thread matches above, loose matches on Subject: below --
2004-02-03  4:46 Alok Mooley
2004-02-03 21:26 ` Dave Hansen
2004-02-03 22:26   ` Martin J. Bligh
2004-02-04  5:09   ` Alok Mooley
2004-02-04  5:24     ` Mike Fedyk
2004-02-04  5:54     ` Dave Hansen
2004-02-04  6:05       ` Martin J. Bligh
2004-02-04  6:22         ` Dave Hansen
2004-02-04  6:29           ` Martin J. Bligh
2004-02-04  6:40             ` Dave Hansen
2004-02-04  7:17               ` Martin J. Bligh
2004-02-04  8:30                 ` Andrew Morton
2004-02-04  6:57       ` IWAMOTO Toshihiro
2004-02-04  7:10         ` Dave Hansen
2004-02-04  7:50           ` IWAMOTO Toshihiro
2004-02-04 10:33         ` Hirokazu Takahashi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).