* [RFC] My research agenda for 2.7
@ 2003-06-24 23:11 Daniel Phillips
  2003-06-25  0:47 ` William Lee Irwin III
  2003-06-25  9:29 ` Mel Gorman
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Phillips @ 2003-06-24 23:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm

This note describes several items on my research agenda for the 2.7 kernel 
development timeframe.  First a little history.

In 2.3/2.4, the dominant change to Linux's memory management subsystem was the 
unification of the page and buffer cache, in the sense that most double 
storage was eliminated and copying from the buffer to page cache was no 
longer required on each file write.  In 2.5, the dominant change was the 
addition of reverse page mapping.  I participated actively in the latter, 
being motivated by the belief that with reverse mapping in place, a number of 
significant evolutionary improvements to the memory management subsystem 
would become possible, or indeed, easy.  Here, I list the three main projects 
I hope to complete in the 2.7 timeframe, for comment:

  1) Active memory defragmentation
  2) Variable sized page cache objects
  3) Physical block cache

These are ordered from least to most controversial.  I believe all three will 
prove to be worthwhile improvements to Linux's memory management 
subsystem, and hopefully, I support that belief adequately in the text below.  
Of course, working code and benchmarks are the final arbiter of goodness, and 
will appear in due course.

1) Active memory defragmentation

I doubt anyone will deny that this is desirable.  Active defragmentation will 
eliminate higher order allocation failures for non-atomic allocations, and I 
hope, generally improve the efficiency and transparency of the kernel memory 
allocator.

The purpose of active memory defragmentation is to catch corner cases, rather 
than to be a complete replacement for the current allocation system.  The 
most obvious and problematic corner case is the one where all physical memory 
units of a given order are used up, in which case the allocator only has two 
options: wait or fail.  Active defragmentation introduces a third option, 
which should eliminate nearly all instances of the former and all of the 
latter, except when physical memory is genuinely exhausted for some reason 
(i.e., bona fide OOM).

The idea is to run a defragmentation daemon that kicks in whenever 
availability of some allocation order falls below a threshold.  The defrag  
daemon first scans memory for easily movable pages that can form new, free 
units of the needed order.  If this pass fails, the daemon could go as far as 
quiescing the system (a technique already used in RCU locking) and move some 
not-so-easy-to-move pages.
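
Roughly, and only as a sketch, the daemon's main loop might look like the
following.  None of the helpers named here exist; defrag_request_pending(),
free_count(), defrag_threshold() and build_free_units() are made-up names
standing in for the policy just described:

static DECLARE_WAIT_QUEUE_HEAD(kdefragd_wait);

static int kdefragd(void *unused)
{
	for (;;) {
		int order;

		/* Sleep until the page allocator reports a shortage. */
		wait_event(kdefragd_wait, defrag_request_pending());

		for (order = MAX_ORDER - 1; order > 0; order--) {
			if (free_count(order) >= defrag_threshold(order))
				continue;
			/* Pass 1: relocate only the easily movable pages. */
			if (build_free_units(order, DEFRAG_EASY))
				continue;
			/* Pass 2: quiesce the system and move the hard cases. */
			build_free_units(order, DEFRAG_QUIESCE);
		}
	}
	return 0;
}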

In order to move a page of physical memory, we need to know what is pointing 
at it.  This is often easy, for example in the common case when the only 
pointer to a page is held by the page cache and the page is not under IO.  We 
only need to hold the page cache lock and the page table lock to move such a 
page.
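
A sketch of that easy case, assuming the page is clean, not mapped anywhere
and not under IO; replace_in_page_cache() is a hypothetical helper that swaps
the two pages in the radix tree while the page cache lock is held:

/* Illustrative only: exchange a quiescent page cache page for a copy. */
static int move_pagecache_page(struct address_space *mapping,
			       struct page *old, struct page *new)
{
	if (PageLocked(old) || PageWriteback(old))
		return -EBUSY;			/* busy, pick another page */

	copy_highpage(new, old);		/* copy the data */
	new->index = old->index;

	/* Hypothetical: replace old with new in mapping's radix tree,
	 * holding the page cache lock against concurrent lookups. */
	replace_in_page_cache(mapping, old, new);

	page_cache_release(old);		/* drop the cache's reference */
	return 0;
}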

Moving anonymous memory is pretty easy as well, with the help of reverse page 
mapping.  We need to hold the appropriate locks, then walk a page's reverse 
map list, updating pointers to a new copy of the page.  (If we encounter 
nasty corner cases here, we could possibly fall back to a quiescing 
strategy.)
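
Schematically, with the pte_chain walk and page table locking glossed over
(for_each_rmap_pte() and repoint_pte() are invented names):

/* Illustrative only: repoint every mapping of an anonymous page. */
static void move_anon_page(struct page *old, struct page *new)
{
	pte_t *ptep;

	copy_highpage(new, old);

	pte_chain_lock(old);			/* 2.5 rmap chain lock */
	for_each_rmap_pte(old, ptep)
		repoint_pte(ptep, new);		/* same protection, new frame */
	pte_chain_unlock(old);

	flush_tlb_all();	/* other cpus may hold the stale translation */
}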

Some difficult situations may be dealt with by creating a new internal kernel 
API that provides a way of notifying some subsystem that page ownership 
information is wanted, or that certain pages should be reallocated according 
to the wishes of the defragmentation daemon.  Obviously, there is plenty of 
opportunity for over-engineering in this area, but equally, there is 
opportunity for clever design work that derives much benefit while avoiding 
most of the potential complexity.

Physical memory defragmentation is an enabler for variable-sized pages, next 
on my list.

2) Variable sized page objects

This item will no doubt seem as controversial as the first is uncontroversial.  
It may help to know that my prototype code, done under 2.4, indicates that 
the complete system actually gets smaller with this feature, and possibly 
slightly faster as well.  Essentially, if we have variable sized pages then 
we can eliminate the messy buffer lists that are (sometimes) attached to 
pages, and indeed, eliminate buffers themselves.  Traditional buffer IO and 
data operations can be trivially reexpressed in terms of pages, provided page 
objects can take on the same range of sizes as buffers currently do.  The 
block IO library also gets shorter, since much looping and state handling 
becomes redundant.

For my purpose, "variable sized" means that each struct page object can 
reference a data frame of any binary size, including smaller than a physical 
page.  To keep the implementation sane, all pages of a given address_space 
are the same size.  Also, page objects smaller than a physical page are only 
allowed for file-backed memory, not anonymous memory.  Rules for large pages 
remain to be determined; however, since there is already considerable work 
being done in this area, I will not dwell on it further.
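
Concretely, the data structure side could be as small as one new per-mapping
field; the page_shift field below is a proposed addition, not something
struct address_space has today:

/* Proposed: each address_space records the log2 size of its page objects. */
static inline unsigned long mapping_page_size(struct address_space *mapping)
{
	return 1UL << mapping->page_shift;	/* new field, 9..22 as below */
}

/* e.g. ext2 with 1K blocks:    mapping->page_shift = 10;
 *      ordinary 4K page cache: mapping->page_shift = PAGE_CACHE_SHIFT */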

The most interesting aspect of variable sized pages is how subpages are 
handled.  This could get messy, but fortunately a simple approach is possible.
Subpage page structs do not need to reside in the mem_map; instead they can 
be dynamically allocated from slab cache.  The extra bookkeeping needed inside
the page cache operations to keep track of this is not much, and particularly, 
does not add more than a sub-cycle penalty to the case where subpages are 
not used (even so, I expect this penalty to be more than made up by shorter, 
straighter paths in the block IO library).
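
A sketch of that bookkeeping, with subpage_cachep and mapping->page_shift as
proposed names and the question of how the data frame is carved out of a
physical page left aside:

static kmem_cache_t *subpage_cachep;	/* objects are struct page sized */

static struct page *new_page_desc(struct address_space *mapping)
{
	/* Common case: full-size pages; the cost is one test and a branch. */
	if (likely(mapping->page_shift >= PAGE_SHIFT))
		return alloc_page(GFP_HIGHUSER);

	/* Small-block mapping: the descriptor lives in slab, not mem_map. */
	return kmem_cache_alloc(subpage_cachep, GFP_KERNEL);
}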

One benefit of subpages that may not be immediately obvious is the opportunity 
to save some space in the mem_map array: with subpages it becomes quite
attractive to use a larger PAGE_CACHE_SIZE, i.e., a filesystem that must use 
a small block size for some reason won't cause a lot of additional internal 
fragmentation.

But to my mind, the biggest benefit to subpages is the opportunity to 
eliminate some redundant state information that is currently shared between 
pages and buffer_heads.  To be sure, I've been less than successful to date 
at communicating the importance of this simplification, but in this case, the 
code should be the proof.

Variable-size pages will deliver immediate benefits to filesystems such as 
Ext2 and Ext3, in the form of larger volume size limits and more efficient 
transfers.  As a side effect, we will probably need to implement tail merging 
in Ext2/3 to control the resulting increased internal fragmentation, but that 
is another story, for another mailing list.

Variable size pages should fit together nicely with the current work being 
done on large (2 and 4 MB) page handling, and especially, it will be nice for 
architectures like MIPS that can optimize variable sized pages in 
hardware.

Some bullet points:

  - Rationalize state info: state represented only in struct page, not struct
    page + list of struct buffer_head

  - 1K filesystems aren't a special case any more

  - More efficient IO path, esp for 1K fs

  - Net removal of code by simplifying the vfs block IO library (new code
    is added to page cache access functions).

  - Makes the page locking unit match the filesystem locking unit for most
    filesystems

  - Generalizes to superpages

  - Performance.  It's a little more efficient.  Eliminates one class of
    objects, allowing us to concentrate more on the remaining class.

  - Large file blocks (multiple physical pages)

  - Eliminate buffer heads.  Final use of buffer heads is as data handle for
    filesystem metadata (IO function of buffer heads will be taken over by
    BIO).  Basically, we can just substitute struct page for struct
    buffer_head.  Can make this transition gradual, after establishing
    one buffer head per page.

  - Memory pressure now acts on only one class of object, making balancing
    more even.

Relies on:

  - Active defragmentation

How it works:

  - Page size is represented on a per-address space basis with a shift count.
    In practice, the smallest is 9 (512 byte sector), could imagine 7 (each
    ext2 inode is separate page) or 8 (actual hardsect size on some drives).
    12 will be the most common size.  13 gives 8K blocksize for, e.g., alpha.
    21 and 22 give 2M and 4M page size, matching hardware capabilities of
    x86, and other sizes are possible on machines like MIPS, where page size
    is software controllable

  - Implemented only for file-backed memory (page cache)

  - Special case these ops in page cache access layer instead of having the
    messy code in the block IO library

  - Subpage struct pages are dynamically allocated.  But buffer_heads are gone
    so this is a lateral change.

3) Physical block cache

This item is not strictly concerned with memory management, but as it impacts 
the same subsystems, I have included it in this note.

In brief, a physical block cache lets the vfs efficiently answer the question: 
"given a physical data block, tell me if and where it is mapped into any 
address_space on the same volume".  This need not be a big change to the 
existing strategy: the normal lookup and other operations remain available.  
However, the vfs gets the additional responsibility of maintaining a special 
per-volume address_space coherently with the multiple file-backed 
address_spaces on the volume.  

In fact, we already have such per-volume address spaces, and there really 
isn't that much work to do here, in order to test the effects of making the 
two types of address_space coherent with one another.  One way of looking at 
this is that full coherence in this area would complete the work of unifying the 
page and buffer caches, started some years ago.
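
In code terms, the per-volume side of the question can be asked today using
nothing more than the existing blockdev mapping; it is the coherence with the
file-backed address_spaces that does not exist yet:

/* Sketch: is physical block `block' of this volume in the page cache? */
static struct page *find_physical_block(struct block_device *bdev,
					sector_t block, unsigned int blkbits)
{
	struct address_space *mapping = bdev->bd_inode->i_mapping;
	pgoff_t index = block >> (PAGE_CACHE_SHIFT - blkbits);

	return find_get_page(mapping, index);	/* NULL if not cached */
}

With coherence in place, the page found here would also tell you which
file-backed address_space, if any, it belongs to.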

Having discussed this idea with a few developers, I've been warned that 
difficult issues will arise with some of the more complex filesystems, such 
as Ext3.  Fortunately, a prototype physical block cache can be adequately 
tested with Ext2 alone, which doesn't have a lot of issues.  If this proves 
to deliver the performance benefits I expect, further work would be justified 
to extend the functionality to other filesystems.

So what are the anticipated performance benefits?  I've identified two so far:

 1. Physical readahead.  That is, we can load a block into the page cache
    before we know which address_space, if any, it actually belongs to.
    Later, when we do know, additionally entering it into its proper
    address space is efficient.  This will help with the many-small-files
    traversal case, which Linux currently handles poorly.

 2. Raid 5.  The biggest performance problem with Raid 5 stems from the
    fact that for small, isolated writes it is forced to read a few blocks
    to compute every new parity block, and in the process suffers large
    amounts of rotational latency.  A big cache helps a great deal with
    this; however, the size of cache we could expect to find in, e.g., a high
    end scsi drive, is not adequate to eliminate the bad effects, and in
    any event, bus saturation becomes a very real problem.  We could also
    have a separate, physical block cache, but this wastes memory and
    causes unnecessary copying.  Being able to implement the big cache
    directly in the page cache is thus a big advantage in terms of memory
    savings, and reduced data copying.  There is also a latency advantage.

Summary

Please note that all of the above is unofficial, experimental work.  However, 
I do believe that all three of these items have the potential to deliver 
substantial improvements in terms of reliability, efficiency and obvious 
correctness.

Thank you for your indulgence in reading all the way down to here.  The 
timeframe for this work is:

  - Starts as soon as 2.5 closes
  - Prototypes to be posted shortly thereafter

Regards,

Daniel




* Re: [RFC] My research agenda for 2.7
  2003-06-24 23:11 [RFC] My research agenda for 2.7 Daniel Phillips
@ 2003-06-25  0:47 ` William Lee Irwin III
  2003-06-25  1:07   ` Daniel Phillips
  2003-06-25  9:29 ` Mel Gorman
  1 sibling, 1 reply; 30+ messages in thread
From: William Lee Irwin III @ 2003-06-25  0:47 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
>   - Page size is represented on a per-address space basis with a shift count.
>     In practice, the smallest is 9 (512 byte sector), could imagine 7 (each
>     ext2 inode is separate page) or 8 (actual hardsect size on some drives).
>     12 will be the most common size.  13 gives 8K blocksize for, e.g., alpha.
>     21 and 22 give 2M and 4M page size, matching hardware capabilities of
>     x86, and other sizes are possible on machines like MIPS, where page size
>     is software controllable
>   - Implemented only for file-backed memory (page cache)

Per struct address_space? This is an unnecessary limitation.


On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
>   - Special case these ops in page cache access layer instead of having the
>     messy code in the block IO library
>   - Subpage struct pages are dynamically allocated.  But buffer_heads are gone
>     so this is a lateral change.

This gives me the same data structure proliferation chills as bh's.


-- wli


* Re: [RFC] My research agenda for 2.7
  2003-06-25  0:47 ` William Lee Irwin III
@ 2003-06-25  1:07   ` Daniel Phillips
  2003-06-25  1:10     ` William Lee Irwin III
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-25  1:07 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel, linux-mm

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> >   - Page size is represented on a per-address space basis with a shift
> > count. In practice, the smallest is 9 (512 byte sector), could imagine 7
> > (each ext2 inode is separate page) or 8 (actual hardsect size on some
> > drives). 12 will be the most common size.  13 gives 8K blocksize for,
> > e.g., alpha. 21 and 22 give 2M and 4M page size, matching hardware
> > capabilities of x86, and other sizes are possible on machines like MIPS,
> > where page size is software controllable
> >   - Implemented only for file-backed memory (page cache)
>
> Per struct address_space? This is an unnecessary limitation.

It's a sensible limitation, it keeps the radix tree lookup simple.

> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> >   - Special case these ops in page cache access layer instead of having
> > the messy code in the block IO library
> >   - Subpage struct pages are dynamically allocated.  But buffer_heads are
> > gone so this is a lateral change.
>
> This gives me the same data structure proliferation chills as bh's.

It's not nearly as bad.  There is no distinction between subpage and base 
struct page for almost all page operations, e.g., locking, IO, data access.

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-25  1:07   ` Daniel Phillips
@ 2003-06-25  1:10     ` William Lee Irwin III
  2003-06-25  1:25       ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: William Lee Irwin III @ 2003-06-25  1:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> Per struct address_space? This is an unnecessary limitation.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's a sensible limitation, it keeps the radix tree lookup simple.

It severely limits its usefulness. Dropping in a more flexible data
structure should be fine.


On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> This gives me the same data structure proliferation chills as bh's.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's not nearly as bad.  There is no distinction between subpage and base 
> struct page for almost all page operations, e.g., locking, IO, data access.

But those are code sanitation issues. You need to make sure this
doesn't explode on PAE.


-- wli


* Re: [RFC] My research agenda for 2.7
  2003-06-25  1:10     ` William Lee Irwin III
@ 2003-06-25  1:25       ` Daniel Phillips
  2003-06-25  1:30         ` William Lee Irwin III
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-25  1:25 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel, linux-mm

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> Per struct address_space? This is an unnecessary limitation.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's a sensible limitation, it keeps the radix tree lookup simple.
>
> It severely limits its usefulness. Dropping in a more flexible data
> structure should be fine.

Eventually it could well make sense to do that, e.g., the radix tree 
eventually ought to evolve into a btree of extents (probably).  But making 
things so complex in the first version, thus losing much of the incremental 
development advantage, would not be smart.  With a single size of page per 
address_space,  changes to the radix tree code are limited to a couple of 
lines, for example.
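
To show what I mean by a couple of lines, the lookup side changes roughly
like this, where page_shift is the proposed per-mapping field:

	/* today: the index always counts PAGE_CACHE_SIZE units */
	page = radix_tree_lookup(&mapping->page_tree, pos >> PAGE_CACHE_SHIFT);

	/* with one page size per address_space: same tree, different shift */
	page = radix_tree_lookup(&mapping->page_tree, pos >> mapping->page_shift);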

But perhaps you'd like to supply some examples where more than one size of 
page in the same address space really matters?

> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> This gives me the same data structure proliferation chills as bh's.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's not nearly as bad.  There is no distinction between subpage and base
> > struct page for almost all page operations, e.g., locking, IO, data
> > access.
>
> But those are code sanitation issues. You need to make sure this
> doesn't explode on PAE.

Indeed, that is important.  Good night, see you tomorrow.

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-25  1:25       ` Daniel Phillips
@ 2003-06-25  1:30         ` William Lee Irwin III
  0 siblings, 0 replies; 30+ messages in thread
From: William Lee Irwin III @ 2003-06-25  1:30 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
>> It severely limits its usefulness. Dropping in a more flexible data
>> structure should be fine.

On Wed, Jun 25, 2003 at 03:25:47AM +0200, Daniel Phillips wrote:
> Eventually it could well make sense to do that, e.g., the radix tree 
> eventually ought to evolve into a btree of extents (probably).  But making 
> things so complex in the first version, thus losing much of the incremental 
> development advantage, would not be smart.  With a single size of page per 
> address_space,  changes to the radix tree code are limited to a couple of 
> lines, for example.
> But perhaps you'd like to supply some examples where more than one size of 
> page in the same address space really matters?

Software-refill TLB architectures would very much like to be handed the
largest physically contiguous chunk of memory out of pagecache possible
and map it out using the fewest number of TLB entries possible. Dropping
in a B+ tree to replace radix trees should be a weekend project at worst.
Speculatively allocating elements that are "as large as sane/possible"
will invariably result in variable-sized elements in the same tree.


-- wli


* Re: [RFC] My research agenda for 2.7
  2003-06-24 23:11 [RFC] My research agenda for 2.7 Daniel Phillips
  2003-06-25  0:47 ` William Lee Irwin III
@ 2003-06-25  9:29 ` Mel Gorman
  2003-06-26 19:00   ` Daniel Phillips
  1 sibling, 1 reply; 30+ messages in thread
From: Mel Gorman @ 2003-06-25  9:29 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

> 1) Active memory defragmentation
> 
> I doubt anyone will deny that this is desirable.  Active defragmentation will 
> eliminate higher order allocation failures for non-atomic allocations, and I 
> hope, generally improve the efficiency and transparency of the kernel memory 
> allocator.
> 

It might be just me, but this scheme sounds a bit complicated (I'm still
absorbing the other two). I find it difficult to see what happens when a
page used by a kernel pointer changes for any case other than vmalloc()
but I probably am missing something.

How about: Move order-0 allocations to slab (ignoring bootstrap issues for 
now but shouldn't be hard anyway)

Each cache slab is 2^MAX_GFP_ORDER large and there are three caches
  o order0-user
  o order0-kreclaim
  o order0-knoreclaim

order0-user is for any userspace allocation. These pages should be
trivially reclaimable with rmap available. If a large order block is
necessary, select one slab and reclaim it. This will break LRU ordering
something rotten but I am guessing that LRU ordering is not the prime
concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and 
pick the one with the most clean filesystem backed pages to chuck out 
(least IO involved in reclaim).

order0-kreclaim is for kernel allocations which are trivially reclaimable
and that can be safely discarded knowing that no pointer exists to them.
This is most likely to be usable for slab allocations of caches like
dentries which can be safely thrown out.  A quick look at /proc/slabinfo
shows that most slabs are just 1 page large. Slabs already have a
set_shrinker() callback for the removal of objects so it is likely that
this could be extended for telling caches to clear all objects and discard
a particular slab.

order0-knoreclaim is for kernel allocations which cannot be easily 
reclaimed and have pointers to the allocation which are difficult to 
track down. For all intents and purposes, these are not reclaimable without 
implementing swapping in kernel space.
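
As a very rough sketch of the setup (the names and parameters are
illustrative, and whether slab can sanely hand out page-sized, page-aligned
objects like this is part of what would have to be checked):

static kmem_cache_t *order0_user;
static kmem_cache_t *order0_kreclaim;
static kmem_cache_t *order0_knoreclaim;

static void __init order0_caches_init(void)
{
	order0_user = kmem_cache_create("order0-user", PAGE_SIZE, 0,
					SLAB_HWCACHE_ALIGN, NULL, NULL);
	order0_kreclaim = kmem_cache_create("order0-kreclaim", PAGE_SIZE, 0,
					SLAB_HWCACHE_ALIGN, NULL, NULL);
	order0_knoreclaim = kmem_cache_create("order0-knoreclaim", PAGE_SIZE, 0,
					SLAB_HWCACHE_ALIGN, NULL, NULL);
}

An order0 user page would then come from kmem_cache_alloc(order0_user, ...)
rather than alloc_page(); how highmem fits in (slab only serves lowmem today)
is one of the open questions.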

This has a couple of things going for it

o Reclaimable pages are in easy to search globs
o Gets nifty per-cpu alloc and caching that comes with slab automatically
o Freeing high order pages is a case of discarding pages in one slab
o Doesn't require fancy pants updating of pointers or page tables
o Possibly ditch the mempool interface as slab already has similar functionality
o Seems simple

Opinions?

-- 
Mel Gorman


* Re: [RFC] My research agenda for 2.7
  2003-06-25  9:29 ` Mel Gorman
@ 2003-06-26 19:00   ` Daniel Phillips
  2003-06-26 20:01     ` Mel Gorman
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-26 19:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm

On Wednesday 25 June 2003 11:29, Mel Gorman wrote:
> > 1) Active memory defragmentation
> >
> > I doubt anyone will deny that this is desirable.  Active defragmentation
> > will eliminate higher order allocation failures for non-atomic
> > allocations, and I hope, generally improve the efficiency and
> > transparency of the kernel memory allocator.
>
> It might be just me, but this scheme sounds a bit complicated (I'm still
> absorbing the other two).

Mel,

I probably spent too much time dwelling on the hard cases of page moving, and 
didn't sufficiently emphasize that handling a few easy cases would do a 
pretty good job, certainly better than we do now.  For example:

  * Most process pages are easily movable, since usually only page tables
    will hold references.

  * Most page cache pages are easily movable, likewise

  * Similarly, page table pages are not too hard to move 

Most slab pages are hard to move.  We could try to fix that for certain common 
object types, or we could just tell slab to use its own biggish chunks of 
memory, which it can play in as it sees fit.

> I find it difficult to see what happens when a
> page used by a kernel pointer changes for any case other than vmalloc()
> but I probably am missing something.

The point you apparently missed is that the defragger will identify and update 
those kernel pointers, being careful about races of course.

> How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> now but shouldn't be hard anyway)

That sounds like radical surgery to me, but to each his own experiments.

> Each cache slab is 2^MAX_GFP_ORDER large and there are three caches
>   o order0-user
>   o order0-kreclaim
>   o order0-knoreclaim
>
> order0-user is for any userspace allocation. These pages should be
> trivially reclaimable with rmap available. If a large order block is
> necessary, select one slab and reclaim it. This will break LRU ordering
> something rotten but I am guessing that LRU ordering is not the prime
> concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and
> pick the one with the most clean filesystem backed pages to chuck out
> (least IO involved in reclaim).

Defragmentation by selective eviction is possible, but isn't necessarily 
optimal.  In the case of memory that isn't swap or file-backed, it isn't even 
possible.  On the other hand, you may think of the page move case as simply 
an optimized evict-and-reload, if that helps understand where I'm going.

Regardless, it would be good to teach vmscan to evict pages that will help 
build higher order allocation units, when these are in short supply.  This 
would be an optimization heuristic; it would still be necessary to handle the 
general case of data moving in order to make any guarantee re fragmentation 
control.

> order0-kreclaim is for kernel allocations which are trivially reclaimable
> and that can be safely discarded knowing that no pointer exists to them.
> This is most likely to be usable for slab allocations of caches like
> dentries which can be safely thrown out.  A quick look at /proc/slabinfo
> shows that most slabs are just 1 page large. Slabs already have a
> set_shrinker() callback for the removal of objects so it is likely that
> this could be extended for telling caches to clear all objects and discard
> a particular slab.
>
> order0-knoreclaim is for kernel allocations which cannot be easily
> reclaimed and have pointers to the allocation which are difficult to
> track down. For all intents and purposes, these are not reclaimable without
> implementing swapping in kernel space.
>
> This has a couple of things going for it
>
> o Reclaimable pages are in easy to search globs
> o Gets nifty per-cpu alloc and caching that comes with slab automatically
> o Freeing high order pages is a case of discarding pages in one slab
> o Doesn't require fancy pants updating of pointers or page tables

Without updating pointers, active defragmentation is not possible.  But 
perhaps you meant to say that active defragmentation is not needed?

> o Possibly ditch the mempool interface as slab already has similar
> functionality
> o Seems simple
>
> Opinions?

Yes :-)

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-26 19:00   ` Daniel Phillips
@ 2003-06-26 20:01     ` Mel Gorman
  2003-06-26 20:10       ` Andrew Morton
  2003-06-27  0:30       ` Daniel Phillips
  0 siblings, 2 replies; 30+ messages in thread
From: Mel Gorman @ 2003-06-26 20:01 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

On Thu, 26 Jun 2003, Daniel Phillips wrote:

>   * Most process pages are easily movable, since usually only page tables
>     will hold references.
>
>   * Most page cache pages are easily movable, likewise
>
>   * Similarly, page table pages are not too hard to move
>

I think that finding pages like this together is unlikely, especially if
the system has been running a long time. In the worst case you will have
every easily-moved page adjacent to a near-impossible-to-move page. The
buddy allocator could be taught to be selective of which freelists it used
for different types of allocation but that doesn't sound any easier than
moving order0 pages to slab.

There would also be trouble identifying adjacent order-0 pages which
are used for these particular tasks. Buddy allocators, including the
one implemented in Linux, do not record what order allocation a struct
page belongs to. I could be wrong but it just feels like it would be a lot
of work to free up just a few adjacent pages.

Tentatively, I would assert that being able to group these types of pages
together is a prerequisite for effective defragging and I think moving
order-0 pages to slab would enforce this grouping. For example, the defrag
daemon could scan just order0-user slabs and move pages from sparsely
populated slabs to other fuller slabs and free slabs as it goes.

In other words, order0 caches and defragging are not mutually exclusive
goals, but I bet you a shiny penny it'd make the implementation a lot
easier.

> Most slab pages are hard to move.  We could try to fix that for certain
> common object types, or we could just tell slab to use its own biggish
> chunks of memory, which it can play in as it sees fit.
>

Moving slab pages is probably not an option unless some quick way of
updating all the pointers to the objects is found. I would say the only
slab pages that are "movable" belong to caches which can quickly discard
their objects.

> > I find it difficult to see what happens when a
> > page used by a kernel pointer changes for any case other than vmalloc()
> > but I probably am missing something.
>
> The point you apparently missed is that the defragger will identify and
> update those kernel pointers, being careful about races of course.
>

I didn't miss it as such, but I don't see how it could be implemented
either. I also wonder if moving kernel pages is really worth the hassle.
It's perfectly possible that defragging only user pages will be enough.
The only way to be sure would be to walk all struct pages in the mem_map
and see what the layout looks like.

> > How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> > now but shouldn't be hard anyway)
>
> That sounds like radical surgery to me, but to each his own experiments.
>

I don't think it would be too radical because the buddy allocator would
still remain the principal page allocator. What would be a bitch is the
additional storage required for such a large number of slabs
needed to contain pages. At the very least, the slab allocator would need
a means of discarding the slab descriptors for full slabs and maintaining
just pointers to the first page of each full slab. A comparison of page
allocation from buddy and page allocation from slab would also be
necessary to make sure the page allocation performance wasn't butchered.

> Defragmentation by selective eviction is possible, but isn't necessarily
> optimal.  In the case of memory that isn't swap or file-backed, it isn't even
> possible.  On the other hand, you may think of the page move case as simply
> an optimized evict-and-reload, if that helps understand where I'm going.
>

Well, take this case

1) Slabs can contain 10 order0 pages
2) Defragger finds two slabs; slabA with 2 pages and slabB with 5
3) Defragger copies 2 pages from slabA to slabB and uses rmap to update
   page tables

The alternative sounds like it would be scanning mem_map looking for pages
that can be moved, which sounds expensive. With order0 in slab, you can
easily find blocks of pages which, when moved, immediately free up a large
adjacent block.

> > o Reclaimable pages are in easy to search globs
> > o Gets nifty per-cpu alloc and caching that comes with slab automatically
> > o Freeing high order pages is a case of discarding pages in one slab
> > o Doesn't require fancy pants updating of pointers or page tables
>
> Without updating pointers, active defragmentation is not possible.  But
> perhaps you meant to say that active defragmentation is not needed?
>

I am saying that full defragmentation would be a very expensive operation
which might not work if it cannot find suitably freeable adjacent pages.
If order0 pages were in slab, the whole searching problem becomes trivial
(just go to the relevant cache and scan the slabs).

> > o Possibly ditch the mempool interface as slab already has similar
> > functionality
> > o Seems simple
> >
> > Opinions?
>
> Yes :-)
>

heh.

-- 
Mel Gorman


* Re: [RFC] My research agenda for 2.7
  2003-06-26 20:01     ` Mel Gorman
@ 2003-06-26 20:10       ` Andrew Morton
  2003-06-27  0:30       ` Daniel Phillips
  1 sibling, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2003-06-26 20:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: phillips, linux-kernel, linux-mm

Mel Gorman <mel@csn.ul.ie> wrote:
>
> Buddy allocators, including the
>  one implemented in Linux, do not record what order allocation a struct
>  page belongs to.

We can do that.


--- 25/mm/page_alloc.c~a	2003-06-26 13:09:11.000000000 -0700
+++ 25-akpm/mm/page_alloc.c	2003-06-26 13:09:24.000000000 -0700
@@ -123,6 +123,7 @@ static void prep_compound_page(struct pa
 		SetPageCompound(p);
 		p->lru.next = (void *)page;
 	}
+	page[1].index = order;
 }
 
 static void destroy_compound_page(struct page *page, unsigned long order)

_
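
(As a sketch of the companion accessor: every page of the compound has
lru.next pointing back at the head page, so the stored order can be read back
like this.)

static inline unsigned long page_order(struct page *page)
{
	struct page *head = (struct page *)page->lru.next;

	BUG_ON(!PageCompound(page));
	return head[1].index;
}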



* Re: [RFC] My research agenda for 2.7
  2003-06-26 20:01     ` Mel Gorman
  2003-06-26 20:10       ` Andrew Morton
@ 2003-06-27  0:30       ` Daniel Phillips
  2003-06-27 13:00         ` Mel Gorman
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27  0:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm

On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> I think that finding pages like this together is unlikely, especially if
> the system has been running a long time. In the worst case you will have
> every easily-moved page adjacent to a near-impossible-to-move page.

Mel,

I addressed that in my previous post: "Most slab pages are hard to move ... we 
could just tell slab to use its own biggish chunks of memory, which it can 
play in as it sees fit".  Another way of putting this is, we isolate the type 
of data that's hard to defrag in its own space, so it can't fragment our 
precious movable space.  That separate space can expand and contract 
chunkwise.

To be sure, this solution means we still can't forcibly create higher order 
units in those undefraggable regions, but that is no worse than the current 
situation.  In the defraggable regions, hopefully the majority of memory, we 
have a much improved situation: we can forcibly create as many high order 
allocation units as we need, in order to handle, e.g., an Ext2/3 filesystem 
that's running with a large blocksize.

> Moving slab pages is probably not an option unless some quick way of
> updating all the pointers to the objects is found.

That was the part in my original post about a "new API".  If we choose to do 
so, we could fix up most kinds of slab objects to support some protocol to 
assist in the effort, or use a data structure better suited to relocation.  I 
doubt it's necessary to go that far.  As far as I can see, slab is not a big 
offender in terms of fragmentation at the moment.

> I also wonder if moving kernel pages is really worth the hassle.

That's the question of course.  The benefit is getting rid of high order 
allocation failures, and gaining some confidence that larger filesystem 
blocksizes will work reliably, however the workload evolves.  The cost is 
some additional complexity, no argument about that.  On the other hand, it 
would no longer be necessary to use voodoo to try to get the allocator to 
magically not fragment.

> It's perfectly possible that defragging only user pages will be enough.

Indeed.  That's an excellent place to start.  We'd include page cache pages in 
that, I'd expect.  Then it would be attractive to go on to handle page table 
pages, and with that we'd have the low-hanging fruit.

> The alternative sounds like it would be scanning mem_map looking for pages
> that can be moved which sounds expensive.

It looks linear to me, and it's not supposed to happen often.  It would 
typically be a response to a nasty situation like you'd get by mounting a 
volume that uses 32K blocks after running a long time with mostly 4K blocks.

> > Without updating pointers, active defragmentation is not possible.  But
> > perhaps you meant to say that active defragmentation is not needed?
>
> I am saying that full defragmentation would be a very expensive operation
> which might not work if it cannot find suitably freeable adjacent pages.

Full defragmentation, whatever that is, would be pointless for the 
applications I've described, though perhaps it could be needed by some 
bizarre application like software hotplug memory.  In the latter case you 
wouldn't be too worried about how long it takes.

In general though, the defragger would only run long enough to restore balance  
to the free counts across the various orders.  In rare cases, it would have 
to run synchronously in order to prevent an allocation from failing.

> If order0 pages were in slab, the whole searching problem becomes trivial
> (just go to the relevant cache and scan the slabs).

You might want to write a separate [rfc] to describe your idea.  For one 
thing, I don't see why you'd want to use slab for that.

Regards,

Daniel


* Re: [RFC] My research agenda for 2.7
  2003-06-27  0:30       ` Daniel Phillips
@ 2003-06-27 13:00         ` Mel Gorman
  2003-06-27 14:38           ` Martin J. Bligh
                             ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Mel Gorman @ 2003-06-27 13:00 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-mm

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> > I think that finding pages like this together is unlikely, especially if
> > the system has been running a long time. In the worst case you will have
> every easily-moved page adjacent to a near-impossible-to-move page. The
>
> I addressed that in my previous post: "Most slab pages are hard to move
> ... we could just tell slab to use its own biggish chunks of memory,
> which it can play in as it sees fit".

Ah, ok, sorry, I missed that but it wouldn't be the first time I missed
something. It was just a few days ago I wrote a pile of material on the
new buddy allocator as part of a publication and still missed that the
order of pages can be identified because of compound pages, thanks Andrew.

> > I also wonder if moving kernel pages is really worth the hassle.
>
> That's the question of course.  The benefit is getting rid of high order
> allocation failures, and gaining some confidence that larger filesystem
> blocksizes will work reliably, however the workload evolves.

I'm still working on 2.6 documentation which I expect will be published in
a few months. When I get that written, I'll look into seeing what can be
done with VM Regress to calculate fragmentation and to see how often
high order allocations actually fail. It might help determine where
defragging is most needed.

IIRC, Martin J. Bligh had a patch which displayed information about the
buddy allocator freelist so that will probably be the starting point. From
there, it should be handy enough to see how intermixed are kernel page
allocations with user allocations. It might turn out that kernel pages
tend to be clustered together anyway.

> > If order0 pages were in slab, the whole searching problem becomes trivial
> > (just go to the relevant cache and scan the slabs).
>
> You might want to write a separate [rfc] to describe your idea.  For one
> thing, I don't see why you'd want to use slab for that.
>

You're right, I will need to write a proper RFC one way or the other. I
was thinking of using slabs because that way there wouldn't be any need to
scan all of mem_map, just a small number of slabs. I have no basis for
this other than hand waving gestures though.

Anyway, as I know I won't be coding any time soon due to writing docs,
I'll shut up for the moment :-)

-- 
Mel Gorman


* Re: [RFC] My research agenda for 2.7
  2003-06-27 13:00         ` Mel Gorman
@ 2003-06-27 14:38           ` Martin J. Bligh
  2003-06-27 14:41           ` Daniel Phillips
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Martin J. Bligh @ 2003-06-27 14:38 UTC (permalink / raw)
  To: Mel Gorman, Daniel Phillips; +Cc: linux-kernel, linux-mm


> IIRC, Martin J. Bligh had a patch which displayed information about the
> buddy allocator freelist so that will probably be the starting point. From
> there, it should be handy enough to see how intermixed are kernel page
> allocations with user allocations. It might turn out that kernel pages
> tend to be clustered together anyway.

That should be merged now - /proc/buddyinfo. I guess you could do the same
for allocated pages (though it'd be rather heavyweight ;-))

M.



* Re: [RFC] My research agenda for 2.7
  2003-06-27 13:00         ` Mel Gorman
  2003-06-27 14:38           ` Martin J. Bligh
@ 2003-06-27 14:41           ` Daniel Phillips
  2003-06-27 14:43           ` Martin J. Bligh
  2003-07-02 21:10           ` Mike Fedyk
  3 siblings, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27 14:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm

On Friday 27 June 2003 15:00, Mel Gorman wrote:
> On Fri, 27 Jun 2003, Daniel Phillips wrote:
> I was thinking of using slabs because that way there wouldn't be any need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

That's the right idea, it's just not necessary to use slab cache to achieve 
it.  Actually, to handle huge (hardware) pages efficiently, my first instinct 
is to partition them into their own largish chunks as well, and allocate new 
chunks as necessary.  To get rid of a chunk (because freespace of that type 
of chunk has fallen below some threshold) it has to be entirely empty, which 
can be accomplished using the same move logic as for defragmentation.

You're right to be worried about intrusion of unmovable pages into regions 
that are supposed to be defraggable.  It's very easy for some random kernel 
code to take a use count on a page and hang onto it forever, making the page 
unmovable.  My hope is that:

  - This doesn't happen much
  - Code that does that can be cleaned up
  - Even when it happens it won't hurt much
  - If all of the above fails, fix the api for the offending code or create
    a new one
  - If that fails too, give up.

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-27 13:00         ` Mel Gorman
  2003-06-27 14:38           ` Martin J. Bligh
  2003-06-27 14:41           ` Daniel Phillips
@ 2003-06-27 14:43           ` Martin J. Bligh
  2003-06-27 14:54             ` Daniel Phillips
  2003-07-02 21:10           ` Mike Fedyk
  3 siblings, 1 reply; 30+ messages in thread
From: Martin J. Bligh @ 2003-06-27 14:43 UTC (permalink / raw)
  To: Mel Gorman, Daniel Phillips; +Cc: linux-kernel, linux-mm


>> > I also wonder if moving kernel pages is really worth the hassle.
>> 
>> That's the question of course.  The benefit is getting rid of high order
>> allocation failures, and gaining some confidence that larger filesystem
>> blocksizes will work reliably, however the workload evolves.

Oh, BTW ... I suspect you've realised this already, but ....

The buddy allocator is not a good system for getting rid of fragmentation. 
If I group pages together in aligned pairs, and F is free and A is 
allocated, it'll not do anything useful with this:

F A   A F   F A   A F   F A   A F   F A   A F   F A   F A 

because the adjacent "F"s aren't "buddies". It seems that the purpose of
the buddy allocator was to be quick at allocating pages. Now that we've stuck
a front-end cache on it, in the form of hot & cold pages, that goal no
longer seems paramount - altering it to reduce fragmentation at the source,
rather than actively defragging afterwards, would seem like a good goal to me.

M.



* Re: [RFC] My research agenda for 2.7
  2003-06-27 14:43           ` Martin J. Bligh
@ 2003-06-27 14:54             ` Daniel Phillips
  2003-06-27 15:04               ` Martin J. Bligh
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27 14:54 UTC (permalink / raw)
  To: Martin J. Bligh, Mel Gorman; +Cc: linux-kernel, linux-mm

On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
> The buddy allocator is not a good system for getting rid of fragmentation.

We've talked in the past about throwing out the buddy allocator and adopting 
something more modern and efficient and I hope somebody will actually get 
around to doing that.  In any event, defragging is an orthogonal issue.  Some 
allocation strategies may be statistically more resistant to fragmentation 
than others, but no allocator has been invented, or ever will be, that can 
guarantee that terminal fragmentation will never occur - only active 
defragmentation can provide such a guarantee.

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-27 14:54             ` Daniel Phillips
@ 2003-06-27 15:04               ` Martin J. Bligh
  2003-06-27 15:17                 ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Martin J. Bligh @ 2003-06-27 15:04 UTC (permalink / raw)
  To: Daniel Phillips, Mel Gorman; +Cc: linux-kernel, linux-mm

--Daniel Phillips <phillips@arcor.de> wrote (on Friday, June 27, 2003 16:54:46 +0200):

> On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
>> The buddy allocator is not a good system for getting rid of fragmentation.
> 
> We've talked in the past about throwing out the buddy allocator and adopting 
> something more modern and efficient and I hope somebody will actually get 
> around to doing that.  In any event, defragging is an orthogonal issue.  Some 
> allocation strategies may be statistically more resistant to fragmentation 
> than others, but no allocator has been invented, or ever will be, that can 
> guarantee that terminal fragmentation will never occur - only active 
> defragmentation can provide such a guarantee.

Whilst I agree with that in principle, it's inevitably expensive. Thus 
whilst we may need to have that code, we should try to avoid using it ;-)

The buddy allocator is obviously flawed in this department ... strategies
like allocating a 4M block to a process up front, then allocing out of that
until we're low on mem, then reclaiming in as large blocks as possible from
those process caches, etc, etc, would obviously help too. Though maybe
we're just permanently low on mem after a while, so it'd be better to just
group pagecache pages together ... that would actually be pretty simple to
change ... hmmm.

M.



* Re: [RFC] My research agenda for 2.7
  2003-06-27 15:04               ` Martin J. Bligh
@ 2003-06-27 15:17                 ` Daniel Phillips
  2003-06-27 15:22                   ` Mel Gorman
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27 15:17 UTC (permalink / raw)
  To: Martin J. Bligh, Mel Gorman; +Cc: linux-kernel, linux-mm

On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> Daniel Phillips <phillips@arcor.de> wrote (on Friday, June 27, 2003
> > Some allocation strategies may be statistically more
> > resistant to fragmentation than others, but no allocator has been
> > invented, or ever will be, that can guarantee that terminal fragmentation
> > will never occur - only active defragmentation can provide such a
> > guarantee.
>
> Whilst I agree with that in principle, it's inevitably expensive. Thus
> whilst we may need to have that code, we should try to avoid using it ;-)

That's exactly the idea.  Active defragmentation is just a fallback to handle  
currently-unhandled corner cases.  A good, efficient allocator that resists 
fragmentation in the first place is still needed.

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-27 15:17                 ` Daniel Phillips
@ 2003-06-27 15:22                   ` Mel Gorman
  2003-06-27 15:50                     ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Mel Gorman @ 2003-06-27 15:22 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> > Daniel Phillips <phillips@arcor.de> wrote (on Friday, June 27, 2003
> > > Some allocation strategies may be statistically more
> > > resistant to fragmentation than others, but no allocator has been
> > > invented, or ever will be, that can guarantee that terminal fragmentation
> > > will never occur - only active defragmentation can provide such a
> > > guarantee.
> >
> > Whilst I agree with that in principle, it's inevitably expensive. Thus
> > whilst we may need to have that code, we should try to avoid using it ;-)
>
> That's exactly the idea.  Active defragmentation is just a fallback to handle
> currently-unhandled corner cases.  A good, efficient allocator that resists
> fragmentation in the first place is still needed.
>

I still suspect moving order0 allocations to slab will make a
fragmentation-resistant allocator, but my main concern would be that the slab
allocator overhead, in both CPU and storage requirements, will be too high.

On the other hand, it would do some things you are looking for. For
example, it allocates large blocks of memory in one lump and then
allocates them piecemeal. Second, it would be resistant to the FAFAFA
problem Martin pointed out. As slabs would be allocated in a large block
from the buddy, you are guaranteed that you'll be able to free up buddies.
Lastly, as there would be a cache specifically for userspace pages, a
defragger that looked exclusively at user pages will still be sure of
being able to free adjacent buddies.

I need to write a proper RFC.....

-- 
Mel Gorman


* Re: [RFC] My research agenda for 2.7
  2003-06-27 15:22                   ` Mel Gorman
@ 2003-06-27 15:50                     ` Daniel Phillips
  2003-06-27 16:00                       ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27 15:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Friday 27 June 2003 17:22, Mel Gorman wrote:
> I still suspect moving order0 allocations to slab will make a
> fragmentation-resistant allocator, but my main concern would be that the slab
> allocator overhead, in both CPU and storage requirements, will be too high.
>
> On the other hand, it would do some things you are looking for. For
> example, it allocates large blocks of memory in one lump and then
> allocates them piecemeal. Second, it would be resistant to the FAFAFA
> problem Martin pointed out. As slabs would be allocated in a large block
> from the buddy, you are guaranteed that you'll be able to free up buddies.
> Lastly, as there would be a cache specifically for userspace pages, a
> defragger that looked exclusively at user pages will still be sure of
> being able to free adjacent buddies.
>
> I need to write a proper RFC.....

You might want to have a look at this:

   http://www.research.att.com/sw/tools/vmalloc/
   (Vmalloc: A Memory Allocation Library)

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-27 15:50                     ` Daniel Phillips
@ 2003-06-27 16:00                       ` Daniel Phillips
  2003-06-29 19:25                         ` Mel Gorman
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-27 16:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Friday 27 June 2003 17:50, Daniel Phillips wrote:
> You might want to have a look at this:
>
>    http://www.research.att.com/sw/tools/vmalloc/
>    (Vmalloc: A Memory Allocation Library)

Whoops, I retract that.  I did a quick scan of the page and ran into a link 
labled "Non-Commercial License Agreement and Software", attached to some form 
of license loaded with ugly patent words and so on.  My immediate reaction 
is: stay far, far away from that work, it is encumbered.  It would be nice if 
somebody knows differently, but for now I must ignore Vo's work.

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-29 19:25                         ` Mel Gorman
@ 2003-06-28 21:06                           ` Daniel Phillips
  2003-06-29 21:26                             ` Mel Gorman
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-28 21:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Sunday 29 June 2003 21:25, Mel Gorman wrote:
> As you can see, order0 allocations were a *lot* more common, at least in
> my system.

Mel,

There's no question that that's the case today.  However, there are good 
reasons for using a largish filesystem blocksize, 16K for example, once it 
becomes possible to do so.  With an active volume mounted using 16K blocks, 
you'd see that the balance of allocations shifts towards order 2.  The size 
of the shift will be workload-dependent, ranging from almost no order 2 
allocations, to almost all.  To keep things interesting, it's quite possible 
for the balance to change suddenly and/or strongly.

> Because they are so common in comparison to other orders, I
> think that putting order0 in slabs of size 2^MAX_ORDER will make
> defragmentation *so* much easier, if not plain simple, because you can
> shuffle around order0 pages in the slabs to free up one slab which frees
> up one large 2^MAX_ORDER adjacent block of pages.

But how will you shuffle those pages around?

Regards,

Daniel



* Re: [RFC] My research agenda for 2.7
  2003-06-29 21:26                             ` Mel Gorman
@ 2003-06-28 21:54                               ` Daniel Phillips
  2003-06-29 22:07                                 ` William Lee Irwin III
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2003-06-28 21:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
> On Sat, 28 Jun 2003, Daniel Phillips wrote:
> > > Because they are so common in comparison to other orders, I
> > > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > > defragmentation *so* much easier, if not plain simple, because you can
> > > shuffle around order0 pages in the slabs to free up one slab which
> > > frees up one large 2^MAX_ORDER adjacent block of pages.
> >
> > But how will you shuffle those pages around?
>
> That's where your defragger would need to kick in. The defragger would scan
> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
> slab with the least objects in them and either:
> a) Reclaim the pages by swapping them out or whatever
> b) Copy the pages to another slab and update the page tables via rmap
> Once the pages are copied from the selected slab, destroy it and you have
> a large block of free pages.

Yes, now we're on the same page, so to speak.  Personally, I don't have much 
interest in working on improving the allocator per se.  I'd love to see 
somebody else take a run at that, and I will occupy myself with the gritty 
details of how to move pages without making the system crater.

> The point which I am getting across, quite
> badly, is that by having order0 pages in slab, you are guaranteed to be
> able to quickly find a cluster of pages to move which will free up a
> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
> the current slab implementation.

I don't see that it's guaranteed, but I do see that organizing pages in 
slab-like chunks is a good thing to do - a close reading of my original 
response to you shows I was thinking about that.

I also don't see that the slab cache in its current incarnation is the right 
tool for the job.  It handles things that we just don't need to handle, such 
as objects of arbitrary size and alignment.  I'm sure you could make it work, 
but why not just tweak alloc_pages to know about chunks instead?
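
To sketch what I mean by "chunks" (purely illustrative, nothing below 
exists in the tree): a chunk would be a naturally aligned 2^MAX_ORDER 
block that only ever hands out order-0 pages of one type, so getting a 
contiguous block back means emptying exactly one chunk.

	struct chunk {
		struct list_head list;	/* per-type list of partly used chunks */
		struct page *base;	/* first page of the 2^MAX_ORDER block */
		unsigned long map[(1 << MAX_ORDER) / BITS_PER_LONG];
					/* bitmap of order-0 pages handed out */
		int free;		/* order-0 pages still available */
	};

	/* the order-0 fast path in alloc_pages() would try the caller's
	 * current chunk first and only fall back to the buddy lists when
	 * that chunk is full */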

Regards,

Daniel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-06-29 22:07                                 ` William Lee Irwin III
@ 2003-06-28 23:18                                   ` Daniel Phillips
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2003-06-28 23:18 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Mel Gorman, Martin J. Bligh, linux-kernel, linux-mm

On Monday 30 June 2003 00:07, William Lee Irwin III wrote:
> On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> > ...I will occupy myself with the
> > gritty details of how to move pages without making the system crater.
>
> This sounds like it's behind dependent on physically scanning slabs,
> since one must choose slab pages for replacement on the basis of their
> potential to restore contiguity, not merely "dump whatever's replaceable
> and check how much got freed".

Though I'm not sure what "behind dependent" means, and I'm not the one 
advocating slab for this, it's quite correct that scanning strategy would 
need to change, at least when the system runs into cross-order imbalances.  
But this isn't much different from the kinds of things we do already.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-06-27 16:00                       ` Daniel Phillips
@ 2003-06-29 19:25                         ` Mel Gorman
  2003-06-28 21:06                           ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Mel Gorman @ 2003-06-29 19:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> >    http://www.research.att.com/sw/tools/vmalloc/
> >    (Vmalloc: A Memory Allocation Library)
>
> Whoops, I retract that.  I did a quick scan of the page and ran into a link
> labled "Non-Commercial License Agreement and Software", attached to some form
> of license loaded with ugly patent words and so on.

The paper itself is still freely available and the idea could still be
implemented, but I still think that the slab allocator will be sufficient
for order0 allocations, using caches to cluster the different types of
order0 allocation together. I've read that paper before and I think it is
overkill for the kernel, because the allocation pattern isn't that
complicated in comparison to the userspace memory management that
allocator is aimed at. Not to mention that to get the most out of the
allocator, you need to know what your allocation pattern is going to be
like in advance.
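
For what it's worth, what I have in mind is nothing fancier than a 
page-sized cache per allocation type, so that like allocations get packed 
into the same underlying slabs. A rough, untested sketch (the cache names 
are invented, and it glosses over the fact that slab returns a kernel 
virtual address rather than a struct page, which matters for highmem 
user pages):

	#include <linux/slab.h>

	/* one cache per order-0 allocation type; the names are made up */
	static kmem_cache_t *order0_user_cache;
	static kmem_cache_t *order0_kernel_cache;

	void __init order0_caches_init(void)
	{
		order0_user_cache = kmem_cache_create("order0_user",
				PAGE_SIZE, 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
		order0_kernel_cache = kmem_cache_create("order0_kernel",
				PAGE_SIZE, 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
	}

	/* a userspace page would then come from
	 * kmem_cache_alloc(order0_user_cache, GFP_USER) rather than
	 * alloc_page(GFP_USER), so GFP_USER pages end up sharing slabs */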

Of course, at this point, I still need to write an RFC about using slab
for order0 allocations, and ultimately I'll have to put code where my emails
are :-/

To help show why slab might work though, I updated vmregress and released
0.10 available at http://www.csn.ul.ie/~mel/projects/vmregress/ . The main
addition is that it now can collect some basic statistics on the types of
page allocations. Specifically, with the trace_alloc module loaded and the
trace_pagealloc.diff patch applied (see kernel_patches/ directory),
/proc/vmregress/trace_alloc will print out the number of allocations for
each order that used either the GFP_KERNEL or GFP_USER flags.

To test it, I booted a clean 2.5.73 system[1], started X and ran
konqueror.  The results were;

Kernel allocations
------------------
	 Before X	After X
Order 0: 6886		14595
Order 1: 335		383
Order 2: 17		23
Order 3: 2		2
Order 4: 3		3
Order 5: 0		0
Order 6: 0		0
Order 7: 0		0
Order 8: 0		0
Order 9: 0		0
Order 10: 0		0

Userspace allocations
---------------------
	 Before X	After X
Order 0: 53050		71505
Order 1: 1		1
Order 2: 0		0
Order 3: 0		0
Order 4: 0		0
Order 5: 0		0
Order 6: 0		0
Order 7: 0		0
Order 8: 0		0
Order 9: 0		0
Order 10: 0		0

As you can see, order0 allocations were a *lot* more common, at least in
my system. Because they are so common in comparison to other orders, I
think that putting order0 in slabs of size 2^MAX_ORDER will make
defragmentation *so* much easier, if not plain simple, because you can
shuffle around order0 pages in the slabs to free up one slab which frees
up one large 2^MAX_ORDER adjacent block of pages.

It's possible that a journalled filesystem (or some other subsystem I am
not familiar with) would show that order > 0 allocations are a lot more
important than my system shows, but the means to collect the information is
in vmregress, so be my guest. When the time requires, I'll extend vmregress
to see how fragmented memory actually gets.

Much, much later, it would be worth looking at the buddy allocator and
seeing whether it would be worth changing it to something much simpler, as
most of its work would be handled by the slab allocator instead.

[1] Tonnes of errors were spewed out about uninitialised timers. With highmem
    enabled, I got a sleeping-while-atomic warning in a loop caused by
    default_idle(). I still have to see whether I can get around to tracking
    down whether it's a PEBKAC problem or a kernel bug.

-- 
Mel Gorman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-06-28 21:06                           ` Daniel Phillips
@ 2003-06-29 21:26                             ` Mel Gorman
  2003-06-28 21:54                               ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Mel Gorman @ 2003-06-29 21:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Sat, 28 Jun 2003, Daniel Phillips wrote:

> There's no question that that's the case today.  However, there are good
> reasons for using a largish filesystem blocksize, 16K for example, once it
> becomes possible to do so.

Fair enough. I am not very familiar with filesystems or why it would be
desirable to have a blocksize greater than a page size. I'd be grateful if
you'd recommend a good book, paper or documentation set that would
enlighten me.

> > Because they are so common in comparison to other orders, I
> > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > defragmentation *so* much easier, if not plain simple, because you can
> > shuffle around order0 pages in the slabs to free up one slab which frees
> > up one large 2^MAX_ORDER adjacent block of pages.
>
> But how will you shuffle those pages around?
>

That's where your defragger would need to kick in. The defragger would scan
at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
slab with the fewest objects in it and either:

a) Reclaim the pages by swapping them out or whatever
b) Copy the pages to another slab and update the page tables via rmap

Once the pages are copied from the selected slab, destroy it and you have
a large block of free pages. The point which I am getting across, quite
badly, is that by having order0 pages in slab, you are guaranteed to be
able to quickly find a cluster of pages to move which will free up a
contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
the current slab implementation.
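
To make the selection step concrete, here is a rough, untested sketch; 
every name in it is invented for illustration and nothing like it exists 
in the tree today:

	/* "o0_slab": one 2^MAX_ORDER block managed as a slab of order-0
	 * pages, with a count of how many are still in use */
	struct o0_slab {
		struct list_head list;		/* on the cache's slab list */
		int inuse;			/* live order-0 pages */
		struct page *pages[1 << MAX_ORDER];
	};

	/* pick the emptiest of the first max_scan slabs: fewest pages
	 * to reclaim or copy before the block can be freed */
	static struct o0_slab *pick_victim(struct list_head *slabs, int max_scan)
	{
		struct o0_slab *s, *victim = NULL;
		int scanned = 0;

		list_for_each_entry(s, slabs, list) {
			if (scanned++ >= max_scan)
				break;
			if (!victim || s->inuse < victim->inuse)
				victim = s;
		}
		return victim;
	}

The defragger would then walk the victim's pages, trying reclaim first 
and falling back to copying into another slab and fixing the mappings 
through rmap, before handing the emptied 2^MAX_ORDER block back to the 
buddy allocator.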

For kernel pages, it would get a lot more complicated, but at least the
kernel pages would be clustered together in their own order0 cache.

-- 
Mel Gorman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-06-28 21:54                               ` Daniel Phillips
@ 2003-06-29 22:07                                 ` William Lee Irwin III
  2003-06-28 23:18                                   ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: William Lee Irwin III @ 2003-06-29 22:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Mel Gorman, Martin J. Bligh, linux-kernel, linux-mm

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> That's where your defragger would need to kick in. The defragger would scan
>> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
>> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
>> slab with the fewest objects in it and either:
>> a) Reclaim the pages by swapping them out or whatever
>> b) Copy the pages to another slab and update the page tables via rmap
>> Once the pages are copied from the selected slab, destroy it and you have
>> a large block of free pages.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> Yes, now we're on the same page, so to speak.  Personally, I don't have much 
> interest in working on improving the allocator per se.  I'd love to see 
> somebody else take a run at that, and I will occupy myself with the gritty 
> details of how to move pages without making the system crater.

This sounds like it's behind dependent on physically scanning slabs,
since one must choose slab pages for replacement on the basis of their
potential to restore contiguity, not merely "dump whatever's replaceable
and check how much got freed".


On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> The point which I am getting across, quite
>> badly, is that by having order0 pages in slab, you are guaranteed to be
>> able to quickly find a cluster of pages to move which will free up a
>> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
>> the current slab implementation.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> I don't see that it's guaranteed, but I do see that organizing pages in 
> slab-like chunks is a good thing to do - a close reading of my original 
> response to you shows I was thinking about that.
> I also don't see that the slab cache in its current incarnation is the right 
> tool for the job.  It handles things that we just don't need to handle, such 
> as objects of arbitrary size and alignment.  I'm sure you could make it work, 
> but why not just tweak alloc_pages to know about chunks instead?

A larger question in my mind is what slabs have to do with this at all.
Slabs are for fixed-size allocations. These techniques are explicitly
oriented toward making user allocations more variable, not less.


-- wli

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-06-27 13:00         ` Mel Gorman
                             ` (2 preceding siblings ...)
  2003-06-27 14:43           ` Martin J. Bligh
@ 2003-07-02 21:10           ` Mike Fedyk
  2003-07-03  2:04             ` Larry McVoy
  3 siblings, 1 reply; 30+ messages in thread
From: Mike Fedyk @ 2003-07-02 21:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Daniel Phillips, linux-kernel, linux-mm

On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> You're right, I will need to write a proper RFC one way or the other. I
> was thinking of using slabs because that way there wouldn't be need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

Mel,

This sounds much like something I was reading from Larry McVoy about using
page objects (like one level higher in magnitude than pages).

I don't remember the URL, but there was something pretty extensive from
Larry already explaining the concept.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-07-02 21:10           ` Mike Fedyk
@ 2003-07-03  2:04             ` Larry McVoy
  2003-07-03  2:20               ` William Lee Irwin III
  0 siblings, 1 reply; 30+ messages in thread
From: Larry McVoy @ 2003-07-03  2:04 UTC (permalink / raw)
  To: Mel Gorman, Daniel Phillips, linux-kernel, linux-mm

On Wed, Jul 02, 2003 at 02:10:55PM -0700, Mike Fedyk wrote:
> On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> > You're right, I will need to write a proper RFC one way or the other. I
> > was thinking of using slabs because that way there wouldn't be need to
> > scan all of mem_map, just a small number of slabs. I have no basis for
> > this other than hand waving gestures though.
> 
> Mel,
> 
> This sounds much like something I was reading from Larry McVoy about using
> page objects (like one level higher in magnitude than pages).
> 
> I don't remember the URL, but there was something pretty extensive from
> Larry already explaining the concept.

If we're thinking about the same thing, the basic idea was to store
information into a higher level object and make more intelligent paging
decisions based on the higher level object.  In my brain, since I'm a
SunOS guy, that means that you store information in the vnode (inode)
which reflects the status of all pages backed by this inode.

Instead of trying to figure out what to do at the page level, you figure
out what to do at the object level.  
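
To illustrate the shape of the idea (this isn't SunOS code or a proposal 
for Linux, just an invented example), the per-object summary need be no 
more than:

	/* aggregate footprint kept in the vnode/inode, so the pageout
	 * daemon can pick whole objects to shrink instead of discovering
	 * their pages one at a time */
	struct object_pageinfo {
		unsigned long resident;		/* pages of this object in core */
		unsigned long referenced;	/* pages touched since last scan */
		unsigned long last_scan;	/* when the object was last visited */
	};

	/* a good victim is "lots resident, little referenced"; once
	 * chosen, all of its pages can be pushed out in one pass */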

Some postings about this:

http://groups.google.com/groups?q=topvn+mcvoy&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3cgeu9%24h96%40fido.asd.sgi.com&rnum=1

http://groups.google.com/groups?q=vnode+mcvoy&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=l0ojgnINN59t%40appserv.Eng.Sun.COM&rnum=12

I can't find the writeup that you are thinking about.  I know what you mean,
there was a discussion of paging algs and I went off about how scanning a
page at a time is insane.  If someone finds the URL let me know.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] My research agenda for 2.7
  2003-07-03  2:04             ` Larry McVoy
@ 2003-07-03  2:20               ` William Lee Irwin III
  0 siblings, 0 replies; 30+ messages in thread
From: William Lee Irwin III @ 2003-07-03  2:20 UTC (permalink / raw)
  To: Larry McVoy, Mel Gorman, Daniel Phillips, linux-kernel, linux-mm

On Wed, Jul 02, 2003 at 07:04:45PM -0700, Larry McVoy wrote:
> If we're thinking about the same thing, the basic idea was to store
> information into a higher level object and make more intelligent paging
> decisions based on the higher level object.  In my brain, since I'm a
> SunOS guy, that means that you store information in the vnode (inode)
> which reflects the status of all pages backed by this inode.
> Instead of trying to figure out what to do at the page level, you figure
> out what to do at the object level.  
> Some postings about this:
> http://groups.google.com/groups?q=topvn+mcvoy&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3cgeu9%24h96%40fido.asd.sgi.com&rnum=1
> http://groups.google.com/groups?q=vnode+mcvoy&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=l0ojgnINN59t%40appserv.Eng.Sun.COM&rnum=12
> I can't find the writeup that you are thinking about.  I know what you mean,
> there was a discussion of paging algs and I went off about how scanning a
> page at a time is insane.  If someone finds the URL let me know.

I believe people are already on file object local page replacement,
though it's more in the planning than implementation phase.


-- wli

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2003-07-03  2:06 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-06-24 23:11 [RFC] My research agenda for 2.7 Daniel Phillips
2003-06-25  0:47 ` William Lee Irwin III
2003-06-25  1:07   ` Daniel Phillips
2003-06-25  1:10     ` William Lee Irwin III
2003-06-25  1:25       ` Daniel Phillips
2003-06-25  1:30         ` William Lee Irwin III
2003-06-25  9:29 ` Mel Gorman
2003-06-26 19:00   ` Daniel Phillips
2003-06-26 20:01     ` Mel Gorman
2003-06-26 20:10       ` Andrew Morton
2003-06-27  0:30       ` Daniel Phillips
2003-06-27 13:00         ` Mel Gorman
2003-06-27 14:38           ` Martin J. Bligh
2003-06-27 14:41           ` Daniel Phillips
2003-06-27 14:43           ` Martin J. Bligh
2003-06-27 14:54             ` Daniel Phillips
2003-06-27 15:04               ` Martin J. Bligh
2003-06-27 15:17                 ` Daniel Phillips
2003-06-27 15:22                   ` Mel Gorman
2003-06-27 15:50                     ` Daniel Phillips
2003-06-27 16:00                       ` Daniel Phillips
2003-06-29 19:25                         ` Mel Gorman
2003-06-28 21:06                           ` Daniel Phillips
2003-06-29 21:26                             ` Mel Gorman
2003-06-28 21:54                               ` Daniel Phillips
2003-06-29 22:07                                 ` William Lee Irwin III
2003-06-28 23:18                                   ` Daniel Phillips
2003-07-02 21:10           ` Mike Fedyk
2003-07-03  2:04             ` Larry McVoy
2003-07-03  2:20               ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).